Twilio incident and Redis

antirez 4021 days ago. 262315 views.

Twilio just released a post mortem about an incident that caused issues with the billing system:

http://www.twilio.com/blog/2013/07/billing-incident-post-mortem.html

The problem was about a Redis server, since Twilio is using Redis to store the in-flight account balances, in a master-slaves setup, with multiple slaves in different data centers for obvious availability and data safety concerns.

This is a short analysis of the incident, what Twilio can do and what Redis can do to avoid this kind of issues.

The first observation is that Twilio uses Redis, an in memory system, in order to save balances, so everybody will say "WTF Twilio! Are you serious with your data?". Actually Redis uses memory to serve data and to internally manipulate its data structures, but the incident has *nothing to do* with the durability of Redis as a DB. In fact Twilio stated that they are using the append only file that can be a very durable solution as explained here: http://oldblog.antirez.com/post/redis-persistence-demystified.html

The incident is actually centered around two main aspects of Redis:

1) The replication system.
2) The configuration.

I'll address they two things respectively.

Analysis of the replication issue
===

Redis 2.6 always needs a full resynchronization between a master and a slave after a connection issue between the two.
Redis 2.8 addressed this problem, but is currently a release candidate, so Twilio had no way to use the new feature called "partial resynchronization".

Apparently the master became unavailable because many slaves tried to resynchronize at the same time.

Actually for the way Redis works a single slave or multiple slaves trying to resynchronize should not make a huge difference, since just a single RDB is created. As soon as the second slave attaches and there is already a background save in progress in order to create the first RDB (used for the bulk data transfer), it is put in a queue with the previous slave, and so forth for all the other slaves attaching. Redis will just produce a single RDB file.

However what is true is that Redis may use additional memory with many slaves attaching at the same time, since there are multiple output buffers to "record" to transfer when the RDB file is ready. This is true especially in the case of replication over WAN. In the Twilio blog post I read "multiple data centers" so it is possible that the replication process may be slow in some case.

The bottom line is, Redis normally does not need to go slow when multiple slaves are resynchronizing at the same time, unless something strange happens like hitting the memory limit of the server, with the master starting to swap and/or problems with very slow disks (probably EC2?) so that creating an RDB starts to mess with the ability to write to the AOF file.

However issues writing to the AOF are a bit unlikely to be the cause, since during the AOF rewrite there is the same kind of disk i/o stress, with one thread writing a lot of data to the new AOF, and the other (main) thread logging every new write to the AOF. Everything considered memory pressure seems more probable, but Twilio engineers can just comment with details about what happened, this will be an useful real-world data point for sure.

From the Twilio side, what is possible to do to minimize incidents, is to understand exactly why the master is not able, with the current architecture, to survive without serious loss of performance to many slaves resynchronizing.

From the Redis side, well, we had to do our homework and provide partial resynchronization *long time ago* probably, we finally have it in Redis 2.8, and it is very good that a few days ago I pushed forward the 2.8 release skipping all the other pending features for this release that will be postponed for the next release. Now we have the first release candidate, in a few weeks this should be a release in the hands of users.

The configuration
===

The other obvious problem, probably the biggest one, was restarting the master with the wrong configuration.

Again I think here there was an human error that was "helped" by a Redis non perfect mechanism.

Basically up to Redis 2.6 you had CONFIG SET to change the configuration by hand, so it was possible for example to switch the system from RDB to AOF for more data safety with just:

redis-cli CONFIG SET appendonly yes

However you had to change the configuration file manually in order to ensure that the change will affect the instance after the next restart. Otherwise the change is only in the current in memory configuration and a restart will bring you back to the old config.

Maybe this was not the case, but it is not unlikely that Twilio engineers modified the wrong redis.conf file or forgot to do it in some way.

Fortunately Redis 2.8 provides a better workflow for on-the-fly configuration changes, that is:

redis-cli CONFIG SET appendonly yes
redis-cli CONFIG REWRITE

Basically the config rewriting feature will make sure to change the currently used configuration file, in order to contain the configuration changes operated by CONFIG SET, which is definitely safer.

In the end
===

I'll be happy to work with the Twilio engineers in the next weeks in order to understand the details and their requests and see how Redis can be improved to make incidents like this less likely to happen.

A real world test
===

I just tried to setup a master with AOF enabled, rotating disks, and a huge write load. Only trick is, it is bare metal entry-level hardware.

Then I put a steady load on it of 70k writes per second across 10 millions of keys.

Finally I tried to mass-resync four slaves form scratch multiple times.

Results:

$ redis-cli -h 192.168.1.10 --latency-history
min: 0, max: 26, avg: 0.97 (1254 samples) -- 15.00 seconds range
min: 0, max: 5, avg: 0.66 (1287 samples) -- 15.00 seconds range
min: 0, max: 2, avg: 0.62 (1290 samples) -- 15.00 seconds range
min: 0, max: 1, avg: 0.47 (1307 samples) -- 15.01 seconds range
min: 0, max: 10, avg: 0.48 (1306 samples) -- 15.00 seconds range
min: 0, max: 1, avg: 0.47 (1310 samples) -- 15.01 seconds range
min: 0, max: 3, avg: 0.45 (1311 samples) -- 15.01 seconds range
min: 0, max: 10, avg: 0.48 (1305 samples) -- 15.01 seconds range
min: 0, max: 23, avg: 0.49 (1306 samples) -- 15.01 seconds range
min: 0, max: 3, avg: 0.47 (1307 samples) -- 15.01 seconds range
min: 0, max: 36, avg: 0.86 (1255 samples) -- 15.00 seconds range
min: 0, max: 6, avg: 1.05 (1246 samples) -- 15.01 seconds range
min: 0, max: 21, avg: 0.52 (619 samples)^C

As you can see there is no moment in which the server struggles with this load. During the test the load continued to be accepted at the rate of 70k writes/sec.

This test is in no way able to simulate the Twilio architecture, but the bottom line here is, Redis is supposed to handle this well with minimally capable hardware so something odd happened, or there was a low memory condition, or there was the "EC2 effect", that is, some very poor disk performance allowed for memory pressure.