Redis PSYNC2 bug post mortem

antirez 2428 days ago. 137116 views.

Four days ago a user posted a critical issue in the Redis Github repository. The problem was related to the new Redis 4.0 PSYNC2 replication protocol, and was very critical. PSYNC2 brings a number of good things to Redis replication, including the ability to resynchronize just exchanging the differences, and not the whole data set, after a failover, and even after a slave controlled restart. The problem was about this latter feature: with PSYNC2 the RDB file is augmented with replication information. After a slave is restarted, the replication metadata is loaded back, and the slave is able to perform a PSYNC attempt, trying to handshake with the master and receive the differences since the last disconnection.

All this is good news from the point of view of Redis operations, however while PSYNC2 was pretty solid since the introduction in Redis 4.0.0 stable, the feature involving a restarting slave was definitely lacking reliability. There were two problems with this feature: the first being that it was a last-minute addition to PSYNC2, and was not part of the original design document. It was more like an obvious extension of the work we did in PSYNC2, but was not scrutinized at the same level of the rest of the specification for potential bugs and issues. The second problem was due to the fact that the feature is more complex than it looks like initially, for the fact that it’s tricky to really restore *all* the state the replication has in the slave side after a restart. Moreover, failing to restore certain bits of the state, does not result in evident bugs most of the times, so they are hard to spot via integration testings. Only once specific conditions happen the lack of some state will result in problems. For instance failing to correctly reconstruct the currently selected DB in the slave replication state, will create problems only when there are writes happening in different Redis DBs, and such bug would not store correctly the currently selected DB only under special conditions.

Thanks to the help of many Redis contributors, but especially thanks to the @soloestoy Github user, a Redis developer at Alibaba, recently we worked at improving a number of PSYNC2 potential issues. All the work done would finally end inside Redis 4.0.3 in a few days, after much testing of all the patches wrote so far. However once I received issue #4483, I understood I had to rush the release of Redis 4.0.3 because this was far more serious than the other Redis 4 PSYNC2 issues that we had discovered up to that point.

The bug described in issue #4483 was fairly simple, yet very problematic. After a slave restart and a resulting reloading of the replication state from the RDB, the master replication backlog could contain Lua scripts executions in the form of EVALSHA commands. However the slave Lua scripting engine is flushed of all the scripts after the restart, so is not able to process such commands. This results in the slave not processing writes originated by Lua scripts, unless such scripts were using “commands replication”, which is not the default. The default is to replicate the script itself.

I went a bit in panic mode… and wrote a few alternative patches that I submitted to the issue. At the end the one that did not require RDB incompatibilities with older versions was picked, so I went ahead and released Redis 4.0.3 ASAP. However I did a mistake… It’s a few weeks that I work from an office instead of working from home. Normally I do not work at night, but for 4.0.3, when I returned back home, I opened my home laptop and merged the patch into the 4.0 branch, and did some testing activity. The next day when I returned to the office, I continued working from another computer, but I did not realize that I was missing one commit that I merged at home, without pushing it to the repository.

So I basically released a 4.0.3 that had all the PSYNC2 fixes minus the most important one for the replication bug. I released ASAP a new patch level version, Redis 4.0.4, including the replication fix. This was already not great: upgrading Redis is a scheduled activity, and nobody wants to upgrade two times because I make errors in preparing the release… But the worst was yet to happen. The fix that I added in 4.0.4, the one about scripts replication in restarting slaves performing PSYNC2, had an error that passed totally unnoticed through all the replication integration tests: the fix involved storing the Lua scripts in the slave memory directly into the RDB, in order to reload them later. However it was not considered that the function loading the scripts, would assert if the script was already in memory, so when a slave received a full synchronization from a master, it started to load the RDB file, crashing immediately because of some duplicated script that was already in memory. A user reported it ASAP via Twitter, and I fixed the problem in less than 45 minutes, from the reporting moment to the 4.0.5 availability, but this does not mitigate all the potential problems that this sequence of failures at delivering a fix caused.

The above problems were caused by a number of reasons:

1. Redis 4.0 was delivered as stable too early, considering the complexity and the amount of code change in PSYNC2. The next time, even at the cost of delaying releases, I’ll wait more and will take the new major versions of Redis more time in the last release candidate, so that we can find such bugs before shipping a GA version.

2. I rushed too much trying to deliver a fix for issue #4483. Even if the bug was critical, it was better to spend some time in order to check what was happening, and the potential problems caused by the fix itself. Moreover I was so much in a hurry that I did not check the set of commits composing 4.0.3 with enough care, causing the lack of one of the fixes, and the need for a new release.

3. While Redis 4.0 has very strict PSYNC2 unit and integration tests, including tests that simulate continuous failovers checking data consistency at every step, because of this bug I discovered that the tests never try to mix PSYNC2 with replication of Lua scripts. This must be improved. Also the PSYNC2 replication after slave RDB restart must be enhanced as well.

I’ll start with the improvements at “3”, and in the future I’ll consider the lessons at “1” and “2” before every new release. My sincere apologize to everybody that had to deal with my errors in this instance, and a big thank you to @soloestoy for an incredible amount of help and support.