Redis with an SSD swap, not what you want

antirez

Hello! As promised, today I did some SSD testing.

The setup: a Linux box with 24 GB of RAM and two disks.

A) A spinning disk.
B) An SSD (Intel 320 series).

The idea is to see what happens if I set up the SSD partition as swap and fill Redis with a dataset larger than RAM.
I have wanted to run this test for a long time, especially now that Redis is focused only on RAM and I have abandoned the idea of targeting disk, for a number of reasons.

I already guessed that the SSD swap setup would perform badly, but I was not expecting it to be *so bad*.

Before testing this setup, let's start by testing Redis in memory on the same box with a 10 GB data set.

IN MEMORY TEST
===

To start, I filled the instance with:

./redis-benchmark -r 1000000000 -n 1000000000 -P 32 set key:rand:000000000000 foo

The write load generated this way is very high: more than half a million SET commands per second, processed using a single core:

instantaneous_ops_per_sec:629782

This is possible because we are using a pipeline of 32 commands at a time (see -P 32), which limits the number of system calls involved in processing the commands, as well as the network latency component.
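To make the pipelining point concrete, here is a minimal Python sketch of what a pipelined batch looks like on the wire: 32 SET commands encoded in the Redis protocol (RESP) and concatenated into a single buffer, so a client can send them with one write and then read the 32 replies, instead of paying a system call and a network round trip per command. The helper names are hypothetical, and the `key:rand:` plus 12-digit random number key format is an assumption modeled on the benchmark's key placeholder; redis-benchmark does all of this internally.

```python
import random


def encode_command(*args: str) -> bytes:
    """Encode one command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def build_pipeline(n_commands: int, keyspace: int, rng: random.Random) -> bytes:
    """Concatenate n SET commands with random 12-digit keys into one buffer,
    so the whole batch can be sent with a single write() call."""
    batch = []
    for _ in range(n_commands):
        # Hypothetical key format: fixed prefix + zero-padded random number.
        key = "key:rand:%012d" % rng.randrange(keyspace)
        batch.append(encode_command("SET", key, "foo"))
    return b"".join(batch)


if __name__ == "__main__":
    payload = build_pipeline(32, 1_000_000_000, random.Random(42))
    print(len(payload), "bytes for 32 commands in one write")
```

Each SET with a 21-byte key encodes to 50 bytes, so the whole batch is 1600 bytes, which easily fits in a single write() call.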

After a few minutes Redis reached 10 GB of used memory, so I tried to save the DB while still sending the same write load to the server, to see how much additional memory copy-on-write would use under such stress conditions:

[31930] 07 Mar 12:06:48.682 * RDB: 6991 MB of memory used by copy-on-write

Almost 7 GB of additional memory used, that is, about 70% more memory.
Note that this is an interesting value, since it is exactly the worst-case scenario you can get with Redis:

1) Peak load of more than 0.6 million writes per second.
2) Writes are completely distributed across the data set: there is no working set in this case, the whole DB is the working set.

But given the enormous copy-on-write pressure exerted by this workload, what is the write performance while the system is saving? To find out, I started a BGSAVE and at the same time started the benchmark again:

$ redis-cli bgsave; ./redis-benchmark -r 1000000000 -n 1000000000 -P 32 set key:rand:000000000000 foo
Background saving started
^Ct key:rand:000000000000 foo: 251470.34

250k ops/sec was the lowest number I was able to get: once copy-on-write starts, there is less and less of it happening every second (more and more pages have already been copied), and the benchmark soon returns to 0.6 million ops per second.
The number of keys here was on the order of 100 million.
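Why copy-on-write pressure fades can be seen with a simple model. This is a minimal sketch, assuming every write dirties exactly one uniformly random 4 KB page of the dataset (a simplification: a real SET touches several allocations) and that only the first write to a page after fork() copies it. After k writes over P pages, the expected copied fraction is 1 - (1 - 1/P)^k, so the rate of new copies decays roughly exponentially. The function names and the numbers in the example run are illustrative assumptions, not measurements.

```python
def copied_fraction(writes: float, pages: int) -> float:
    """Expected fraction of pages already duplicated by copy-on-write after
    `writes` uniformly random single-page writes into `pages` shared pages."""
    return 1.0 - (1.0 - 1.0 / pages) ** writes


def copies_per_second(t: float, write_rate: float, pages: int) -> float:
    """Expected rate of *new* page copies t seconds after fork(): only writes
    hitting a not-yet-copied page trigger a copy."""
    return write_rate * (1.0 - copied_fraction(write_rate * t, pages))


if __name__ == "__main__":
    pages = (10 * 1024**3) // 4096  # 10 GB of 4 KB pages (assumption)
    rate = 600_000                  # writes/sec, roughly the peak seen above
    for t in (0, 1, 5, 10, 30):
        print(f"t={t:>2}s copied={copied_fraction(rate * t, pages):6.1%} "
              f"new copies/s={copies_per_second(t, rate, pages):,.0f}")
```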

Basically, the result of this test is that with real hardware, persisting to a normal spinning disk, Redis performs very well as long as you have enough RAM for your data and for the additional memory used while saving. No big news so far.
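The sizing rule above can be written as a one-line check. This is a rough sketch that assumes peak memory during a save is simply the dataset plus the copy-on-write overhead, ignoring allocator fragmentation, client buffers and the rest of the OS. The 6991 MB / 10 GB ratio (about 68%, taking 10 GB as 10240 MB) is the worst case measured above for this specific workload, not a general constant; the function names are hypothetical.

```python
def ram_needed_gb(dataset_gb: float, cow_fraction: float) -> float:
    """Peak RAM during BGSAVE: the dataset plus the pages the parent
    duplicates via copy-on-write while the child writes the RDB file."""
    return dataset_gb * (1.0 + cow_fraction)


def fits_in_ram(dataset_gb: float, cow_fraction: float, ram_gb: float) -> bool:
    """True if the estimated peak fits in physical RAM (no swapping)."""
    return ram_needed_gb(dataset_gb, cow_fraction) <= ram_gb


if __name__ == "__main__":
    # Worst case measured above: 6991 MB copied for a 10 GB dataset.
    worst_cow = 6991 / (10 * 1024)
    print(f"worst-case COW overhead: {worst_cow:.0%}")
    for dataset in (10, 14, 20):
        print(dataset, "GB ->", round(ram_needed_gb(dataset, worst_cow), 1),
              "GB peak; fits in 24 GB:", fits_in_ram(dataset, worst_cow, 24))
```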

SSD SWAP TEST
===

For the SSD test we still use the spinning disk attached to the system for persistence, so that the SSD works only as a swap partition.

To fill the instance even more, I simply started redis-benchmark again with the same command line: with these parameters, if it ran forever, it would set 1 billion keys, which is enough :-)

Since the instance has 24 GB of physical RAM, for the test to be meaningful I wanted to add enough data to reach 50 GB of used memory. To speed up filling the instance, I temporarily disabled persistence using:

CONFIG SET SAVE ""

While filling the instance, at some point I started a BGSAVE to force some more swapping.
When the BGSAVE finished, I started the benchmark again:

$ ./redis-benchmark -r 1000000000 -n 1000000000 -P 32 set key:rand:000000000000 foo
^Ct key:rand:000000000000 foo: 1034.16

As you can see, the results were very bad initially; probably the main hash table ended up swapped out. After some time it started to perform decently again:

$ ./redis-benchmark -r 1000000000 -n 1000000000 -P 32 set key:rand:000000000000 foo
^Ct key:rand:000000000000 foo: 116057.11

I was able to stop and restart the benchmark multiple times and still get decent performance on restarts, as long as I was not saving at the same time. However, performance continued to be very erratic, jumping between 200k and 50k sets per second.

… and after 10 minutes …

Memory usage only went from 23 GB to 24 GB, with 2 GB of the data set swapped to disk.

As soon as a few GB were swapped, performance became simply too poor to be acceptable.

I then tried with reads:

$ ./redis-benchmark -r 1000000000 -n 1000000000 -P 32 get key:rand:000000000000
^Ct key:rand:000000000000 foo: 28934.12

Same issue: about 30k ops per second for both GET and SET, with *a lot* of swap activity at the same time.
What's worse, the system as a whole was pretty unresponsive at this point.

At this point I stopped the test: the system was slow enough that filling it further would take a lot of time, and as more data was swapped, performance got worse.

WHAT HAPPENS?
===

What happens is simple: Redis is designed to work in an environment where random memory access is very fast.
Hash tables, and the way Redis objects are allocated, are all based on this assumption.
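To see why hash tables and swap do not mix, here is a minimal sketch of where lookups land in a hash table's bucket array. It assumes a power-of-two table of 8-byte bucket pointers (2^27 buckets, enough for roughly 100 million keys) and 4 KB pages, and it uses blake2b as a stand-in for Redis's actual hash function. Even 1000 keys with consecutive names map to nearly 1000 different pages; a real lookup then follows pointers to the entry, key and value, which are typically separate allocations on yet other pages.

```python
import hashlib

PAGE_SIZE = 4096  # bytes per OS page (typical on x86-64 Linux)
BUCKET_SIZE = 8   # one pointer per bucket on a 64-bit build (assumption)


def bucket_page(key: str, table_size: int) -> int:
    """Return which 4 KB page of the bucket array holds this key's bucket.
    blake2b is a stand-in, not Redis's own hash function."""
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    h = int.from_bytes(digest, "little")
    bucket = h & (table_size - 1)  # power-of-two table: mask the hash
    return (bucket * BUCKET_SIZE) // PAGE_SIZE


def distinct_pages(keys, table_size: int) -> int:
    """How many different bucket-array pages a batch of lookups touches."""
    return len({bucket_page(k, table_size) for k in keys})


if __name__ == "__main__":
    table_size = 2 ** 27  # ~134M buckets, enough for ~100M keys
    keys = [f"key:rand:{i:012d}" for i in range(1000)]  # adjacent key names
    print(distinct_pages(keys, table_size), "distinct pages for 1000 lookups")
```

If most of those pages are on swap, almost every lookup is a page fault, no matter how "close" the key names look.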

Now let's take a look at the SSD 320 disk specifications:

Random write (100% Span) -> 400 IOPS
Random write (8GB Span) -> 23000 IOPS

Basically, what happens is that at some point Redis forces the OS to move memory pages between RAM and swap at *every* operation performed, since we are accessing keys at random and there are no more spare pages.
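The effect of these IOPS figures on throughput can be estimated with a back-of-the-envelope model. This is a minimal sketch under several assumptions: keys are accessed uniformly at random, so the fraction of operations that hit a swapped page equals the swapped fraction of the dataset; each fault costs exactly one random I/O at the quoted write IOPS; all 24 GB of RAM are available to Redis; and the single Redis thread waits for every fault, so pipelining cannot hide it. Under those assumptions, throughput is bounded by IOPS divided by faults per operation. The function names are hypothetical.

```python
def max_ops_per_sec(iops: float, faults_per_op: float) -> float:
    """Upper bound on single-threaded throughput when every page fault
    must wait for one random I/O on the swap device."""
    if faults_per_op <= 0:
        return float("inf")  # everything resident: no swap-imposed bound
    return iops / faults_per_op


def faults_per_op_uniform(resident_gb: float, dataset_gb: float,
                          pages_per_op: int = 1) -> float:
    """With uniformly random keys, the chance that a touched page is on swap
    is the swapped fraction of the dataset; each op touches pages_per_op pages."""
    swapped = max(0.0, 1.0 - resident_gb / dataset_gb)
    return swapped * pages_per_op


if __name__ == "__main__":
    for label, iops in (("100% span", 400), ("8GB span", 23_000)):
        for dataset in (26, 50):
            f = faults_per_op_uniform(resident_gb=24, dataset_gb=dataset)
            print(f"{label:>9}, {dataset} GB dataset: "
                  f"<= {max_ops_per_sec(iops, f):,.0f} ops/s")
```

A real fault can cost more than one I/O (a victim page may have to be written out before the wanted page is read in), and a single operation usually touches several pages, so these numbers are optimistic ceilings.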

CONCLUSION
===

Redis is completely useless in this setup. Systems designed to work in this kind of setup, like Twitter's fatcache or the recently announced Facebook McDipper, need to be SSD-aware, and can probably only work reasonably well when a simple GET/SET/DEL model is used.

I also expect that the pathological case for these systems, that is, evenly distributed writes with a big span, is not going to perform well because of current SSD limits, but that's exactly the case Redis is trying to solve for most users.

The freedom Redis gets from using memory allows us to serve much more complex tasks at very good peak performance, with minimal system complexity and few underlying assumptions.

TL;DR: the outcome of this test was expected and Redis is an in-memory system :-)