Linux RAID-1 with write-mostly vs. bcache for AWS setup with persistence and SSD caching

You recommend using bcache to get SSD performance and EBS persistence in AWS.

But there is another way: Linux RAID-1 with the EBS volume marked as --write-mostly. In this setup, write operations go to both the SSD and EBS, while all reads go to the SSD as long as it is available.
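Roughly, the setup looks like this (the device names /dev/xvdb for the ephemeral SSD and /dev/xvdf for the EBS volume are just examples):

```
# Create a RAID-1 mirror where the EBS member is flagged write-mostly, so md
# prefers the SSD for reads. /dev/xvdb = ephemeral SSD, /dev/xvdf = EBS volume
# (names are examples only).
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/xvdb --write-mostly /dev/xvdf
```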

Why don't you describe this approach? Does it have any disadvantages? Our quick test shows that it is better than bcache, because with bcache we often have un-cached data.

Note that the Linux kernel has a bug in the write-mostly implementation; it was fixed in linux-3.19.1.

Hi,

> with bcache we often have un-cached data.

Are you using an EBS volume of the same size as the ephemeral device? Please note that the Aerospike docs recommend that the caching device (ephemeral disk) be the same size as the backing device (EBS) for this particular reason, to avoid cache misses.
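For reference, the pairing the docs describe would look roughly like this (device names are examples, bcache-tools must be installed, and the cache-set UUID comes from `bcache-super-show`):

```
# EBS as the backing device, the same-size ephemeral SSD as the cache device.
make-bcache -B /dev/xvdf        # backing device (EBS)
make-bcache -C /dev/xvdb        # cache device (ephemeral SSD)

# Attach the cache set to the backing device and enable writeback caching.
CSET_UUID=$(bcache-super-show /dev/xvdb | awk '/cset.uuid/ {print $2}')
echo "$CSET_UUID" > /sys/block/bcache0/bcache/attach
echo writeback > /sys/block/bcache0/bcache/cache_mode
```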

As for bcache vs. RAID, bcache is the suggested solution because it is easier to set up. If you find the RAID solution better for your use case and performance requirements, you could share the experience with the community.

In bare-metal scenarios, the overhead of a RAID card/setup is often detrimental to performance, and hence it's not usually suggested.

> Are you using an EBS volume of the same size as the ephemeral device? Please note that the Aerospike docs recommend that the caching device (ephemeral disk) be the same size as the backing device (EBS) for this particular reason, to avoid cache misses.

Yes. We used disks of the same size, and we still had cache misses. They went away only when the backing device was at ~75% of the cache device's capacity. As I understand it, this is because of bcache overhead: bcache maintains a B+ tree index and metadata and stores them on the cache device. So I do not think it is even theoretically possible to avoid cache misses using a same-size cache.
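For reference, bcache reports that overhead through sysfs; a rough way to check it (the cache-set UUID and cache0 path are placeholders for your setup):

```
# priority_stats includes the percentage of the cache device consumed by
# bcache metadata and the percentage that is still unused. Replace the UUID
# with your cache set's UUID.
cat /sys/fs/bcache/<cache-set-uuid>/cache0/priority_stats
```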

Ah, good point. I will let Aerospike test this and update their docs (cc @meher).

But please note that the suggested high-water mark (hwm) for disk is 50%-60%. That should keep you within the available bcache space. At no point, from Aerospike's perspective, would you be using 100% of your actual storage, and if you are, you will have problems depending on the avail_pct.

At avail_pct <= 5, stop-writes will trigger, and at avail_pct 0 you will have to remove some data to get the disk back in play.
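You can keep an eye on this with the namespace stats; something like the following (the namespace name "test" is a placeholder, and the exact statistic names vary a bit across server versions):

```
# Dump the namespace storage statistics and pull out the avail/used percentages.
asinfo -v 'namespace/test' | tr ';' '\n' | grep -E 'avail|free-pct|used'
```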

Coming back to bcache, how are you testing the bcache cache hits? Are you using Aerospike, or some other generic write-and-read test method?

Initially we ran tests with Aerospike. In one test we had 4 Aerospike nodes with 80% of the disk filled (bcache with a same-size backing disk). We put the Aerospike nodes under heavy read load and added a 5th node. The goal was to see how Aerospike behaves during re-balancing. We found significant performance degradation, plus read operations from EBS. The latter was a surprise for us, because we expected complete caching. Before re-balancing we did not see any read operations from EBS, but we were querying just a subset of the data, and that subset was completely cached. It looks like re-balancing starts reading the whole data set and produces read operations on EBS.

So after that, we did a deeper investigation of bcache without Aerospike. We found that we are able to completely avoid reads from the backing device only when it is at 75% of the cache device's capacity. Even in that case, to cache the whole data set we need to read all of the data several times (3-4). Each pass reduces the number of backing-device reads, and on the 4th-5th pass they are completely gone.
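For anyone repeating this kind of test, the sysfs counters are the simplest way to watch hits versus misses between passes (bcache0 is an assumed device name):

```
# Cumulative hit/miss counters plus the recent hit ratio for the cached device.
grep -H "" /sys/block/bcache0/bcache/stats_total/cache_hits \
           /sys/block/bcache0/bcache/stats_total/cache_misses \
           /sys/block/bcache0/bcache/stats_five_minute/cache_hit_ratio
```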

RAID-1 promises to resolve all of these problems. It would not be a cache but a complete, plain mirror. The md kernel driver can be configured to read from EBS only if the SSD is unavailable; otherwise all read operations go to the SSD, which is good. Unfortunately, all modern kernels have a bug in this functionality. Only the latest kernel has the fix, so we need to wait until the fix is available in Ubuntu (which we use in production).
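Once a fixed kernel is in place, the routing is easy to verify (device names are examples): the write-mostly member shows a (W) flag in /proc/mdstat, and under a read load iostat should report reads only on the SSD while writes hit both members.

```
# Check the write-mostly flag on the EBS member and watch per-device I/O.
grep -A 2 '^md0' /proc/mdstat
iostat -dxk xvdb xvdf 5
```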

Stop-writes is triggered at avail_pct <= 5, but the legacy issues @anshu describes about running avail_pct down to 0 have been mitigated since at least 3.3.5. The behavior from 3.3.5 to 3.5.7 was to allocate 8 wblocks that are only accessible during the cold/fast startup phase and cannot be used while the service is live. This enabled recovery from avail_pct 0 with a server process restart and no further action from the user.

As of 3.5.8, Aerospike reserves 4 wblocks that only defrag can use and 4 that only restart can use, so in the latest versions restarting the server is no longer required.

KVS [AER-3125] - Run-time recovery option for out-of-storage situations. On increasing defrag-lwm-pct, queue all newly eligible wblocks for defrag. Also, allow defrag to use reserve wblocks when other writer threads can’t.
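In practice that recovery would look something like this (the namespace name "test" and the value 60 are placeholders, not recommendations):

```
# Raise defrag-lwm-pct at runtime so more wblocks become eligible for defrag.
asinfo -v 'set-config:context=namespace;id=test;defrag-lwm-pct=60'
```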

We have actually found some issues with the bcache solution, similar to what is being pointed out. We are still investigating the best way to get this enabled, as well as alternate solutions, and will update when possible. Thanks for your input regarding the RAID and write-mostly option; we will definitely pass it on internally.

Could you please confirm that (KVS) [AER-3557] (Add device shadowing functionality, for persistence on network devices in addition to ephemeral storage) in Aerospike 3.5.12 (May 28, 2015) is the suggested alternative to bcache (in cloud environments)?

How does its performance compare to the bcache-based solution?

Also, isn't it possible that the bcache bug is already fixed (just not yet present in all distros)?

Thanks for your question. Yes, this is a suggested alternative to bcache.
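For reference, the shadow-device setup is configured roughly as in the excerpt below (device paths and block size are examples, not from this thread; please check the 3.5.12 documentation for the exact syntax):

```
# Illustrative aerospike.conf excerpt: the EBS shadow device is listed after
# the primary ephemeral SSD on the same "device" line of the namespace's
# storage-engine stanza.
storage-engine device {
    device /dev/xvdb /dev/xvdf   # primary ephemeral SSD, then EBS shadow
    write-block-size 128K
}
```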

From my understanding, the bcache issues were not limited to the bug (which, as you point out, is likely fixed in some distros); bcache was also letting some transactions hit the underlying storage (EBS), regardless of how much the primary device (attached ephemeral SSD) was oversized.

We are still doing some performance tests. I would expect performance to be quasi-identical to when only the ephemeral SSD is used. It would of course depend on the workload; in general, we focus on read-heavy workloads.

In any case, we will make sure to publish benchmark details for different workloads.

–meher