Using a filesystem for better caching via VFS

Has anyone considered using a filesystem so that the VFS works as a cache layer? Let's imagine the dataset is 500GB and RAM is 128GB. As I understand it, right now, with raw block devices, we don't use that memory at all (except for some metadata, of course). The data-in-memory option says it will keep a copy of all data in memory always, and there isn't enough memory for the dataset. I suppose for read-intensive workloads this may bring some gains. I would like to hear your opinions.
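
To make the question concrete, here is a rough sketch of the two namespace configurations being contrasted; the device path, file path and sizes are placeholders I made up for illustration, not taken from a real setup:

```
# Current style: raw block device, data-in-memory off,
# so most of the 128GB of RAM sits unused for reads.
namespace test {
    memory-size 16G                    # index/metadata only (placeholder value)
    storage-engine device {
        device /dev/sdb                # raw block device, Linux page cache is not used
        data-in-memory false           # "true" would require RAM >= dataset (500GB here)
    }
}

# Proposed alternative: a file on a filesystem, so the VFS page cache
# can keep the hot part of the data in the otherwise idle RAM.
namespace test {
    memory-size 16G
    storage-engine device {
        file /data/aerospike/test.dat  # placeholder path on ext4/xfs
        filesize 500G
        data-in-memory false
    }
}
```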

So you're saying to take advantage of the kernel's filesystem caching ability in order to utilize more RAM and potentially speed up Aerospike? Definitely not something I have tried… interesting, though. I'm not sure the kernel would be smart enough to cache Aerospike reads, but if it is, maybe it would have the benefits you describe. For my use case we use SSDs and there is no noticeable latency when reading from disk; maybe if you are using spinning disks it would make sense…

It's not so much about latency as about disk utilisation - right now I'm observing ~70-80% disk utilisation according to iostat. Of course, adding more drives/servers is an option, but why not try using RAM for this - it's just idling anyway…
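
For reference, this is roughly how I'm watching it (the interval and device are just examples):

```
# Per-device stats; the %util column is the one sitting at ~70-80%
iostat -x 5

# Meanwhile "buff/cache" stays small, since reads from the raw device
# don't populate the page cache - that RAM is the part that's idling
free -h
```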

Heck, try it out :slight_smile:

We did some quick tests, adding 2 more nodes to an existing ASDB cluster (ASDB = Aerospike with persistence, not in-memory storage) on two identical servers but with different storage types - 099 with a filesystem, 056 with a raw block device, as recommended. The nodes have 128GB RAM, while the dataset per node is only ~100GB. Graphs from Zabbix are below (combined into one image, as the AS forum doesn't allow uploading more than one).

The first spike is from when the nodes were introduced to the cluster.

There are several observations from this (measuring the CPU iowait value):

  1. Once the dataset is in memory, there are almost no reads - the line stays flat for 099.
  2. Once a manual cache flush was issued on 099 (via echo 3 > /proc/sys/vm/drop_caches), around 04:40 on 15.09.2017, the system started reading from the drives to refill the cache, but iowait still stayed lower than on the node without a cache (0.5% without cache vs 0.25% with cache); this suggests we have some "hot" part of the data being served from the cache, which helps (see the snippet after this list).
  3. Refilling the cache / making the line steady again took around 3.5 hours → which also means we are mostly working with subsets of the data, not everything at once.
  4. The spikes around 03:00 and 20:00 on 15.09.2017 are from stress tests - the cached node clearly wins here, showing no impact at all.

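Regarding observation 2, the manual flush and the refill that follows can be watched like this (the 60-second interval is arbitrary):

```
# Flush the page cache on the file-backed node (099)
sync && echo 3 > /proc/sys/vm/drop_caches

# Watch the "buff/cache" column grow back as reads repopulate the cache
watch -n 60 free -h
```
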
We will remove the two other nodes soon, making the dataset per node bigger, and I'll update the results.

Meanwhile, on staging, we will switch everything to the filesystem storage type to look for possible problems.

So it seems to help!? Very cool

TL;DR: We found the filesystem to be really bad when the system is running low on memory.

Details… We actually experimented for the same reason - to see if we could get a free read cache. In the past we ran tests comparing raw SSD, ext3 and ext4 filesystems. While ext3 was a bit poorer than raw SSD, ext4 was better than raw SSD (a bit inconsistent, but better). But the fun ends there!
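
For anyone repeating this kind of comparison, the file-backed side is prepared along these lines; the device, mount point, size and mount options below are my assumptions, not details from our tests:

```
# Create and mount a filesystem on the test device
mkfs.ext4 /dev/sdc
mount -o noatime /dev/sdc /mnt/aerospike

# Pre-allocate the data file that the namespace's storage-engine will point at
fallocate -l 100G /mnt/aerospike/test.dat
```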

While short-term benchmarks show everything in a good light, there were some really bad nightmares when our customers used filesystems in production. Since then we have been recommending against the filesystem.

Without going into too much detail, the fundamental issue is that the OS/filesystem is not very good at handling dirty memory pages when it is already low on memory. It gets really desperate/aggressive about reclaiming memory (memory compaction), and some of that happens under kernel locks. So, while the OS is busy doing those things, it blocks the process. Our Aerospike latency characteristics, which are very nice otherwise, took a real beating. We also observed an impact on network efficiency. We think this is because even network communication needs memory buffers per connection, and those are slowed down by the same issue. There were a few instances where the Linux OOM killer ended up killing the Aerospike process, since it is one of the biggest consumers of memory. Obviously, it cannot kill the filesystem, so it killed Aerospike.
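
If you want to see this pressure building on a node, these are the kinds of knobs and logs to look at (the commands are generic Linux examples, not something specific to our tests):

```
# Writeback thresholds that govern how much dirty page cache may accumulate
sysctl vm.dirty_background_ratio vm.dirty_ratio

# Current amount of dirty and under-writeback memory
grep -E 'Dirty|Writeback' /proc/meminfo

# Evidence of the OOM killer having taken out the server process
dmesg -T | grep -i -E 'out of memory|oom-killer'
```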

We could clearly see some kernel memory compaction functions (like isolate_freepages_block) taking too much CPU time when we profiled a process in distress. See these discussions of this function going into high CPU consumption: LKML: "Jim Schutt": excessive CPU utilization by isolate_freepages? ; http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-November/044895.html. We tried using /proc/sys/vm/drop_caches, running it almost every 5 minutes. It provided some cushion, but it did not really help.
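
For what it's worth, this is roughly what the profiling and the periodic cache drop look like (the cron schedule mirrors the ~5 minute interval mentioned above; the exact perf invocation is just an example):

```
# Sample kernel and userspace stacks; compaction functions such as
# isolate_freepages_block show up near the top when the node is in distress
perf top -g

# Crontab entry dropping the page cache roughly every 5 minutes
*/5 * * * * sync && echo 3 > /proc/sys/vm/drop_caches
```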


Thanks for sharing your experience, I will try to take a deeper look.