Using a filesystem to get better caching via the VFS page cache

We ran some quick tests by adding 2 more nodes to an existing ASDB cluster (ASDB = Aerospike with persistence, not in-memory storage) on two identical servers, but with different storage types: node 099 uses a filesystem, node 056 uses a raw block device, as recommended. Each node has 128 GB RAM, while the dataset per node is only ~100 GB. Graphs from Zabbix are below (combined into one image, as the AS forum doesn't allow uploading more than one).
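For context, the difference between the two nodes comes down to the `storage-engine` block in `aerospike.conf`. This is a hedged sketch, not our actual config: the namespace name, paths, and sizes below are placeholders.

```
# node 056: raw block device (recommended default) - reads bypass the Linux page cache
namespace test {
    storage-engine device {
        device /dev/sdb
        write-block-size 128K
    }
}

# node 099: data file on a filesystem - reads go through the VFS page cache
namespace test {
    storage-engine device {
        file /opt/aerospike/data/test.dat
        filesize 100G
        write-block-size 128K
    }
}
```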

The 1st spike is from when the nodes were introduced to the cluster.

There are several observations from this (measuring the CPU iowait value):

  1. once the dataset is in memory, there are almost no reads: the line stays flat for 099
  2. once the page cache was manually flushed on 099 (via `echo 3 > /proc/sys/vm/drop_caches`), around 04:40 on 15.09.2017, the system started reading from the drives to refill the cache, but iowait was still lower than on the node without cache (0.5% without cache vs 0.25% with cache); this suggests that a "hot" subset of the data is being served from cache, which helps
  3. refilling the cache / flattening the line again took around 3.5 hours, which also means we mostly work with subsets of the data, not everything at once
  4. the spikes around 3:00 and 20:00 on 15.09.2017 are from stress tests; the cached node clearly wins here, showing no impact at all.
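The flush experiment from observation 2 can be sketched roughly as follows. This is a minimal sketch: writing to `drop_caches` needs root, so it is guarded here, and the exact monitoring commands are up to you (we used Zabbix graphs).

```shell
#!/bin/sh
# Flush the Linux page cache and observe it refilling.

# Write out dirty pages first, so dropping caches only discards clean data.
sync

# Drop page cache, dentries, and inodes (root only; guarded so the
# script also runs unprivileged without erroring out).
if [ -w /proc/sys/vm/drop_caches ]; then
    echo 3 > /proc/sys/vm/drop_caches
fi

# The page-cache size is visible in /proc/meminfo; after a flush,
# "Cached" drops sharply and then grows back as reads come in.
grep -E '^(MemTotal|Cached):' /proc/meminfo

# To watch iowait and per-device reads while the cache refills:
#   iostat -x 5        (from the sysstat package)
```

On node 099 this is what produced the ~3.5-hour refill curve: after the flush, reads hit the drives until the hot subset of the data is back in the cache.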

We will remove two other nodes soon, making the dataset per node bigger, and I'll update the results.

Meanwhile, on staging, we will switch everything to the filesystem storage type to look for possible problems.