Persistency


#1

Hi, I need immediate write persistency after returning from aerospike_key_put. What is the probability that my data is on disk if a kernel panic happens after the data has been placed in the Streaming Write Buffer? Regards, Jakub


#2

By default we wait 1 second after a Streaming Write Buffer (SWB) has been allocated before flushing it to disk. This can be tightened by configuring a lower flush-max-ms, though with replication factor 2 it is unlikely that 2 nodes kernel panic at the same time.

In addition, if the disk is not keeping up with the write load, there may be multiple SWBs pending to be written. The amount of data that can be pending is configurable through max-write-cache, which by default is 64 MB; max-write-cache enables Aerospike to handle temporary write bursts that exceed the storage layer's I/O capacity.
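To get a feel for the numbers, here is a small sketch in C of how many full write buffers could be queued at once. It assumes the pending data is simply max-write-cache divided by the write block size (128K or 1M, the sizes mentioned later in this thread). The function name and this simple model are illustrative assumptions, not server code.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on the number of full write buffers that can be queued:
 * how many blocks of block_bytes fit into cache_bytes.
 * Both parameters are illustrative inputs, not values read from a server. */
uint64_t max_pending_blocks(uint64_t cache_bytes, uint64_t block_bytes)
{
    assert(block_bytes > 0);
    return cache_bytes / block_bytes;
}
```

With the default 64 MB cache, max_pending_blocks(64ULL * 1024 * 1024, 128ULL * 1024) gives 512 buffers of 128K, and the same cache holds 64 buffers of 1M.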


#3

To put it in perspective, no NoSQL product writes synchronously to disk. Everyone does some form of buffered writes (with different fsync intervals). This is in the interest of latency and throughput. A traditional RDBMS writes synchronously to disk to guarantee durability, but distributed systems solve this by writing duplicate copies to other nodes. As @kporter already mentioned, the chance of two nodes going down at exactly the same time is low.


#4

Jakub, please note that we do direct device access using O_SYNC. This means that the size of the buffer that can be lost is predictable (128K or 1M, depending on configuration).

And there is a second server. Data would be lost only if the second (or third) server also fails within this window.

In general, we have found most architectures recognize that in a case of multiple server failures within the same millisecond, there are likely to be application server failures at the same time. The server you report to in the case of multiple simultaneous server panics has probably also failed.

Thus, the server you are reporting “written” to will likely also have failed, or will see a “socket closed” error - which is an indeterminate case.

This “deep result” is described by Eric Brewer in “why banks don’t use ACID” - http://highscalability.com/blog/2013/5/1/myth-eric-brewer-on-why-banks-are-base-not-acid-availability.html .

Although we intend to move to higher levels of consistency, we find that most applications do best at READ_COMMITTED, which is our level of ACID.


#5

I’m in the middle of a proof of concept to use Aerospike as storage for a JMS server. It may sound stupid, but I will try to explain it later. To have valid support for the JMS server I need to meet the following:

Table 79: Shared Storage Criteria for Fault Tolerance

Criterion / Description
…
Synchronous Write Persistence: Upon return from a synchronous write call, the storage solution guarantees that all the data have been written to durable, persistent storage.

Let us assume that the probability of failure of one Aerospike node is 5% (a very pessimistic estimate). If we have 3 nodes and they fail independently, the probability that the whole Aerospike cluster fails is 0.05^3 = 0.0125%. With 5 nodes, availability is 99.99996875%. With these numbers we have a rather high level of confidence that the Aerospike cluster is ‘durable, persistent storage’. Changing the words ‘have been written’ to ‘is written’ would satisfy the required condition at the underlying device level. We can argue to the vendor that the ‘storage solution’ is Aerospike as a whole, and that we comply.
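The arithmetic above can be checked with a few lines of C. This is only a sketch under the same assumption of independent node failures; the function names are made up for illustration.

```c
#include <assert.h>

/* Probability that all n nodes are down at once, assuming each node
 * fails independently with probability p (the assumption made above). */
double all_nodes_fail(double p, unsigned n)
{
    assert(p >= 0.0 && p <= 1.0);
    double result = 1.0;
    for (unsigned i = 0; i < n; i++)
        result *= p;
    return result;
}

/* Availability in percent: the chance that at least one node survives. */
double availability_percent(double p, unsigned n)
{
    return (1.0 - all_nodes_fail(p, n)) * 100.0;
}
```

Here all_nodes_fail(0.05, 3) is about 0.000125 (that is, 0.0125%), and availability_percent(0.05, 5) is about 99.99996875.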

Now, about the stupid proof of concept. There is an enterprise-grade JMS server from a company (well known to Gartner) called Tibco. The EMS server is available for Windows, Linux, Solaris, AIX, HP-UX and maybe some other Unices. It is written in C/C++. It can use a clustered file system to achieve high availability, but it cannot be distributed, partitioned or replicated across data centers with out-of-the-box tools. But it should: it is 2015 now, not the late 90s.

I have written a small library shadowing glibc to trace I/O (http://1307723433353.blogspot.com/2014/12/ems-ds-tracer.html). It was needed to learn about EMS behaviour. EMS can use a single file for all queues/topics/etc., with usually linear writes at offsets aligned to 512 bytes. I have also written an AIO copying tool (http://1307723433353.blogspot.com/2014/12/how-to-copy-ems-datastore-for-interdc.html), but it needs 4-9 seconds to copy a 5GB file from RedHat’s GFS2 to local EXT4. 9 seconds is too much for maintaining consistency of the copied data.

Having a nice LD_PRELOAD library, I started a proof of concept to replace the read/write glibc calls with functions provided by the Aerospike C client. Storing file chunks inside NoSQL may sound stupid. I expected that the PoC would be a total failure with a lot of wasted work to translate EMS activities, but after 2 days I managed to set up EMS ready for heavy load tests. The code of the LD_PRELOAD library is here: https://drive.google.com/open?id=0B-f1Z0bEJlvQZ3prNk5xdzEybFU&authuser=0. It is quick and dirty, but Aerospike as storage, even with TCP/IP overhead, is still faster than a file (http://1307723433353.blogspot.com/2015/02/tibco-ems-on-aerospike-nosql-cluster.html).

I haven’t decided what to do with this unexpected success. If it is the right way to go, I would need to make the LD_PRELOAD library fully functional and feature complete, do extensive validation/crash tests and learn more about Aerospike. Writing distributed/replicated storage with performance optimizations myself would take a lot of time, but I am also considering this. HA/FT is a priority.
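To illustrate the idea of storing file chunks as records, here is a minimal sketch in C of how a write at a given file offset could be mapped to fixed-size chunks and record keys. This is not the code from the linked library: the 512-byte chunk size (taken from the write alignment mentioned above), the function names and the key layout are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE 512u  /* hypothetical: one record per 512-byte sector */

/* Index of the first and last chunk touched by a write of len bytes
 * starting at byte offset off. len must be non-zero. */
void chunk_range(uint64_t off, uint64_t len, uint64_t *first, uint64_t *last)
{
    assert(len > 0);
    *first = off / CHUNK_SIZE;
    *last = (off + len - 1) / CHUNK_SIZE;
}

/* Build a record key such as "datastore:42" for one chunk of a file.
 * Returns the snprintf result: the key length, or a value >= cap if
 * the buffer was too small. The key layout is a made-up convention. */
int chunk_key(char *buf, size_t cap, const char *file, uint64_t chunk)
{
    return snprintf(buf, cap, "%s:%llu", file, (unsigned long long)chunk);
}
```

For example, a 1536-byte write at offset 1024 touches chunks 2 through 4, and each of those chunks would then be stored under its own key through the client's put call.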


#6

Jakub, is there anything we can help with?


#7

Currently no. Thanks for all your replies.


#8

I have found quite serious synchronization bugs inside thr_info.c that bring the whole server down with SIGSEGV. Have you ever done static code analysis to feel more comfortable that all threads do what is intended?


#9

Jakub, could you open another forum item and share what you’ve found? Of course sample code with a crash is best. thr_info.c has the most straightforward locking system as it’s not meant to be particularly fast.

Yes, we have used static analysis tools. We spent several engineering months on them, and they found no actual bugs, only false alarms, one after another. Static analysis tools tend not to understand reference-counted multi-threaded C.


#10

Already done: Segmentation Fault of Server. I couldn’t wait for a fixed server package and started checking the source code to fix it myself. I was shocked that such a nasty and obvious bug lived unnoticed in production code. I’m slightly afraid of building my solution on Aerospike.


#11

As noted, this has been fixed. Thanks for your contribution.