Different clustering/replication design in your products, but have you folks or any of your larger users run similar tests on Aerospike operating at scale?
We do occasionally have nodes fail in user clusters due to failing disks, network glitches, etc. (and bugs of course, albeit rarely). This is not considered an event in most use cases, especially if the cluster has been correctly sized. The data just rebalances across the remaining nodes (what we call migrations), and clients get an updated partition map within a second of a cluster change, so they are informed of the new owner for each partition (see details here: http://www.aerospike.com/docs/architecture/clustering.html#what-happens-when-a-node-fails ).
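To make the rebalancing idea above a bit more concrete, here is a minimal Python sketch of a partition map changing when a node drops out. It is not Aerospike's actual partition-assignment algorithm: it uses plain rendezvous hashing as a stand-in, ignores replicas, and the node names and function names (`owner`, `partition_map`, `moved_partitions`) are made up for illustration. The only Aerospike-specific detail is the count of 4096 partitions per namespace.

```python
import hashlib

N_PARTITIONS = 4096  # Aerospike divides each namespace into 4096 partitions


def owner(partition_id, nodes):
    """Pick an owner by rendezvous hashing (an illustrative stand-in,
    not Aerospike's real assignment algorithm)."""
    def score(node):
        digest = hashlib.sha256(f"{node}:{partition_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    # The node with the highest score for this partition owns it.
    return max(nodes, key=score)


def partition_map(nodes):
    """Build the partition -> owner map that clients would be handed."""
    return {pid: owner(pid, nodes) for pid in range(N_PARTITIONS)}


def moved_partitions(before, after):
    """Partitions whose owner changed between two maps."""
    return [pid for pid in before if before[pid] != after[pid]]


nodes = ["A", "B", "C", "D", "E"]  # hypothetical node names
before = partition_map(nodes)

# Node C fails: the map is recomputed over the surviving nodes.
after = partition_map([n for n in nodes if n != "C"])
moved = moved_partitions(before, after)

print(f"{len(moved)} of {N_PARTITIONS} partitions changed owner")
```

In this toy model, only the partitions that the failed node owned get a new owner; every other entry in the map is unchanged, and bringing the node back reproduces the original map. A real cluster also has to move replica data around, which is the migration work described above.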
Also, our regular test scenarios include cluster perturbations. Typically, while under load: removing a node, adding it back with data, adding a new empty node, cutting the network on a node, and simulating network congestion.
Meher, thanks for your response. Would love to see something more formal sometime on such ‘cluster perturbations’ on, say, a 4-5 node cluster operating at that 200-300k TPS level of scale, and hear how it behaves. I’m guessing most folks here would…
Let me see what I can do… And this is actually very easy to do… just need to find the time and machines
Meher, no specific rush. I personally like most of the architectural, design and usability decisions you folks have made - you certainly can’t argue with the fact you’ve made using Aerospike a very pleasant experience. But while real performance numbers from big clients are great and help sell on the performance side of things, reliability is now somewhat ‘front and center’ in the wake of this ‘redis kill -9’ business. Folks considering going ‘all in’ are likely to now want to have at least one [formal] at-scale reliability test they can reference in the course of their own internal or external sell to use your technology.
Thanks for the feedback! I am on the same page as you on this one. Let me look internally to see if we have something available to publish; otherwise, I will definitely follow up to have such a reliability test reference published.
Meher, that would be great. Look forward to it. Also, setting it up, running it, and reporting on it might make for some good blog posts and a decent twitter stream at the time of the testing as well.
I actually found a demo we captured on YouTube a couple of years ago, which unfortunately is not directly linked from our website: https://www.youtube.com/watch?v=-Hn0_btDkL8&list=PLGo1-Ya-AEQDwCefjdcEvvuxYBEd31Bo2 . A node is taken down at the 2:03 mark in the video.
I will still follow up to try to get a newer one with setup details.