Use of Scan to fetch Critical data


#1

Hi, I have a set that holds critical data, that cannot be fetched partially. Because new records can be added to this set dynamically, I am using a scan to retrieve all the data. Lately I realized that scans do not return precise results while there is a cluster change. This is obviously problematic for this type of critical data. What is the best way to read a full set, when you don’t know the keys that exist in it?


#2

What is the size of your ‘critical set’ (itemcount, avg. size per item) ? After reading both of your threads, I would tend to try going with a single record that contains a Large List (LDT)-bin and move the logic in a UDF (if possible). If you can return what you need from that UDF, you have guaranteed atomicity due to the record being locked during execution. Something that I think will be very hard to achieve with any type of scan. Large lists support some kind of scan feature, too.

If you explain more about your usecase and give us some rough estimates on desired throughput for that list (and: is the namespace in-memory or on ssd?) we could help you trying to achieve what you need.

Cheers, Manuel


#3

The data in this set is String keys and records that hold bin of type String and values of type String. Approximatley 1000 rows. A client can update a record, add a record or remove a record. Whenever one of this operations happen, the client also updates a record the represents the version of the set. All of our clients check this “version record” every 1 second, and using that mechanism, we know that something happend to the set, and we Scan it all again. This is a kind of Publish/Subscribe mechanism for changes. We are not afraid from atomicity. How would you model this type of set, and how would you read all the records upon a change?


#4

If I’d be modelling this, I would try the following approach and test if it’s allowing for enough events/sec:

  • Have 1 record that contains normal list or LDT that contains the ID/primary key and a version number for every record (“super record”) of the ‘set’.
  • Keep the records in the set like you have.
  • Modify the subscribe-client to compare the data it already got with the version numbers on a per-element base (if you are not already doing it like that from meta data of scan).
  • Modify the publisher logic to change the version numbers of all updated elements in that super record. It would be smart to use Aerospikes internal generation count a.k.a. version number for the updated records.
  • For subscriber polling you can make use of the generation count of the super record. I would write a record udf that takes the last generation number the subscriber knows as input and either returns a “No change” response OR starts to dump the list values (record PK + ‘version’). If subscriber gets a list and notices there is like 1 new record, he only requests that 1 record from Aerospike and has an up-to-date representation of the set.

This should work for atleast 3-5k events/sec if your namespace is in-memory and the list doesn’t grow to more than a few thousand entries… The scalability of this approach is limited because you have 1 very hot record. If that one is locked because somebody else is reading from/writing to it, all other requests have to wait. Would that be sufficient for you?

With NoSQL there are many possible approaches to model this… This one came to my mind at first but there might be other ways to do it. I don’t see a problem in using an in-memory namespace for the super record (and maybe ssd for the records itself) as it should be reconstructable in case of data loss if the clients update the records accordingly.

If the records will be stored on a drive/disk AND you have only 1 instance, make sure to read about how to permantly delete records with AS (touch with low TTL on it before calling delete()) to not suffer from zombie records if one day you need to restart (topic covered a lot in forum). This will not be necessary if you always keep atleast 1 node of a cluster online.

Cheers, Manuel


#5

Hi, thank you for your elaborated answer. What would you do if you have several “writers” that may add records. This would cause a concurrency problem when two “writers” will try to update the “super record”. First one reads state X. Second one reads state X. First one wants to add record, so he rewrites the super record with X + 1. Second one wants to add records, so he rewrites the super record with X + 1, erasing the change of the first writer.


#6

Hi, there are two approaches here:

  • Client api’s support a concept called “compare-and-set”. This is a write/PUT with a condition (“only update if generation is the same that I read”). An example with java-client is here: L84-87 at http://git.io/v08x5 . In case it fails you need to manually retry or fallback to some UDF-approach guaranteed to succeed due to locking. Depending on how many collisions you see, this can be much faster than the second approach.
  • You can create another record udf for updating the list itself in one, atomic operation. Records are locked while UDFs execute. This UDF needs to support additions, deletions and updates from a single execution… If you integrate logic like “only delete item if it’s still in list” or “update only if generation of list item is lower than x” and “only add if not already in list”, you should be able to call the UDF without any CAS on the super-record. Expecting less throughput from this though due to LUA overhead from this.

Does this help you to model your pub/sub? Let me know how you implemented it or how it worked out if you benchmark this. I’d be very interested since we have to implement something similar next year.

Cheers, Manuel


#7

Hi thanks.

Will this solution work correctly in those situations:

  1. Multi aerospike nodes and multi writers/readers. In Cassandra for example, the concept of “Eventual Consistency” sounds contradicting to concurrent “compare-and-set” on multi nodes.
  2. Split brain

#8

To start off: not an expert on this cluster-consistency topic. One of the AS folks and/or docs might be able to confirm or correct what I say.

But to give you an idea…

  1. A single record always has 1 node as master… You can configure to “read from master” or “AS_POLICY_CONSISTENCY_LEVEL_ALL”, which includes the old master during migration if I remember it correctly and “commit to all replicas” for writes. See here: http://www.aerospike.com/docs/client/c/usage/consistency.html about your options. If you use the AS-generation numbers within your list, you only need that higher consistency by default on the super record. For reading updated records, you can first query from master or any replica and just in case that version is behind re-request with AS_POLICY_CONSISTENCY_LEVEL_ALL. I think, this will give you a pretty solid concistency level in 99.999% of the cases… It won’t get any better with any other clustered solution out there.

However, higher consistency might interfere with availability in case of migrations or unreachable nodes… This is where I’m leaving familiar terrain! This is why I would like to leave 2) totally to the docs or AS staff. It might be faster to create a stack overflow question or new thread if you want a fast response about this.

But as you have noticed, the scan functionality suffers from inconsistency during migrations. With this solution you can reach better consistency but I’m afraid you might have to trade in some availability to achieve this. But with committing to multiple nodes (in-memory), this might be a unneccessary worry (as they all know the most recent version). Split brain causes me headaches… but I see no way one could find a solution for this other than not accepting writes in such a situation (that’s on their roadmap, see here: Stop Writes). Would that be ok? Anyways, from here one you’ll most likely find more truth from the docs or source code of AS than I can provide on that topic. But TBH: we don’t care about split-brain over here. If that happens, we’ll ignore it and manually intervene, if necessarry. I haven’t heard of anybody that experienced this, because AS-clusters tend to be much smaller than Cassandra or others require, which makes this case kinda unlikely if deployed with fallback network connectivity. Sorry, but I can’t solve CAP…

Cheers, Manuel