Handling timeout in case of counter bin


#1

Hi, I have a requirement to store a counter in Aerospike, using the Java client. I plan to have a set with a single integer bin. The number of keys would be ~1K. Clients will perform the add() operation to increment or decrement. We have multiple client servers that would be calling add() against the Aerospike nodes in parallel. I am a little skeptical about the timeout scenario, where the client wouldn’t know if the add() op succeeded and may retry, causing an overcount. Any suggestions to handle this?
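A minimal sketch of the plan described above, assuming a local node and hypothetical namespace, set, and bin names (this requires a running Aerospike server to actually execute):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;

public class CounterExample {
    public static void main(String[] args) {
        // Hypothetical host, namespace, set, and bin names.
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key key = new Key("test", "counters", "page-views");

        client.add(null, key, new Bin("count", 1));   // increment by 1
        client.add(null, key, new Bin("count", -1));  // decrement by 1

        client.close();
    }
}
```

The timeout question below is exactly about what happens when one of these add() calls throws instead of returning.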


#2

For counters you’d use the optimization of data-in-index storage, which places the 8B integer in the primary index metadata entry. This means you don’t spend an extra 8B on top of the initial 64B of metadata. Also, the read/write operations don’t have to first find the metadata in the primary index, then skip to the storage location in DRAM - it happens in place, which means much lower latency.

For in-memory namespaces, if you want to push latencies down further, you should be making use of more sprigs in your configuration (partition-tree-sprigs) and CPU pinning (auto-pin true).
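As an illustration only, those knobs live in the server config; a hypothetical aerospike.conf fragment (exact values depend on your cluster, and note that auto-pin takes a mode such as cpu):

```
service {
    auto-pin cpu                  # pin service threads to CPUs
}

namespace counters {
    memory-size 4G
    storage-engine memory
    partition-tree-sprigs 4096    # more sprigs -> shallower index trees, lower lock contention
    single-bin true               # required for data-in-index
    data-in-index true            # store the 8-byte integer in the index entry itself
}
```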

Read up on the request flow: the standard Aerospike configuration is such that every update locks the record before modifying it. The node holding the master partition for the record communicates the replica write to the node(s) holding its replica partitions. Only after the local write and the replica writes are accepted does the client get back a success on the write.

If you have multiple Aerospike nodes, and are concerned about consistency in the rare event of a cluster splitting, you should be reading up on the strong consistency mode of Aerospike Enterprise Edition 4. You have both Linearizable Strong Consistency and Sequential Strong Consistency as consistency modes to choose from.

If you feel skeptical about a system you don’t know, you should be doing two things. First, educate yourself - I provided documentation for you to read. Second, test it for yourself. The community can help you with configuration tips. You have the asbenchmark tool (in the tools package) to assist you with testing your cluster’s performance after you configure it.


#3

@Ronen I think the question is more fundamental and simple, and quite interesting. The client sends a TCP/IP request to the Aerospike server to increment a bin. Aerospike updates the bin and sends a TCP/IP message back to the client. Multiple clients are updating the same counter bin.

[C1] --> [TCP/IP msg1] --> Aerospike – Aerospike does its locking etc., updates the bin.

[Aerospike] --> [TCP/IP msg2 – Hey C1, I did your update] --> C1

C1 does not know that Aerospike did the update until it gets msg2 back. I believe TCP/IP will guarantee delivery of msg1 to Aerospike or report its failure to the client (am I correct on that?). But if the client does not get msg2 back, what should it do? How should C1 handle this – and assume multiple clients are queued up to increment the same bin. I have thought about Check-and-Set, but again, a timeout on not getting msg2 still leaves you in an ambiguous state. Any suggestions?


#4

To rephrase the question, “is there a way to know if a write definitely applied or definitely didn’t apply?”

First, timeouts are not the only error you should be concerned with. Newer clients have an ‘inDoubt’ flag associated with errors that will indicate that the write may or may not have applied.
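A sketch of checking that flag with the Java client (assumes a hypothetical key and bin; the flag is exposed on AerospikeException via getInDoubt(), and a server must be reachable for this to run):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;

public class InDoubtExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key key = new Key("test", "counters", "page-views");
        try {
            client.add(null, key, new Bin("count", 1));
            // No exception: the increment definitely applied.
        } catch (AerospikeException ae) {
            if (ae.getInDoubt()) {
                // The write may or may not have applied; a blind retry risks overcounting.
            } else {
                // The write definitely did not apply; it is safe to retry.
            }
        } finally {
            client.close();
        }
    }
}
```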

There isn’t a built-in way of resolving an in-doubt transaction to a definitive answer, and if the network is partitioned, there isn’t a way in AP mode to rigorously resolve in-doubt transactions. Rigorous methods do exist for ‘Strong Consistency’ mode; the same methods can be used to handle common AP scenarios, but they will fail under partition.

The method I have used is as follows:

  1. Each record will need a list bin; the list bin will contain the last N transaction-ids.
    • For my use case, I gave each client a unique 2-byte identifier, each client thread a unique 2-byte identifier, and each client thread a 4-byte counter. A particular transaction-id packs the two ids and the counter into a single 8-byte value.
  2. Read the record’s metadata with the getHeader API - this avoids reading the record’s bins from storage.
    • Note - my use case wasn’t an increment, so I actually had to read the record and write with a generation check. This pattern should be more efficient for a counter use case.
  3. Write the record using operate and gen-equal to the read generation, with these operations: increment the integer bin, prepend your transaction-id to the txns list, and trim the list to the maximum size you selected.
    • N needs to be large enough that a client can be sure to have enough time to verify its transaction given the contention on the key. N affects the stored size of the record, so choosing too big will cost disk resources and choosing too small will render the algorithm ineffective.
  4. If the transaction is successful then you are done.
  5. If the transaction is ‘inDoubt’, read the key and check the txns list for your transaction-id. If present, your transaction ‘definitely succeeded’.
  6. If your transaction-id isn’t in txns, repeat step 3 with the generation returned from the read in step 5.
  7. Loop back to step 4 to evaluate the retry - with the exception that in step 5 a ‘generation error’ would also need to be considered ‘in-doubt’, since it may have been the previous attempt that finally applied.
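The steps above can be sketched with the Java client. This is a hedged sketch, not kporter’s actual code: the namespace, set, bin names, and N are hypothetical, the retry loop is elided, and it assumes the client’s operate(), ListOperation, and generation-check policy:

```java
import com.aerospike.client.*;
import com.aerospike.client.cdt.ListOperation;
import com.aerospike.client.policy.GenerationPolicy;
import com.aerospike.client.policy.WritePolicy;

public class VerifiableIncrement {
    static final int MAX_TXNS = 64;  // "N": sized for the expected contention on the key

    // Pack a 2-byte client id, 2-byte thread id, and 4-byte counter into one 8-byte transaction-id.
    static long txnId(int clientId, int threadId, long counter) {
        return ((long) (clientId & 0xFFFF) << 48)
             | ((long) (threadId & 0xFFFF) << 32)
             | (counter & 0xFFFFFFFFL);
    }

    static void increment(AerospikeClient client, Key key, long txnId) {
        // Step 2: read only the record metadata to learn the current generation.
        Record header = client.getHeader(null, key);
        int gen = (header == null) ? 0 : header.generation;

        // Step 3: increment, prepend the txn id, and trim - all in one gen-checked operate().
        WritePolicy wp = new WritePolicy();
        wp.generationPolicy = GenerationPolicy.EXPECT_GEN_EQUAL;
        wp.generation = gen;
        try {
            client.operate(wp, key,
                Operation.add(new Bin("count", 1)),
                ListOperation.insert("txns", 0, Value.get(txnId)),
                ListOperation.trim("txns", 0, MAX_TXNS));
            // Step 4: success - done.
        } catch (AerospikeException ae) {
            if (ae.getInDoubt() || ae.getResultCode() == ResultCode.GENERATION_ERROR) {
                // Step 5: read back and look for our txn id in the list.
                Record rec = client.get(null, key, "txns");
                if (rec != null && rec.getList("txns").contains(txnId)) {
                    return;  // definitely succeeded
                }
                // Step 6: not found - retry step 3 with the freshly read generation (loop omitted).
            }
        }
    }
}
```

The txnId packing is the only pure part; everything else needs a live cluster.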

Also consider that reading the record in step 5 and not finding the transaction-id in txns does not ensure that the transaction ‘definitely failed’. If you wanted to leave the record unchanged but still have a ‘definitely failed’ semantic, you would need to have observed the generation move past the previous write’s gen-check policy. If it hasn’t, you could replace the operation in step 6 with a touch: if the touch succeeds, the initial write ‘definitely failed’; if you get a generation error, you will need to check whether you raced the application of the initial transaction - the initial write may now have ‘definitely succeeded’.

Again, with ‘Strong Consistency’ the mentions of ‘definitely succeeded’ and ‘definitely failed’ are accurate statements, but in AP these statements have failure modes (especially around network partitions).


Understanding Timeout and Retry policies
#5

@Sujeet_Jaiswal The question you ask is fundamental to any database - Relational or NoSQL - doesn’t matter. In a pure ACID relational database you will deal with this by either generating a unique transaction id in the client or getting a unique transaction id from the database and then invoking a rollback should you not get a confirmation of success. In Aerospike, you will have to implement that functionality yourself using the methodology proposed by @kporter - and one could improvise many flavors of it depending on the use case.

The real business question you must ask before embarking on a purist solution is: why do you need such consistency on a counter? Does it matter to your business use case if a few counts were missed? What is the true frequency of your timeouts? Can you drive timeouts down to zero or near zero for your network by adjusting the timeout parameters? Counters are mostly used to gauge effectiveness - is the difference between 90% and 91% crucial for the business?
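A sketch of the knobs involved, with illustrative values only (field names are from the Java client’s WritePolicy; the worst-case helper is my own rough assumption, ignoring server-side latency). Setting maxRetries to 0 for writes sidesteps the retry-driven overcount entirely, at the cost of surfacing more errors to your own code:

```java
import com.aerospike.client.policy.WritePolicy;

public class TimeoutTuning {
    public static WritePolicy counterWritePolicy() {
        WritePolicy wp = new WritePolicy();
        wp.socketTimeout = 100;        // ms per attempt on the socket
        wp.totalTimeout = 250;         // ms budget across all attempts
        wp.maxRetries = 0;             // never auto-retry writes: a retry can double-count
        wp.sleepBetweenRetries = 0;    // irrelevant with maxRetries = 0
        return wp;
    }

    // Rough upper bound on client wall-clock time for given settings (pure helper).
    public static int worstCaseMillis(int socketTimeout, int maxRetries, int sleepBetweenRetries) {
        return (maxRetries + 1) * socketTimeout + maxRetries * sleepBetweenRetries;
    }
}
```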

So purist solutions exist; they are painful, but they can be implemented. Do you really need one? If you do, implement a transaction-id based list in another bin.


#6

Thanks @rbotzer, @kporter @pgupta for the detailed suggestions. It will definitely help me in implementing my solution.