The current implementation of batch reads uses a scatter-gather approach. The batch request is sent (scattered) to all nodes in the cluster, the individual nodes return the records they have, and the client stitches together (gathers) the responses from the individual nodes and passes the combined results on to the application.
These aren’t atomic operations. The individual node gets the batch request, reads the record digests for the requested records from the primary index tree, and then starts reading the individual records from storage (memory or disk). Any changes to the records (delete, expiry, eviction) or any changes to the cluster state (migration) would affect the records returned by the batch call.
Consider a 4-node cluster, 100 records in the batch request, and ~25 records returned by each node. A 5th node is added, and with migrations kicking in you might only have ~20 records on each node; the batch request that was initiated against the 4-node cluster might get you ~80 records and wouldn’t find the ~20 records that moved to other nodes.
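To make the client-side mechanics concrete, here is a minimal sketch of a batch read using the Aerospike Java client; the host address, namespace, set, and key names are placeholder assumptions. The client splits the key list by partition ownership, sends a sub-request to each owning node, and returns one array slot per key, with `null` marking a record the asked node could not find (for example because it had migrated away, as described above).

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.BatchPolicy;

public class BatchReadSketch {
    public static void main(String[] args) {
        // Connect to one seed node; the client discovers the rest of the cluster.
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Build the batch of keys (namespace "test" and set "demo" are placeholders).
        Key[] keys = new Key[100];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = new Key("test", "demo", "key-" + i);
        }

        // Scatter: the client splits this list by partition ownership and sends a
        // sub-request to each node. Gather: it merges the replies into one array.
        Record[] records = client.get(new BatchPolicy(), keys);

        // records[i] corresponds to keys[i]; a null slot means that key was not
        // found on the node that was asked (possibly because it migrated).
        for (int i = 0; i < keys.length; i++) {
            if (records[i] == null) {
                System.out.println("Not found: " + keys[i].userKey);
            }
        }
        client.close();
    }
}
```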
I’m specifically interested in the read behavior when the write master fails. By default, reads go to the write master. For non-batch reads, when the write master fails, a prole read (via proxy) happens. Is this the same for batch reads too?
Also, regarding the situation you explained above (a batch request of 100 records where each node returned ~25), wouldn’t the same thing happen for non-batch reads too? For example: consider a 2-node cluster where each node has 50 records. Now a single read request comes in and a 3rd node is added; with migrations kicking in, the data might no longer be present on the original node. Is this the right behaviour?
Normal read and batch read operations differ in certain respects.
Consider the normal read situation:
The app requests key:A, the client determines that Node 1 is the master for this key/record, and the request is sent to Node 1. In the meantime, the record is no longer on Node 1. Node 1 would proxy the request to the new master of key:A, or get into duplicate resolution in case it is still the master but doesn’t have the key/record.
Consider the batch read situation:
The app requests key:A (and other keys), the client determines that Node 1 is the master for key:A, and the request is sent to Node 1. In the meantime, the record is no longer on Node 1. A batch read doesn’t do proxying or duplicate resolution and would simply return an empty record indicating that the record couldn’t be found.
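Since a single-key read does go through proxying and duplicate resolution on the server, one way to compensate for this difference is to re-read the `null` slots of a batch result individually. The sketch below assumes the Aerospike Java client; the class and method names are hypothetical.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.BatchPolicy;
import com.aerospike.client.policy.Policy;

public class BatchWithFallback {
    // Batch-read the keys, then retry any misses as single-key reads, which
    // (unlike batch sub-requests) benefit from server-side proxying and
    // duplicate resolution while migrations are in flight.
    public static Record[] getWithFallback(AerospikeClient client, Key[] keys) {
        Record[] records = client.get(new BatchPolicy(), keys);
        Policy readPolicy = new Policy();
        for (int i = 0; i < keys.length; i++) {
            if (records[i] == null) {
                // A null slot may be a genuinely absent record or one that moved
                // mid-batch; the single-key get resolves either case.
                records[i] = client.get(readPolicy, keys[i]);
            }
        }
        return records;
    }
}
```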
Yes, this answers my question. I just have one more question: if I retry the failed batch request, will it go to a different node or the same failed node? Also, will the above behavior remain the same if we read from replicas?
Batch requests will return incomplete results only if the records migrate due to a cluster state change between the start of the batch request execution and the actual record read. If the batch request is re-executed, then going by the probability that the cluster won’t go through another cluster state change in that short time span, you should get the complete results.
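On the retry and replica questions, a rough sketch of the relevant policy knobs in the Java client is below. The assumption here is that by the time the batch is re-executed, the client's partition map has caught up with the new cluster state, so the retried sub-requests are routed to the nodes that currently own the records rather than back to the node that missed them; `Replica.SEQUENCE` additionally lets a retry fall back to a replica instead of always targeting the master. The specific values are illustrative, not recommendations.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.BatchPolicy;
import com.aerospike.client.policy.Replica;

public class BatchRetrySketch {
    public static Record[] readBatch(AerospikeClient client, Key[] keys) {
        BatchPolicy policy = new BatchPolicy();
        // Let the client retry failed batch sub-requests automatically.
        policy.maxRetries = 2;
        policy.sleepBetweenRetries = 50; // milliseconds between attempts
        // Try the master first and fall back to a replica on retry,
        // instead of always reading from the master.
        policy.replica = Replica.SEQUENCE;
        return client.get(policy, keys);
    }
}
```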