Obtaining the generation of a record's master and prole versions

Hello, we are investigating an interesting edge case in our system that can only happen if a record has different generations across its copies. Our cluster is known to be unstable: constant timeouts with inDoubt=true/false. Our write/read policies are the defaults, meaning we wait for all replicas to acknowledge a write before returning to the client, and reads go sequentially to the master and prole copies. The only viable case we came up with is a write that "succeeded" during a timeout with inDoubt=true, followed by a read from a replica that does not yet have the latest version of the record.

So my questions are:

  1. Is this possible at all?
  2. Is there a way to monitor this “replication lag” between master and prole partitions?
  3. Is there a way to check a single record's generation on all of its copies (master + prole)?

Thanks in advance!

  1. If this is an AP cluster then yes (especially in the unstable cluster that you describe); no if it is strong-consistency.
  2. Replication via migrations happens in digest order per partition, so the lag in migrations isn't the same as the lag in XDR. We only have the partitions-remaining stats for migrations.
  3. You could issue a debug-record-meta info command to all nodes: debug-record-meta:namespace=<ns-name>;keyd=<hex-digest>. Note that this bypasses duplicate resolution, so it isn't the same as the client targeting that node for a read.
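As a sketch, the info command above could be issued against each node with asinfo; the node addresses, namespace, and digest below are placeholders you would substitute for your own:

```shell
# Hedged sketch: ask every node for its copy's metadata, then compare
# the generation field across responses. <ns-name> and <hex-digest>
# are placeholders for your namespace and the record's digest in hex.
for node in 10.0.0.1 10.0.0.2 10.0.0.3; do
  echo "--- $node ---"
  asinfo -h "$node" -v "debug-record-meta:namespace=<ns-name>;keyd=<hex-digest>"
done
```

If the master and prole report different generations in the output, you have caught the record in a multi-generation state.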

Isn’t XDR cross-datacenter replication? I am talking only about a single cluster here. Also, would it help if I configure my reads to go only to the master partition of a key, since I am not worried about concurrent access to it?

You’d mentioned “replication lag”, so I assumed that you were familiar with XDR’s lag stats. I was simply saying that, due to the difference in how they operate, such stats aren’t possible for partition migrations.

You mean for the purpose of getting the record’s version from the master and prole? If you target a partition that is migrating, the latest copy may sometimes not be on the master node. When you configure the client with read_consistency_level_all, it means we should check the other copies (read duplicate resolution) in such situations. This duplicate resolution is driven by the partition’s master (the client isn’t the one sending a request to all replicas). So if you configure read_consistency_level_one (the default), in such situations you will only read from the master node (other nodes will proxy to the master).
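For reference, with the Aerospike Python client the replica behavior discussed here is selected per read policy; the following is a small policy fragment (constant names as documented for that client, not verified against your client version):

```python
import aerospike

# Hedged policy fragments, to be passed as the `policy` argument of a read.
# Default: try replicas in sequence, starting from the master.
seq_policy = {'replica': aerospike.POLICY_REPLICA_SEQUENCE}

# Pin every read to the node currently owning the master partition.
master_policy = {'replica': aerospike.POLICY_REPLICA_MASTER}
```

The equivalent knobs exist in the other clients (e.g. Replica.SEQUENCE vs. Replica.MASTER in the Java client's read Policy).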

I am not talking about migrations. Maybe I am missing something, but let me clarify.

We suspect we’ve hit a case where

  1. We write a record to aerospike.
  2. Then the transaction times out.
    1. From further investigation we could deduce that, for this write transaction, at least one copy of the record (master or prole) was updated.
  3. We do a read to the same key milliseconds later.
    1. Based on the behavior of our system, the logs, and the fact that this is happening for the first time, the only logical explanation was that the read transaction read a stale copy of the record (master or prole).

During all this time, aside from the constant timeout issues we had, no migrations (or anything of the sort) were involved; the cluster size metric was stable. Since our read policy used the default values, meaning the sequence replica setting: if all our assumptions are possible and true, could we mitigate this by changing that setting to master, assuming write transactions always attempt the write on the master partition first?
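To make the suspected sequence concrete, here is a small self-contained simulation. All names are hypothetical; it models a record as a (master, prole) pair of generations, not the actual server behavior, and "sequence" here simply means a retried read may land on either copy:

```python
# Hypothetical model: a record held as (master_gen, prole_gen). A write
# that times out with inDoubt=true may have landed on the master while
# replication to the prole never completed.

def timed_out_write(copies):
    """Apply the write on the master only; the prole falls one behind."""
    master_gen, prole_gen = copies
    return (master_gen + 1, prole_gen)

def read(copies, replica_policy, attempt=0):
    """'master' pins every read to the master copy; 'sequence' (as
    modeled here) lets a retry land on the prole copy."""
    master_gen, prole_gen = copies
    if replica_policy == 'master':
        return master_gen
    return master_gen if attempt % 2 == 0 else prole_gen

copies = (5, 5)                    # record in sync, generation 5 everywhere
copies = timed_out_write(copies)   # inDoubt write lands on master only

print(read(copies, 'sequence', attempt=1))  # stale prole read -> 5
print(read(copies, 'master'))               # always the latest -> 6
```

Under this toy model, pinning reads to the master does hide the stale prole, which matches the mitigation asked about above (with the caveat given in the reply that follows about recluster events).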

Definitely possible in AP, especially in an unstable cluster environment that you described. Reading a stale prole isn’t the only possibility in such situations.

I assumed the opposite based on “Our cluster is known to be unstable”.

Changing from sequence to master would mitigate this particular scenario, but not if the unstable cluster also causes recluster events.

Now that I understand which type of replication you are after: there are micro-benchmarks you can enable that show how much time a write transaction spends in its various phases. In particular, see write-repl-write.
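For illustration, those per-namespace write micro-benchmarks are enabled with a dynamic config change (the namespace name below is a placeholder; check the config reference for your server version):

```shell
# Turn on the write benchmark histograms for namespace <ns-name>.
asinfo -v "set-config:context=namespace;id=<ns-name>;enable-benchmarks-write=true"
# The {<ns-name>}-write-repl-write histogram in the server log then shows
# how long writes spend waiting on replica acknowledgement.
```

A long tail in that histogram during your timeout windows would support the theory that replica writes were the part that failed to complete.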