Obtaining the generation of a record's master and prole versions

Hello, we are investigating an interesting edge case in our system that can only happen if a record has different generations across its copies. Our cluster is known to be unstable: constant timeouts with inDoubt=true/false. Our write/read policies are the defaults, meaning we wait for all replicas to acknowledge a write before returning to the client, and reads go sequentially to the master and prole copies. The only viable case we came up with is a write that "succeeded" during a timeout with inDoubt=true, followed by a read from a replica that does not yet have the latest version of the record.

So my questions are:

  1. Is this possible at all?
  2. Is there a way to monitor this “replication lag” between master and prole partitions?
  3. Is there a way to check a single record's generation on all of its copies (master + prole)?

Thanks in advance!

  1. If this is an AP cluster then yes (especially in the unstable cluster that you describe); no if it is strong-consistency.
  2. Replication via migrations happens in digest order per partition, so the lag in migrations isn't the same as the lag in XDR. We only have the partitions-remaining stats for migrations.
  3. You could issue a debug-record-meta info command to all nodes: debug-record-meta:namespace=<ns-name>;keyd=<hex-digest>. Note that this bypasses duplicate resolution, so it isn't the same as the client targeting that node for a read.
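As a sketch, the info command above could be issued against each node with asinfo; the node addresses, namespace, and digest below are placeholders you would substitute for your own:

```shell
# Hedged sketch: ask every node for its copy's metadata, then compare
# the generation field across responses. <ns-name> and <hex-digest>
# are placeholders for your namespace and the record's digest in hex.
for node in 10.0.0.1 10.0.0.2 10.0.0.3; do
  echo "--- $node ---"
  asinfo -h "$node" -v "debug-record-meta:namespace=<ns-name>;keyd=<hex-digest>"
done
```

If the master and prole report different generations in the output, you have caught the record in a multi-generation state.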

Isn’t XDR cross-datacenter replication? I am talking only about a single cluster here. Also, would it help if I configure my reads to go only to the master partition of a key, since I am not worried about concurrent access to it?

You’d mentioned “replication lag”, so I assumed that you were familiar with XDR’s lag stats. I was simply saying that, due to the difference in how they operate, such stats aren’t possible for partition migrations.

You mean for the purpose of getting the record’s version from the master and prole? If you target a partition that is migrating, the latest copy may sometimes not be on the master node. When you configure the client with read_consistency_level_all, it means we should check the other copies (read duplicate resolution) in such situations. This duplicate resolution is driven by the partition’s master (the client isn’t the one sending a request to all replicas). So if you configure read_consistency_level_one (the default), in such situations you will only read from the master node (other nodes will proxy to the master).
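For reference, with the Aerospike Python client the replica behavior discussed here is selected per read policy; the following is a small policy fragment (constant names as documented for that client, not verified against your client version):

```python
import aerospike

# Hedged policy fragments, to be passed as the `policy` argument of a read.
# Default: try replicas in sequence, starting from the master.
seq_policy = {'replica': aerospike.POLICY_REPLICA_SEQUENCE}

# Pin every read to the node currently owning the master partition.
master_policy = {'replica': aerospike.POLICY_REPLICA_MASTER}
```

The equivalent knobs exist in the other clients (e.g. Replica.SEQUENCE vs. Replica.MASTER in the Java client's read Policy).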

I am not talking about migrations. Maybe I am missing something, but let me clarify.

We suspect we’ve hit a case where

  1. We write a record to aerospike.
  2. Then the transaction times out.
    1. From further investigation we could deduce that, for this write transaction, at least one copy of the record (master or prole) was updated.
  3. We do a read to the same key milliseconds later.
    1. Based on the behavior of our system, the logs, and the fact that this is happening for the first time, the only logical explanation was that the read transaction read a stale copy of the record (master or prole).

During all this time, aside from the constant timeout issues we had, no migrations (or anything of the sort) were involved; the cluster size metric was stable. Since our read policy used the default values, meaning the sequence replica setting: if all our assumptions are possible and true, could we mitigate this by changing that setting to master, assuming write transactions always attempt the write on the master partition first?
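To make the suspected sequence concrete, here is a small self-contained simulation. All names are hypothetical; it models a record as a (master, prole) pair of generations, not the actual server behavior, and "sequence" here simply means a retried read may land on either copy:

```python
# Hypothetical model: a record held as (master_gen, prole_gen). A write
# that times out with inDoubt=true may have landed on the master while
# replication to the prole never completed.

def timed_out_write(copies):
    """Apply the write on the master only; the prole falls one behind."""
    master_gen, prole_gen = copies
    return (master_gen + 1, prole_gen)

def read(copies, replica_policy, attempt=0):
    """'master' pins every read to the master copy; 'sequence' (as
    modeled here) lets a retry land on the prole copy."""
    master_gen, prole_gen = copies
    if replica_policy == 'master':
        return master_gen
    return master_gen if attempt % 2 == 0 else prole_gen

copies = (5, 5)                    # record in sync, generation 5 everywhere
copies = timed_out_write(copies)   # inDoubt write lands on master only

print(read(copies, 'sequence', attempt=1))  # stale prole read -> 5
print(read(copies, 'master'))               # always the latest -> 6
```

Under this toy model, pinning reads to the master does hide the stale prole, which matches the mitigation asked about above (with the caveat given in the reply that follows about recluster events).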

Definitely possible in AP, especially in an unstable cluster environment that you described. Reading a stale prole isn’t the only possibility in such situations.

I assumed the opposite based on “Our cluster is known to be unstable”.

Changing from sequence to master would mitigate this particular scenario, but not if the unstable cluster also causes recluster events.

Now that I understand which type of replication you are after: there are micro-benchmarks you can enable that show how much time a write transaction spends in its various phases. In particular, see write-repl-write.
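For illustration, those per-namespace write micro-benchmarks are enabled with a dynamic config change (the namespace name below is a placeholder; check the config reference for your server version):

```shell
# Turn on the write benchmark histograms for namespace <ns-name>.
asinfo -v "set-config:context=namespace;id=<ns-name>;enable-benchmarks-write=true"
# The {<ns-name>}-write-repl-write histogram in the server log then shows
# how long writes spend waiting on replica acknowledgement.
```

A long tail in that histogram during your timeout windows would support the theory that replica writes were the part that failed to complete.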