Inconsistent data after simulated failure

I’m evaluating Aerospike and have hit an immediate roadblock. I am testing a very basic failure scenario: one server in a three-node cluster, using SSD storage, goes down. The test cluster is configured like so: https://gist.github.com/parasyte/de0905529b41f282aef2

The servers are in EC2, using VPC. The heartbeat mesh is configured so that each server connects to the other two.

In the test, I ended up in a state where repeatedly fetching the same key returned two different versions of the data, alternating between the old value and the new value (the one written during the failover).

I create three test keys in my namespace:

aql> insert into auth.test-del (PK, key, value) values ('key-1', 'key-1', 1)
OK, 1 record affected.

aql> insert into auth.test-del (PK, key, value) values ('key-2', 'key-2', 1)
OK, 1 record affected.

aql> insert into auth.test-del (PK, key, value) values ('key-3', 'key-3', 1)
OK, 1 record affected.

aql> show sets
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
| n_objects | set-enable-xdr | set-stop-write-count | ns_name | set_name   | set-delete | set-evict-hwm-count |
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
| 2         | "use-default"  | 0                    | "auth"  | "test-del" | "false"    | 0                   |
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
1 row in set (0.001 secs)
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
| n_objects | set-enable-xdr | set-stop-write-count | ns_name | set_name   | set-delete | set-evict-hwm-count |
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
| 1         | "use-default"  | 0                    | "auth"  | "test-del" | "false"    | 0                   |
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
1 row in set (0.000 secs)
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
| n_objects | set-enable-xdr | set-stop-write-count | ns_name | set_name   | set-delete | set-evict-hwm-count |
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
| 3         | "use-default"  | 0                    | "auth"  | "test-del" | "false"    | 0                   |
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
1 row in set (0.000 secs)
OK

The third server is the master for 2 of the keys (which I can see using asmonitor -e 'info auth'). So I stop Aerospike on the third server and update two of the keys:

aql> insert into auth.test-del (PK, key, value) values ('key-1', 'key-1', 0)
OK, 1 record affected.

aql> insert into auth.test-del (PK, key, value) values ('key-2', 'key-2', 0)
OK, 1 record affected.

These keys should now have a value of 0.

aql> select * from auth.test-del where PK='key-1'
+---------+-------+
| key     | value |
+---------+-------+
| "key-1" | 0     |
+---------+-------+
1 row in set (0.001 secs)

aql> select * from auth.test-del where PK='key-2'
+---------+-------+
| key     | value |
+---------+-------+
| "key-2" | 0     |
+---------+-------+
1 row in set (0.001 secs)

aql> select * from auth.test-del where PK='key-3'
+---------+-------+
| key     | value |
+---------+-------+
| "key-3" | 1     |
+---------+-------+
1 row in set (0.000 secs)

Looks good. Now let’s bring that server back online, and read these keys while it does its migration dance.

aql> select * from auth.test-del where PK='key-1'
+---------+-------+
| key     | value |
+---------+-------+
| "key-1" | 0     |
+---------+-------+
1 row in set (0.000 secs)

aql> select * from auth.test-del where PK='key-1'
+---------+-------+
| key     | value |
+---------+-------+
| "key-1" | 1     |
+---------+-------+
1 row in set (0.000 secs)

aql> select * from auth.test-del where PK='key-1'
+---------+-------+
| key     | value |
+---------+-------+
| "key-1" | 0     |
+---------+-------+
1 row in set (0.001 secs)

aql> select * from auth.test-del where PK='key-1'
+---------+-------+
| key     | value |
+---------+-------+
| "key-1" | 1     |
+---------+-------+
1 row in set (0.000 secs)

… etc.

This happened on my first test attempt, but I haven’t been able to reproduce it a second time. I did not notice anything relevant in the logs. Any ideas would be helpful.
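
A polling loop along these lines (a rough sketch using the Aerospike Python client, with hypothetical host addresses for the three test servers) should catch the alternation if it happens again:

import time
import aerospike

# Hypothetical seed addresses for the three test servers.
config = {'hosts': [('10.0.0.1', 3000), ('10.0.0.2', 3000), ('10.0.0.3', 3000)]}
client = aerospike.client(config).connect()

key = ('auth', 'test-del', 'key-1')
last = None

# Read the same key repeatedly and report whenever the returned value changes.
for _ in range(1000):
    _, meta, bins = client.get(key)
    if last is not None and bins['value'] != last:
        print('value flipped: %r -> %r (gen %r)' % (last, bins['value'], meta['gen']))
    last = bins['value']
    time.sleep(0.1)

client.close()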

Hi Jay,

Thanks for reaching out on our support forum. Let us take a look and get back to you!

–meher

Hi Jay,

Take a look at the transaction-repeatable-reads parameter. Basically, when a node rejoins the cluster with data, the cluster enters an eventually consistent mode until record redistribution (migrations) completes. In your example above, you should have noticed that once migrations completed, any instance of old data residing on the returning node was corrected.
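
One way to confirm that migrations have finished before trusting reads is to poll the migration statistics. Here is a rough sketch with the Python client (hypothetical seed address; the exact statistic names vary by server version, so this just scans for migrate-related counters):

import time
import aerospike

# Hypothetical seed address; adjust for your cluster.
client = aerospike.client({'hosts': [('10.0.0.1', 3000)]}).connect()

def migrate_counters():
    # The 'statistics' info command returns semicolon-separated name=value pairs per node.
    counters = {}
    for node, (err, resp) in client.info_all('statistics').items():
        if err or not resp:
            continue
        body = resp.split('\t', 1)[-1].strip()
        for pair in body.split(';'):
            name, _, value = pair.partition('=')
            if 'migrate' in name and ('remaining' in name or 'progress' in name):
                counters.setdefault(node, {})[name] = value
    return counters

# Wait until every migrate-related counter on every node reports zero.
while any(int(v) != 0 for node in migrate_counters().values() for v in node.values()):
    time.sleep(1)

print('migrations complete')
client.close()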

If you had attempted to update the record using the option to fail if the generation doesn’t match the generation you read, you would also have seen the transaction fail, because the cluster sees the newer record. The write actually causes the node to immediately sync that record, so the subsequent read-modify-write cycle would succeed. Additionally, new writes remain immediately consistent.
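
As a sketch of that read-modify-write pattern with the Python client (hypothetical seed address and bin values; the key pieces are the generation returned by the read and the POLICY_GEN_EQ write policy):

import aerospike
from aerospike import exception as ex

# Hypothetical seed address; adjust for your cluster.
client = aerospike.client({'hosts': [('10.0.0.1', 3000)]}).connect()
key = ('auth', 'test-del', 'key-1')

# Read the record and remember the generation we saw.
_, meta, bins = client.get(key)

try:
    # Write back only if the record generation still matches what we read.
    client.put(
        key,
        {'key': 'key-1', 'value': bins['value'] + 1},
        meta={'gen': meta['gen']},
        policy={'gen': aerospike.POLICY_GEN_EQ},
    )
except ex.RecordGenerationError:
    # The cluster holds a newer copy than the one we read; re-read and retry.
    pass

client.close()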

The reason transaction-repeatable-reads is off by default is that it adds significant overhead to transactions while nodes leave and rejoin the cluster (such as during a rolling upgrade).

Thanks for reaching out

Thank you Kevin and Meher,

If I understand correctly, leaving the transaction-repeatable-reads parameter off allows the cluster to serve potentially stale data by relaxing read consistency during a migration.

The probability of a replacement write in my application during a failover scenario is fairly low, but it’s still a case we need to consider. Likewise, it would be equally improbable for another replacement write to occur after the partition heals. I would be using Aerospike in a mostly write-once-read-many pattern: data can be replaced, and occasionally is, but far less often than it is read.

What I experienced during the test was receiving two different values when reading the same key. As I recall, this was after the migration had completed (based on watching the numbers in AMC), but I could be wrong. Unfortunately, I don’t have any more information than that. I have already restored consistency by writing a new value, just as you suggested, and now I can’t get the database into that state again.