I’m evaluating Aerospike, and have hit an immediate roadblock. I am testing a very basic failure scenario where one server goes down in a cluster of three using SSD storage. The test cluster is configured like so: https://gist.github.com/parasyte/de0905529b41f282aef2
The servers are in EC2, using VPC. The heartbeat mesh is configured so that each server connects to the other two.
In the test, I ended up with a result where subsequently fetching the same key provided me with two different versions of the data, alternating between the old value and the new value (written during failover).
I create three test keys in my namespace:
aql> insert into auth.test-del (PK, key, value) values ('key-1', 'key-1', 1)
OK, 1 record affected.
aql> insert into auth.test-del (PK, key, value) values ('key-2', 'key-2', 1)
OK, 1 record affected.
aql> insert into auth.test-del (PK, key, value) values ('key-3', 'key-3', 1)
OK, 1 record affected.
aql> show sets
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
| n_objects | set-enable-xdr | set-stop-write-count | ns_name | set_name | set-delete | set-evict-hwm-count |
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
| 2 | "use-default" | 0 | "auth" | "test-del" | "false" | 0 |
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
1 row in set (0.001 secs)
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
| n_objects | set-enable-xdr | set-stop-write-count | ns_name | set_name | set-delete | set-evict-hwm-count |
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
| 1 | "use-default" | 0 | "auth" | "test-del" | "false" | 0 |
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
1 row in set (0.000 secs)
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
| n_objects | set-enable-xdr | set-stop-write-count | ns_name | set_name | set-delete | set-evict-hwm-count |
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
| 3 | "use-default" | 0 | "auth" | "test-del" | "false" | 0 |
+-----------+----------------+----------------------+---------+------------+------------+---------------------+
1 row in set (0.000 secs)
OK
The third server is a master of 2 keys (which I can see using asmonitor -e 'info auth'
) So I stop aerospike on the third server, and update two of the keys:
aql> insert into auth.test-del (PK, key, value) values ('key-1', 'key-1', 0)
OK, 1 record affected.
aql> insert into auth.test-del (PK, key, value) values ('key-2', 'key-2', 0)
OK, 1 record affected.
These keys should now have a value of 0.
aql> select * from auth.test-del where PK='key-1'
+---------+-------+
| key | value |
+---------+-------+
| "key-1" | 0 |
+---------+-------+
1 row in set (0.001 secs)
aql> select * from auth.test-del where PK='key-2'
+---------+-------+
| key | value |
+---------+-------+
| "key-2" | 0 |
+---------+-------+
1 row in set (0.001 secs)
aql> select * from auth.test-del where PK='key-3'
+---------+-------+
| key | value |
+---------+-------+
| "key-3" | 1 |
+---------+-------+
1 row in set (0.000 secs)
Looks good. Now let’s bring that server back online, and read these keys while it does its migration dance.
aql> select * from auth.test-del where PK='key-1'
+---------+-------+
| key | value |
+---------+-------+
| "key-1" | 0 |
+---------+-------+
1 row in set (0.000 secs)
aql> select * from auth.test-del where PK='key-1'
+---------+-------+
| key | value |
+---------+-------+
| "key-1" | 1 |
+---------+-------+
1 row in set (0.000 secs)
aql> select * from auth.test-del where PK='key-1'
+---------+-------+
| key | value |
+---------+-------+
| "key-1" | 0 |
+---------+-------+
1 row in set (0.001 secs)
aql> select * from auth.test-del where PK='key-1'
+---------+-------+
| key | value |
+---------+-------+
| "key-1" | 1 |
+---------+-------+
1 row in set (0.000 secs)
… etc.
This happened on the first test attempt. But I haven’t been able to reproduce it a second time. Any ideas would be helpful. I did not notice anything relevant in the logs.