Strong consistency in community edition

Hi All,

I understand that strong consistency is an Enterprise feature. I did some consistency-related experiments with Community Edition (CE). I am working with Aerospike server version 4.2.0.10. It's a two-node setup with the policies below, and the replication factor is set to 2.

    cfg.policies.read.replica = AS_POLICY_REPLICA_SEQUENCE;
    cfg.policies.write.replica = AS_POLICY_REPLICA_SEQUENCE;
    cfg.policies.write.commit_level = AS_POLICY_COMMIT_LEVEL_ALL;

    cfg.policies.write.base.total_timeout = 0;
    cfg.policies.write.base.max_retries = 10;
    cfg.policies.write.base.sleep_between_retries = 500;

    cfg.policies.read.base.total_timeout = 0;
    cfg.policies.read.base.max_retries = 10;
    cfg.policies.read.base.sleep_between_retries = 300;
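
For context, this is roughly how those policies are applied when the client is initialized (a minimal sketch; the seed host and port are placeholders):

    #include <stdbool.h>
    #include <aerospike/aerospike.h>
    #include <aerospike/as_config.h>
    #include <aerospike/as_error.h>
    #include <aerospike/as_policy.h>

    /* Connect a client using the policies listed above. */
    static bool connect_client(aerospike* as)
    {
        as_config cfg;
        as_config_init(&cfg);                         /* library defaults first */
        as_config_add_host(&cfg, "127.0.0.1", 3000);  /* placeholder seed node */

        cfg.policies.read.replica  = AS_POLICY_REPLICA_SEQUENCE;
        cfg.policies.write.replica = AS_POLICY_REPLICA_SEQUENCE;
        cfg.policies.write.commit_level = AS_POLICY_COMMIT_LEVEL_ALL;

        cfg.policies.write.base.total_timeout = 0;
        cfg.policies.write.base.max_retries = 10;
        cfg.policies.write.base.sleep_between_retries = 500;

        cfg.policies.read.base.total_timeout = 0;
        cfg.policies.read.base.max_retries = 10;
        cfg.policies.read.base.sleep_between_retries = 300;

        aerospike_init(as, &cfg);

        as_error err;
        return aerospike_connect(as, &err) == AEROSPIKE_OK;
    }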


I have the following queries.

  1. According to my observation, replica writes are happening and the client is acknowledged only after the write completes on the replica node. How is this handled differently in Enterprise when it comes to writes on replica nodes? (See the sketch after the log excerpts below.)

Node1 / bcc060012ac4202:

    Oct 24 2019 07:39:22 GMT: WARNING (rw): (write.c:262) : transaction replication started as_write_start
    Oct 24 2019 07:39:22 GMT: WARNING (rw): (write.c:300)  inside start_write_repl_write
    Oct 24 2019 07:39:22 GMT: WARNING (rw): (rw_utils.c:86) send_rw_messages  abt to send rw msg to node over fabric bcc050012ac4202

Node2 / bcc050012ac4202:

    Oct 24 2019 07:39:22 GMT: WARNING (rw): (rw_request_hash.c:412) : abt to handle replica writes rw_msg_cb
    Oct 24 2019 07:39:22 GMT: WARNING (rw): (replica_write.c:298)  abt to write record in repl_write_handle_op

Node1 / bcc060012ac4202:

    Oct 24 2019 07:39:22 GMT: WARNING (rw): (rw_request_hash.c:416) : abt to handle replica writes ack rw_msg_cb
    Oct 24 2019 07:39:22 GMT: WARNING (rw): (replica_write.c:311) repl_write_handle_ack  repl-write ack: from node bcc050012ac4202
    Oct 24 2019 07:39:22 GMT: WARNING (rw): (replica_write.c:413) repl_write_handle_ack  abt to do cb as we hrd from node bcc050012ac4202
    Oct 24 2019 07:39:22 GMT: WARNING (rw): (write.c:409)  inside should be marked complete for all write_repl_write_cb
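
For reference, the writes behind these logs are plain single-record puts along the lines of the sketch below (namespace, set, and bin names are placeholders); with AS_POLICY_COMMIT_LEVEL_ALL the call returns only after the replica ack shown above:

    #include <aerospike/aerospike.h>
    #include <aerospike/aerospike_key.h>
    #include <aerospike/as_key.h>
    #include <aerospike/as_record.h>

    /* Single-record write; with commit_level ALL, aerospike_key_put() only
     * returns AEROSPIKE_OK after the replica ack traced in the logs above. */
    static as_status write_one(aerospike* as, int64_t id, const char* value)
    {
        as_key key;
        as_key_init_int64(&key, "test", "demo", id);  /* placeholder ns/set */

        as_record rec;
        as_record_inita(&rec, 1);
        as_record_set_str(&rec, "val", value);        /* placeholder bin name */

        as_error err;
        /* NULL policy -> uses cfg.policies.write set at connect time */
        as_status rc = aerospike_key_put(as, &err, NULL, &key, &rec);

        as_record_destroy(&rec);
        return rc;
    }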


2. Experiment:

 1. Write 'a' as the value in 100K records.
 2. Bring the master node down only after the above write is complete.
 3. Overwrite the above 100K records with 'b'.
 4. Bring the master node up and immediately (before migrations complete) read the above 100K records (see the sketch below).
 5. As expected, the latest data ('b') got served.
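
Roughly, steps 1, 3, and 4 look like the sketch below, reusing write_one() from the earlier sketch (the key range and bin name are placeholders):

    #include <stdint.h>
    #include <aerospike/aerospike_key.h>

    /* Steps 1 and 3: overwrite 100K records with a given value. */
    static void overwrite_all(aerospike* as, const char* value)
    {
        for (int64_t id = 0; id < 100000; id++) {
            if (write_one(as, id, value) != AEROSPIKE_OK) {
                /* with max_retries = 10 a failure here should be rare */
            }
        }
    }

    /* Step 4: read the records back while migrations are still running. */
    static void read_all(aerospike* as)
    {
        for (int64_t id = 0; id < 100000; id++) {
            as_key key;
            as_key_init_int64(&key, "test", "demo", id);

            as_error err;
            as_record* rec = NULL;
            if (aerospike_key_get(as, &err, NULL, &key, &rec) == AEROSPIKE_OK) {
                const char* v = as_record_get_str(rec, "val");  /* "b" in step 5 */
                (void)v;
                as_record_destroy(rec);
            }
        }
    }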

From a consistency point of view, how is this handled differently in Enterprise when it comes to serving reads while migrations (in the context of partition re-balancing) are going on?



3. Are the above-mentioned policies enough to achieve strong consistency (for all replica commits)?


Thanks,
Mahaveer

Now try creating a network partition such that a client can talk to either server node but the server nodes cannot communicate.

Start two clients, each seeded with a different node. From one client write ‘c’ to all records and from the other write ‘d’.

Fix the network partition and read all records.

In SC, these writes will fail since there aren't enough nodes to achieve the desired replication factor. When you read, you would see 'b'.

In AP, these writes will succeed. When you read, you would see a mix of 'c' and 'd'.
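
A rough sketch of that test, with placeholder hostnames, namespace/set/bin names, and a single record for brevity:

    #include <aerospike/aerospike.h>
    #include <aerospike/aerospike_key.h>
    #include <aerospike/as_config.h>

    /* One client seeded from one node overwrites a record with a value;
     * run once against each node while the nodes cannot see each other. */
    static void write_via(const char* seed_host, const char* value)
    {
        as_config cfg;
        as_config_init(&cfg);
        as_config_add_host(&cfg, seed_host, 3000);
        cfg.policies.write.replica = AS_POLICY_REPLICA_SEQUENCE;
        cfg.policies.write.commit_level = AS_POLICY_COMMIT_LEVEL_ALL;

        aerospike as;
        aerospike_init(&as, &cfg);

        as_error err;
        if (aerospike_connect(&as, &err) == AEROSPIKE_OK) {
            as_key key;
            as_key_init_int64(&key, "test", "demo", 1);  /* one record for brevity */

            as_record rec;
            as_record_inita(&rec, 1);
            as_record_set_str(&rec, "val", value);

            aerospike_key_put(&as, &err, NULL, &key, &rec);
            as_record_destroy(&rec);
            aerospike_close(&as, &err);
        }
        aerospike_destroy(&as);
    }

During the partition, run write_via("node1.example", "c") against one node and write_via("node2.example", "d") against the other, then heal the partition and read the record back.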

Thanks, Kevin, for the prompt response.

Just to set the context again, I am using a two-node cluster with Community Edition.

In the absence of the commit-to-device configuration option, the replica node commits only in memory before acknowledging success to the client node. In this case, if the replica node doesn't have enough memory to commit the transaction, would it be communicated as a failure to the client?

Replica partitions are allowed to ignore memory limits defined in the configuration (I assume you are referring to stop_writes_pct). If the node is really out of RAM and a transaction attempts to allocate more, then that node is going to crash.
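
For what it's worth, anything the cluster does report back surfaces through the write call's status code and as_error, along these lines (a minimal sketch; names as in the earlier sketches are placeholders):

    #include <stdio.h>
    #include <aerospike/aerospike_key.h>

    /* Check and log whatever error the cluster reports for a write. */
    static void put_with_check(aerospike* as, const as_key* key, as_record* rec)
    {
        as_error err;
        as_status rc = aerospike_key_put(as, &err, NULL, key, rec);

        if (rc != AEROSPIKE_OK) {
            /* e.g. AEROSPIKE_ERR_TIMEOUT if the node has gone away */
            fprintf(stderr, "write failed: %d %s\n", err.code, err.message);
        }
    }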

BTW, eviction and expiration are problematic on two-node clusters prior to 4.5.1. The algorithm was changed in 4.5.1, which improves the performance of these processes as well as eliminates the strange behavior that occurs in two-node clusters. You should upgrade soon; if you cannot do so soon, you may want to add a third node to prevent these issues.

Basically, the old algorithm on a two-node cluster would result in one node evicting all of its master partitions.