With Aerospike 4.0 through 4.2, a brief period of “stop writes” seen on node addition ( including restart ) with
Strong Consistency enabled is sometimes observed.
Strong consistency namespaces on all or some nodes in the cluster might briefly stop writes when Aerospike service on node(s) restarts or when new node(s) are added. Not every node addition on an affected cluster will see the symptom, and smaller clusters will see the symptom more frequently.
This effect is transient, and, within a few seconds the affected nodes will start accepting writes. The client will see an “error 22” (deterministic can’t write) during the affected period.
The logs will indicate that writes are stopped because the clock skew exceeds the tolerable skew (20 seconds with default configuration).
No potentially consistent writes will be accepted, so data is not lost. Some clusters, which have not observed a large transient data skew in the past, will never see this behavior.
This effect happens when the server goes through a large time discontinuity. This should not normally happen if the servers (hosts and VMs if using virtualization) have their clocks and timezones configured properly and in sync via NTP.
Aerospike uses a Hybrid Logical Clock (HLC) for clustering and skew detection. Nodes exchange HLC and physical clocks via heartbeat messages. The HLC value at a node is the maximum value of the physical clock a node has ever seen. The algorithm at a node computes the difference between the nodes’ physical clock from the corresponding HLC clock.
Large HLC to physical clock differences are not bad by themselves. Skew is the difference between HCL values in a cluster, not the difference between physical and HCL, so HLC to local values can be large, as long as they are very similar. If they are similar, it is said there is no skew in the cluster.
HCL must also never move backward. For this reason, if the cluster ever accepts large local clock values, the difference between HCL and physical will persist indefinitely.
However, in 4.0 through 4.2, there is a problem which occurs when the HLC to physical clock difference is large when using strong consistency.
When a node restarts or a new node is added, its HLC is initialized to the local physical clock value. With this initial condition, this node’s difference between HLC and physical clock is zero. When this node broadcasts its first heartbeat, the recipient nodes see that this node is not same distance away from its HLC as the rest of the cluster and flag this as skew - disallowing writes. However once the new node receives its first heartbeat from other nodes in the cluster, it syncs its HLC to the rest of the nodes. The node in question eventually updates its HLC, move to a state of no skew, writes are enabled.
For the HLC to skew from the logical clock so drastically, most likely some cluster nodes had a large forward jump in their physical clocks in the past. This causes the internal Hybrid Logical Clock (HLC) to jump far ahead of the physical clocks. Once the skew in the cluster is fixed the physical clocks will be in sync, but the HLC can’t jump “backward” to be more similar to the physical clock.
- Follow the “detection” steps below to make certain this cluster has been affected by this known problem.
- Shut down the entire cluster momentarily to clear the clock state.
Aerospike will be issuing an update to remove the “synchronization error” that occurs during this period.
If we do see the symptom of failed writes during node addition or restart, we can determine whether this root cause is active. The
dump-skew info call detailed below on all nodes will detect a large HLC / physical clock difference.
asinfo -v "dump-skew:verbose=true"
This will dump the skew monitor’s state to Aerospike logs similar to the log lines below
Jul 13 2018 17:42:10 GMT: INFO (skew): (skew_monitor.c:585) CSM: cluster-clock-skew:0 Jul 13 2018 17:42:10 GMT: INFO (skew): (skew_monitor.c:596) CSM: node:bb9040012ac4202 hlc-delta:16999999 Jul 13 2018 17:42:10 GMT: INFO (skew): (skew_monitor.c:596) CSM: node:bb9030012ac4202 hlc-delta:16999999 Jul 13 2018 17:42:10 GMT: INFO (skew): (skew_monitor.c:596) CSM: node:bb9020012ac4202 hlc-delta:16999999
When run against a given node, the command will print the skew between itself and all other nodes in the cluster to the log. You must go to the logs of the servers in question to receive the results.
Here a large value (> tolerable skew ~20 seconds) is observed for hlc-delta, and small value in the initial line of
cluster-clock-skew indicates the possible presence of this root cause. The clock is detected not to have skewed, but to have large HLC to physical difference.
STRONG CONSISTENCY CLOCK SKEW ERROR 22 STOP WRITES