Error 22 on node addition with strong consistency


#1

Synopsis:

With Aerospike 4.0 through 4.2, a brief period of “stop writes” seen on node addition ( including restart ) with Strong Consistency enabled is sometimes observed.

Symptoms:

Strong consistency namespaces on all or some nodes in the cluster might briefly stop writes when Aerospike service on node(s) restarts or when new node(s) are added. Not every node addition on an affected cluster will see the symptom, and smaller clusters will see the symptom more frequently.

This effect is transient, and, within a few seconds the affected nodes will start accepting writes. The client will see an “error 22” (deterministic can’t write) during the affected period.

The logs will indicate that writes are stopped because the clock skew exceeds the tolerable skew (20 seconds with default configuration).

No potentially consistent writes will be accepted, so data is not lost. Some clusters, which have not observed a large transient data skew in the past, will never see this behavior.

Cause

This effect happens when the server goes through a large time discontinuity. This should not normally happen if the servers (hosts and VMs if using virtualization) have their clocks and timezones configured properly and in sync via NTP.

Aerospike uses a Hybrid Logical Clock (HLC) for clustering and skew detection. Nodes exchange HLC and physical clocks via heartbeat messages. The HLC value at a node is the maximum value of the physical clock a node has ever seen. The algorithm at a node computes the difference between the nodes’ physical clock from the corresponding HLC clock.

Large HLC to physical clock differences are not bad by themselves. Skew is the difference between HCL values in a cluster, not the difference between physical and HCL, so HLC to local values can be large, as long as they are very similar. If they are similar, it is said there is no skew in the cluster.

HCL must also never move backward. For this reason, if the cluster ever accepts large local clock values, the difference between HCL and physical will persist indefinitely.

However, in 4.0 through 4.2, there is a problem which occurs when the HLC to physical clock difference is large when using strong consistency.

When a node restarts or a new node is added, its HLC is initialized to the local physical clock value. With this initial condition, this node’s difference between HLC and physical clock is zero. When this node broadcasts its first heartbeat, the recipient nodes see that this node is not same distance away from its HLC as the rest of the cluster and flag this as skew - disallowing writes. However once the new node receives its first heartbeat from other nodes in the cluster, it syncs its HLC to the rest of the nodes. The node in question eventually updates its HLC, move to a state of no skew, writes are enabled.

For the HLC to skew from the logical clock so drastically, most likely some cluster nodes had a large forward jump in their physical clocks in the past. This causes the internal Hybrid Logical Clock (HLC) to jump far ahead of the physical clocks. Once the skew in the cluster is fixed the physical clocks will be in sync, but the HLC can’t jump “backward” to be more similar to the physical clock.

Mitigation:

  1. Follow the “detection” steps below to make certain this cluster has been affected by this known problem.
  2. Shut down the entire cluster momentarily to clear the clock state.

Aerospike will be issuing an update to remove the “synchronization error” that occurs during this period.

Detection:

If we do see the symptom of failed writes during node addition or restart, we can determine whether this root cause is active. The dump-skew info call detailed below on all nodes will detect a large HLC / physical clock difference.

asinfo -v "dump-skew:verbose=true"

This will dump the skew monitor’s state to Aerospike logs similar to the log lines below

Jul 13 2018 17:42:10 GMT: INFO (skew): (skew_monitor.c:585) CSM: cluster-clock-skew:0
Jul 13 2018 17:42:10 GMT: INFO (skew): (skew_monitor.c:596) CSM:	node:bb9040012ac4202 hlc-delta:16999999
Jul 13 2018 17:42:10 GMT: INFO (skew): (skew_monitor.c:596) CSM:	node:bb9030012ac4202 hlc-delta:16999999
Jul 13 2018 17:42:10 GMT: INFO (skew): (skew_monitor.c:596) CSM:	node:bb9020012ac4202 hlc-delta:16999999

When run against a given node, the command will print the skew between itself and all other nodes in the cluster to the log. You must go to the logs of the servers in question to receive the results.

Here a large value (> tolerable skew ~20 seconds) is observed for hlc-delta, and small value in the initial line of cluster-clock-skew indicates the possible presence of this root cause. The clock is detected not to have skewed, but to have large HLC to physical difference.

Keywords

STRONG CONSISTENCY CLOCK SKEW ERROR 22 STOP WRITES

Timestamp

07/23/18


#2

What is the logical clock here? Typically Hybrid Logical Clock would be a tuple of (physical clock, Lamport clock) - the Lamport clock would be the Logical Clock. So what is being exchanged via Heartbeat messages when you say:

A figure/stick diagram/sketch with some timestamps / numbers on a 3 node cluster will help explaining the clocks (HLC, Logical and Physical) and skew.


#3

I read the CE code at fabric/hlc.c and the paper at: http://www.cse.buffalo.edu/tech-reports/2014-04.pdf

HLC is a 64 bit value comprising 48 bits of physical clock and 16 bits of lamport clock. The higher 48 bits are physical clock time in ms ie 64 bits system clock << (left shifted) 16 bits. So the algorithm makes the physical time portion of the HLC closely track the highest physical time across all nodes as heartbeat messages are exchanged. (For the algorithm, see the second algorithm described in the paper above that does not let the HLC to physical clock difference grow unbounded.)

In my opinion: “For the HLC to skew from the logical clock so drastically…” should be changed to “For the HLC to skew from the local physical clock so drastically…”.

(Heartbeat messages sending both the 64 bit HLC and the local physical clock help determine skew at each node for all other nodes, thus finding differences in skew as moving averages - the largest of which is the cluster skew.)


#4

With regard to “For the HLC to skew from the logical clock so drastically…” line, yes the physical clock would work better. The intent however, in the context set in previous paragraph, was to say that the initial HLC for a restarted node would skew from the cluster wide HLC drastically if there was a large forward jump in the physical clocks in the past.

The comment in hlc.c explains that we use HLC more as a monotonically increasing identifier and to identify happens before/after relationship between events tagged with HLC timestamps. To precisely detect skew in physical clocks we do not bound the HLC, unlike the paper.

This allows us to get a clear cluster wide picture of the clocks based on the hlc values, since the HLC timestamps at cluster members will be approximately equal (within tolerance of heartbeat intervals and network trip times). Rejecting HLC updates would mean some nodes will run with drastically different HLC values.


#5

Thanks for clarifying. Also, I think, since the physical clocks are increasing by far more milliseconds between heartbeats than the “lamport” clock, even though the lamport part implementation is initially winning due to forward clock jump say on one node -pushing everyone ahead on the HLC, the physical clock should eventually catch up to the hybrid logical clock. (Assuming, eventually all nodes are synchronized to NTP server clock or their physical clocks are within 10s of ms.)