Minimal Heartbeat

Guy_Sela · February 14, 2016, 4:46pm

Hi, I am using Aerospike Server 3.7.2.

This is my heartbeat configuration:

heartbeat {
    mode mesh
    address 192.168.231.16
    port 3002 # Heartbeat port for this node.

    mesh-seed-address-port 192.168.231.18 3002

    interval 100 # Interval in milliseconds in which heartbeats are sent.
    timeout 5 # Number of missing heartbeats after which the remote node will be declared dead.
}

In my scenario I have 2 nodes, and I’m taking one of them down. I think it takes about 1.2 seconds until the other node’s “friend-list” is updated and the dead node is evicted. I suspect that during this time the heartbeat socket is hung and is released only after a socket timeout. If that is correct, what timeout is configured on the heartbeat sockets? Is it configurable?

rbotzer · February 19, 2016, 7:07pm

It depends on where you’re measuring this change. The time it takes for the client to notice that the cluster has changed depends on cluster tending, the client thread that checks the state of the cluster every second. Is that how you’re getting the information?

The heartbeat socket timeout is calculated based on your interval. It ends up being shorter than the interval.

One other thing to consider is how you’re ‘taking down the cluster’. If you’re shutting down the daemon (asd), that’s one way. If you’re simply getting in the way of the mesh heartbeat, be aware that once a node is added to the cluster, a fabric back channel is also established, and even if the heartbeats are missed the cluster still tries to talk to the node through the fabric. This may extend the time until the node is out of the cluster.

Guy_Sela · February 21, 2016, 9:40am

First of all, my client’s tend thread is running every 250ms.

The configuration of the heartbeat is what I attached:

interval 100
timeout 5

When I take down a node, it takes the client about 1.5 seconds until it is aware of that.

I used some debugging tools to identify that I get 4-5 “failed” cycles of tend, before the node sends me a friend-list without the other node.

The client’s code will not remove the bad node until the good node sends it a friend-list without the bad node. So, to sum up, I want to detect that a node went down in less than 1 second.

For some reason, a friend-list without the bad node is sent to me only after ~1.2 seconds, even though the heartbeat is configured for 500ms (100*5).
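
To make the numbers in this post easier to compare, here is a rough sketch of the arithmetic in Python. It only multiplies the values quoted in the thread (interval 100, timeout 5, a 250ms tend interval, 4-5 failed tend cycles). The function names are made up for illustration, and the "naive" estimate assumes that the server drops the node exactly at interval * timeout and that the client picks this up on its next tend, which ignores the fabric channel, socket timeouts and any other server-side delay.

```python
# Back-of-the-envelope timing only; this is not how Aerospike computes anything internally.

def heartbeat_window_ms(interval_ms: int, timeout_count: int) -> int:
    """Time covered by `timeout_count` missed heartbeats sent every `interval_ms`."""
    return interval_ms * timeout_count


def tend_cycles_ms(cycles: int, tend_interval_ms: int) -> int:
    """Wall-clock time spanned by a number of client tend cycles."""
    return cycles * tend_interval_ms


def naive_client_detection_ms(interval_ms: int, timeout_count: int,
                              tend_interval_ms: int) -> int:
    """Assumed upper bound: heartbeat window plus one full tend interval."""
    return heartbeat_window_ms(interval_ms, timeout_count) + tend_interval_ms


if __name__ == "__main__":
    interval_ms, timeout_count, tend_ms = 100, 5, 250

    print("heartbeat window:", heartbeat_window_ms(interval_ms, timeout_count), "ms")
    print("naive client detection:",
          naive_client_detection_ms(interval_ms, timeout_count, tend_ms), "ms")
    print("observed failed tend cycles:",
          tend_cycles_ms(4, tend_ms), "to", tend_cycles_ms(5, tend_ms), "ms")
```

Under these assumptions the naive estimate comes to 750ms, while 4-5 failed tend cycles span 1000-1250ms, which matches the ~1.2 seconds observed and shows the size of the unexplained gap.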
