Node dropping out of cluster without AMC acknowledging it

I’m running 2 OpenNebula VMs with CentOS.

I’ve installed Community Edition (CE) 3.5.3 on both nodes, and AMC and the benchmarks on the 1st node. Edit: is CE restricted to a single node? Is it something that basic?

I can add the 2nd node in AMC no problem. After adding it, I can run run_benchmarks and get confirmation that both nodes are reachable:

2015-03-02 20:14:16.410 INFO Thread 1 Add node BB98405080A0002 127.0.0.1:3000
2015-03-02 20:14:16.425 INFO Thread 1 Add node BB98705080A0002 10.8.5.135:3000
2015-03-02 20:14:16.500 write(tps=16 timeouts=0 errors=0) read(tps=85 timeouts=0 errors=0) total(tps=101 timeouts=0 err

But soon after, the 2nd node doesn’t get added any more:

2015-03-02 21:08:06.728 INFO Thread 1 Add node BB98405080A0002 127.0.0.1:3000
2015-03-02 21:08:06.786 write(tps=29 timeouts=0 errors=0) read(tps=123 timeouts=0 errors=0) total(tps=152 timeouts=0 errors=0)

AMC shows both nodes as up, with green Cluster Visibility, for some time after that, but then they both show as red. Even while run_benchmarks still works, both nodes stay red.

If I restart the Aerospike service on the 2nd node (service aerospike restart), it rejoins, even in the middle of a run_benchmarks:

2015-03-02 22:06:30.988 write(tps=13364 timeouts=0 errors=0) read(tps=13442 timeouts=0 errors=0) total(tps=26806 timeouts=0 errors=0)
2015-03-02 22:06:31.914 INFO Thread 8 Add node BB98705080A0002 10.8.5.135:3000
2015-03-02 22:06:31.988 write(tps=5211 timeouts=0 errors=0) read(tps=5275 timeouts=0 errors=0) total(tps=10486 timeouts=0 errors=0)

And then the 2 nodes both go green in AMC.

How can I get more detail about why the 2nd node is dropping out? What errors should I normally expect from AMC about that event?

One more piece of evidence. I kept run_benchmarks running for a long time to see whether it would register a dropped/disconnected node. It didn’t, and AMC kept graphing throughput from both nodes throughout.

I canceled the benchmark and relaunched it. This time it didn’t add the 2nd node, and AMC’s graph of the 2nd node’s throughput dropped to zero.

2015-03-02 22:33:34.595 write(tps=3424 timeouts=0 errors=0) read(tps=3391 timeouts=0 errors=0) total(tps=6815 timeouts=0 errors=0)
2015-03-02 22:33:35.595 write(tps=3355 timeouts=0 errors=0) read(tps=3401 timeouts=0 errors=0) total(tps=6756 timeouts=0 errors=0)
2015-03-02 22:33:36.595 write(tps=3338 timeouts=0 errors=0) read(tps=3387 timeouts=0 errors=0) total(tps=6725 timeouts=0 errors=0)
2015-03-02 22:33:37.596 write(tps=3367 timeouts=0 errors=0) read(tps=3414 timeouts=0 errors=0) total(tps=6781 timeouts=0 errors=0)
2015-03-02 22:33:38.596 write(tps=3237 timeouts=0 errors=0) read(tps=3410 timeouts=0 errors=0) total(tps=6647 timeouts=0 errors=0)
2015-03-02 22:33:39.597 write(tps=3334 timeouts=0 errors=0) read(tps=3337 timeouts=0 errors=0) total(tps=6671 timeouts=0 errors=0)

[root@mmao-aerospike_ce benchmarks]# ./run_benchmarks 
Benchmark: 127.0.0.1:3000, namespace: test, set: testset, threads: 16, workload: READ_UPDATE
read: 50% (all bins: 100%, single bin: 0%), write: 50% (all bins: 100%, single bin: 0%)
keys: 100000, start key: 0, transactions: 0, bins: 1, random values: false, throughput: unlimited
read policy: timeout: 0, maxRetries: 1, sleepBetweenRetries: 500, consistencyLevel: CONSISTENCY_ONE, reportNotFound: false
write policy: timeout: 0, maxRetries: 1, sleepBetweenRetries: 500, commitLevel: COMMIT_ALL
bin[0]: integer
debug: false
2015-03-02 22:33:43.885 INFO Thread 1 Add node BB98405080A0002 127.0.0.1:3000
2015-03-02 22:33:44.111 write(tps=902 timeouts=0 errors=0) read(tps=979 timeouts=0 errors=0) total(tps=1881 timeouts=0 errors=0)
2015-03-02 22:33:45.112 write(tps=8463 timeouts=0 errors=0) read(tps=8432 timeouts=0 errors=0) total(tps=16895 timeouts=0 errors=0)
2015-03-02 22:33:46.112 write(tps=9539 timeouts=0 errors=0) read(tps=9579 timeouts=0 errors=0) total(tps=19118 timeouts=0 errors=0)

Only then did I notice that both nodes were marked red; I hadn’t checked during the initial long run while both were working.

Thanks for reaching out. CE does allow multiple nodes per cluster; there is no such limitation.

Did you configure your nodes to form a cluster across those VMs?
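If the nodes are still on the default multicast heartbeat, they may never discover each other, since multicast often doesn’t pass between VMs. As a point of reference, a minimal mesh heartbeat stanza in /etc/aerospike/aerospike.conf might look roughly like this (the address is illustrative; on each node, point mesh-seed-address-port at the other node’s IP, and check the directive names against your server version):

network {
    heartbeat {
        mode mesh                               # point-to-point TCP instead of multicast
        port 3002                               # heartbeat port on this node
        mesh-seed-address-port 10.8.5.135 3002  # the other node's IP and heartbeat port
        interval 150                            # ms between heartbeats
        timeout 10                              # missed heartbeats before a node is dropped
    }
}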

Could you share the config files for your nodes and the last lines of each node’s log file (under /var/log/aerospike/aerospike.log)?
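For example, assuming the default CentOS package layout, something along these lines run on each node would help:

tail -n 200 /var/log/aerospike/aerospike.log
grep -iE 'heartbeat|paxos|fabric' /var/log/aerospike/aerospike.log | tail -n 50

Cluster membership changes typically show up as heartbeat or paxos lines, so those are the ones to look for around the time the 2nd node goes red in AMC.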

Thanks, –meher