Node dropping out of cluster without AMC acknowledging


#1

I’m running 2 OpenNebula VMs with CentOS.

I’ve installed CE 3.5.3 on both nodes, and AMC and the benchmarks on the 1st node. edit: Is the CE restricted to a single node? Is it something that basic?

I can add the 2nd node in AMC no problem. After adding, I can run_benchmarks and get confirmation that both nodes are reachable:

2015-03-02 20:14:16.410 INFO Thread 1 Add node BB98405080A0002 127.0.0.1:3000
2015-03-02 20:14:16.425 INFO Thread 1 Add node BB98705080A0002 10.8.5.135:3000
2015-03-02 20:14:16.500 write(tps=16 timeouts=0 errors=0) read(tps=85 timeouts=0 errors=0) total(tps=101 timeouts=0 err

But soon after, the 2nd node doesn’t get added any more:

2015-03-02 21:08:06.728 INFO Thread 1 Add node BB98405080A0002 127.0.0.1:3000
2015-03-02 21:08:06.786 write(tps=29 timeouts=0 errors=0) read(tps=123 timeouts=0 errors=0) total(tps=152 timeouts=0 errors=0)

AMC shows both nodes as up, with green Cluster Visibility, for some time after that, but then they both show as red. Even while run_benchmarks still works, both nodes stay red.

If I restart the aerospike service on the 2nd node (`service aerospike restart`), it rejoins, even in the middle of a run_benchmarks:

2015-03-02 22:06:30.988 write(tps=13364 timeouts=0 errors=0) read(tps=13442 timeouts=0 errors=0) total(tps=26806 timeouts=0 errors=0)
2015-03-02 22:06:31.914 INFO Thread 8 Add node BB98705080A0002 10.8.5.135:3000
2015-03-02 22:06:31.988 write(tps=5211 timeouts=0 errors=0) read(tps=5275 timeouts=0 errors=0) total(tps=10486 timeouts=0 errors=0)

And then the 2 nodes both go green in AMC.

How can I get more detail about why the 2nd node is dropping out? What errors should I expect AMC to report for that event?


#2

One more piece of evidence. I kept a run_benchmarks running for a long time to see if it would register a dropped/disconnected node. It didn’t, and AMC kept graphing throughput from both nodes throughout.

I canceled the benchmark and relaunched it. This time, it didn’t add the 2nd node, and AMC’s graph for the 2nd node’s throughput dropped to zero.

2015-03-02 22:33:34.595 write(tps=3424 timeouts=0 errors=0) read(tps=3391 timeouts=0 errors=0) total(tps=6815 timeouts=0 errors=0)
2015-03-02 22:33:35.595 write(tps=3355 timeouts=0 errors=0) read(tps=3401 timeouts=0 errors=0) total(tps=6756 timeouts=0 errors=0)
2015-03-02 22:33:36.595 write(tps=3338 timeouts=0 errors=0) read(tps=3387 timeouts=0 errors=0) total(tps=6725 timeouts=0 errors=0)
2015-03-02 22:33:37.596 write(tps=3367 timeouts=0 errors=0) read(tps=3414 timeouts=0 errors=0) total(tps=6781 timeouts=0 errors=0)
2015-03-02 22:33:38.596 write(tps=3237 timeouts=0 errors=0) read(tps=3410 timeouts=0 errors=0) total(tps=6647 timeouts=0 errors=0)
2015-03-02 22:33:39.597 write(tps=3334 timeouts=0 errors=0) read(tps=3337 timeouts=0 errors=0) total(tps=6671 timeouts=0 errors=0)

[root@mmao-aerospike_ce benchmarks]# ./run_benchmarks 
Benchmark: 127.0.0.1:3000, namespace: test, set: testset, threads: 16, workload: READ_UPDATE
read: 50% (all bins: 100%, single bin: 0%), write: 50% (all bins: 100%, single bin: 0%)
keys: 100000, start key: 0, transactions: 0, bins: 1, random values: false, throughput: unlimited
read policy: timeout: 0, maxRetries: 1, sleepBetweenRetries: 500, consistencyLevel: CONSISTENCY_ONE, reportNotFound: false
write policy: timeout: 0, maxRetries: 1, sleepBetweenRetries: 500, commitLevel: COMMIT_ALL
bin[0]: integer
debug: false
2015-03-02 22:33:43.885 INFO Thread 1 Add node BB98405080A0002 127.0.0.1:3000
2015-03-02 22:33:44.111 write(tps=902 timeouts=0 errors=0) read(tps=979 timeouts=0 errors=0) total(tps=1881 timeouts=0 errors=0)
2015-03-02 22:33:45.112 write(tps=8463 timeouts=0 errors=0) read(tps=8432 timeouts=0 errors=0) total(tps=16895 timeouts=0 errors=0)
2015-03-02 22:33:46.112 write(tps=9539 timeouts=0 errors=0) read(tps=9579 timeouts=0 errors=0) total(tps=19118 timeouts=0 errors=0)

I only noticed then that both nodes were marked red. I hadn’t checked during the initial long run while both nodes were working.


#3

Thanks for reaching out. The CE does allow multiple nodes per cluster; there is no such limitation.

Did you configure your nodes to form a cluster across those VMs?
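If not, the usual culprit on virtualized environments is the default multicast heartbeat, which many cloud/VM networks don’t carry between hosts. A mesh (unicast) heartbeat stanza in /etc/aerospike/aerospike.conf looks roughly like this on each node (a sketch only; the seed IP here is the other node’s address from your benchmark output, and you should double-check the exact directive names against the docs for your server version):

```
network {
    heartbeat {
        mode mesh                               # unicast instead of multicast
        port 3002                               # default mesh heartbeat port
        mesh-seed-address-port 10.8.5.135 3002  # the *other* node's address
        interval 150
        timeout 10
    }
}
```

Each node should seed at least one other node’s address, and both nodes need port 3002 open to each other. Restart the service after changing the config.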

Could you share the config files for your nodes, and the last lines of each node’s log file (under /var/log/aerospike/aerospike.log)?
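Something like the following can help surface the relevant entries from the log. This is a sketch: the sample log lines are fabricated placeholders purely so the snippet is self-contained, but the grep targets the contexts where node-departure evidence usually lands on a real node.

```shell
# On a real node, point LOG at the server log instead:
#   LOG=/var/log/aerospike/aerospike.log
# Here we fabricate a tiny stand-in log (hypothetical lines, in the
# general shape of asd output) so the snippet is self-contained.
LOG=./aerospike-demo.log
printf '%s\n' \
  'Mar 02 2015 21:07:55 GMT: INFO (hb): connecting to mesh host' \
  'Mar 02 2015 21:08:01 GMT: WARNING (hb): heartbeat timeout for node bb98705080a0002' \
  > "$LOG"

# Pull the most recent heartbeat / clustering entries; the (hb),
# (paxos) and (fabric) contexts cover node arrival and departure.
grep -Ei '\(hb\)|\(paxos\)|\(fabric\)' "$LOG" | tail -n 20
```

Anything around the timestamps when AMC turned red would be especially useful.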

Thanks, –meher