Aerospike Server 3.7.2 Enterprise, Java Client 3.1.8
I have 3 Aerospike nodes.
After about 3 days of uptime, the cluster reaches this state:
All 3 nodes see each other as up and valid.
The Aerospike node on 10.0.57.237 has disconnected from all of its clients, and the clients can't reconnect to it, even though they keep trying.
The Aerospike node on 10.0.57.238 is a valid node; I am including logs for it as a comparison.
[root@centos-devel-rnd-1 ~]# nc 10.0.57.237 3000 // WE SEE ONLY OUTGOING TRAFFIC AND NO INCOMING TRAFFIC
Tcpdump output:
18:46:44.774801 IP centos-devel-rnd-1.contextream.local.47550 > 10.0.57.237.hbci: Flags [S], seq 2211837179, win 14600, options [mss 1460,sackOK,TS val 2315045493 ecr 0,nop,wscale 7], length 0
18:46:45.774099 IP centos-devel-rnd-1.contextream.local.47550 > 10.0.57.237.hbci: Flags [S], seq 2211837179, win 14600, options [mss 1460,sackOK,TS val 2315046493 ecr 0,nop,wscale 7], length 0
18:46:47.774093 IP centos-devel-rnd-1.contextream.local.47550 > 10.0.57.237.hbci: Flags [S], seq 2211837179, win 14600, options [mss 1460,sackOK,TS val 2315048493 ecr 0,nop,wscale 7], length 0
18:46:51.774087 IP centos-devel-rnd-1.contextream.local.47550 > 10.0.57.237.hbci: Flags [S], seq 2211837179, win 14600, options [mss 1460,sackOK,TS val 2315052493 ecr 0,nop,wscale 7], length 0
18:46:59.774124 IP centos-devel-rnd-1.contextream.local.47550 > 10.0.57.237.hbci: Flags [S], seq 2211837179, win 14600, options [mss 1460,sackOK,TS val 2315060493 ecr 0,nop,wscale 7], length 0
18:47:15.774112 IP centos-devel-rnd-1.contextream.local.47550 > 10.0.57.237.hbci: Flags [S], seq 2211837179, win 14600, options [mss 1460,sackOK,TS val 2315076493 ecr 0,nop,wscale 7], length 0
[root@centos-devel-rnd-1 ~]# nc 10.0.57.237 22 // SSH TRAFFIC IS BIDIRECTIONAL, SHOWING THE MACHINE ITSELF CAN STILL ACCEPT CONNECTIONS
Tcpdump output:
18:47:56.518008 IP centos-devel-rnd-1.contextream.local.52800 > 10.0.57.237.ssh: Flags [S], seq 1260813098, win 14600, options [mss 1460,sackOK,TS val 2315117236 ecr 0,nop,wscale 7], length 0
18:47:56.518610 IP 10.0.57.237.ssh > centos-devel-rnd-1.contextream.local.52800: Flags [S.], seq 1423742651, ack 1260813099, win 13480, options [mss 1360,sackOK,TS val 375354279 ecr 2315117236,nop,wscale 9], length 0
18:47:56.518650 IP centos-devel-rnd-1.contextream.local.52800 > 10.0.57.237.ssh: Flags [.], ack 1, win 115, options [nop,nop,TS val 2315117237 ecr 375354279], length 0
18:47:56.524533 IP 10.0.57.237.ssh > centos-devel-rnd-1.contextream.local.52800: Flags [P.], seq 1:22, ack 1, win 27, options [nop,nop,TS val 375354285 ecr 2315117237], length 21
18:47:56.524557 IP centos-devel-rnd-1.contextream.local.52800 > 10.0.57.237.ssh: Flags [.], ack 22, win 115, options [nop,nop,TS val 2315117243 ecr 375354285], length 0
[root@centos-devel-rnd-1 ~]# nc 10.0.57.238 3000 // PORT 3000 ON THE VALID NODE SHOWS BIDIRECTIONAL TRAFFIC
Tcpdump output:
18:48:50.356771 IP centos-devel-rnd-1.contextream.local.35639 > 10.0.57.238.hbci: Flags [S], seq 1277805552, win 14600, options [mss 1460,sackOK,TS val 2315171075 ecr 0,nop,wscale 7], length 0
18:48:50.357411 IP 10.0.57.238.hbci > centos-devel-rnd-1.contextream.local.35639: Flags [S.], seq 4081678047, ack 1277805553, win 13480, options [mss 1360,sackOK,TS val 375405293 ecr 2315171075,nop,wscale 9], length 0
18:48:50.357478 IP centos-devel-rnd-1.contextream.local.35639 > 10.0.57.238.hbci: Flags [.], ack 1, win 115, options [nop,nop,TS val 2315171076 ecr 375405293], length 0
18:49:50.955167 IP 10.0.57.238.hbci > centos-devel-rnd-1.contextream.local.35639: Flags [F.], seq 1, ack 1, win 27, options [nop,nop,TS val 375465890 ecr 2315171076], length 0
18:49:50.955531 IP centos-devel-rnd-1.contextream.local.35639 > 10.0.57.238.hbci: Flags [F.], seq 1, ack 2, win 115, options [nop,nop,TS val 2315231674 ecr 375465890], length 0
18:49:50.956120 IP 10.0.57.238.hbci > centos-devel-rnd-1.contextream.local.35639: Flags [.], ack 2, win 27, options [nop,nop,TS val 375465891 ecr 2315231674], length 0
[root@ODL1 ~]# lsof -i :3000 // OPEN FILES ON PORT 3000 IN THE BAD MACHINE
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
asd 32744 root 62u IPv4 59748 0t0 TCP *:hbci (LISTEN)
asd 32744 root 68u IPv4 4981013 0t0 TCP ODL1:hbci->10.5.5.123:46382 (CLOSE_WAIT)
asd 32744 root 98u IPv4 11191866 0t0 TCP ODL1:hbci->10.5.5.122:41916 (CLOSE_WAIT)
asd 32744 root 101u IPv4 57014482 0t0 TCP ODL1:hbci->10.5.5.122:47405 (CLOSE_WAIT)
asd 32744 root 118u IPv4 57018263 0t0 TCP ODL1:hbci->10.5.5.102:60454 (CLOSE_WAIT)
asd 32744 root 119u IPv4 57018304 0t0 TCP ODL1:hbci->10.5.5.122:47406 (CLOSE_WAIT)
asd 32744 root 120u IPv4 57020234 0t0 TCP ODL1:hbci->10.5.5.122:47407 (CLOSE_WAIT)
asd 32744 root 158u IPv4 35908302 0t0 TCP ODL1:hbci->10.5.5.122:44478 (CLOSE_WAIT)
asd 32744 root 162u IPv4 4981761 0t0 TCP ODL1:hbci->10.5.5.122:41281 (CLOSE_WAIT)
asd 32744 root 163u IPv4 4981715 0t0 TCP ODL1:hbci->10.5.5.122:41278 (CLOSE_WAIT)
asd 32744 root 164u IPv4 4981763 0t0 TCP ODL1:hbci->10.5.5.122:41282 (CLOSE_WAIT)
asd 32744 root 165u IPv4 4981764 0t0 TCP ODL1:hbci->10.5.5.122:41283 (CLOSE_WAIT)
java 37190 root 94u IPv6 56971026 0t0 TCP ODL1:49841->ODL1:hbci (CLOSE_WAIT)
java 37190 root 102u IPv6 57525905 0t0 TCP ODL1:53218->ODL1:hbci (SYN_SENT)
java 37190 root 108u IPv6 56973294 0t0 TCP ODL1:35037->10.5.5.123:hbci (CLOSE_WAIT)
java 37190 root 193u IPv6 57523588 0t0 TCP ODL1:53205->ODL1:hbci (SYN_SENT)
java 37190 root 194u IPv6 15186760 0t0 TCP ODL1:34457->10.5.5.122:hbci (ESTABLISHED)
java 37190 root 195u IPv6 15186761 0t0 TCP ODL1:58143->10.5.5.123:hbci (ESTABLISHED)
java 37190 root 196u IPv6 15186762 0t0 TCP ODL1:34459->10.5.5.122:hbci (ESTABLISHED)
java 37190 root 197u IPv6 15186763 0t0 TCP ODL1:34460->10.5.5.122:hbci (ESTABLISHED)
java 37190 root 198u IPv6 15186764 0t0 TCP ODL1:34461->10.5.5.122:hbci (ESTABLISHED)
java 37190 root 202u IPv6 15186771 0t0 TCP ODL1:58150->10.5.5.123:hbci (ESTABLISHED)
java 37190 root 203u IPv6 15186772 0t0 TCP ODL1:58151->10.5.5.123:hbci (ESTABLISHED)
java 37190 root 204u IPv6 15186773 0t0 TCP ODL1:58152->10.5.5.123:hbci (ESTABLISHED)
java 37190 root 208u IPv6 57524362 0t0 TCP ODL1:53214->ODL1:hbci (ESTABLISHED)
java 37190 root 219u IPv6 57525967 0t0 TCP ODL1:53220->ODL1:hbci (ESTABLISHED)
java 37190 root 230u IPv6 34672088 0t0 TCP ODL1:36450->10.5.5.122:hbci (ESTABLISHED)
java 37190 root 239u IPv6 56795184 0t0 TCP ODL1:39555->10.5.5.122:hbci (CLOSE_WAIT)
java 37190 root 243u IPv6 56795194 0t0 TCP ODL1:39561->10.5.5.122:hbci (CLOSE_WAIT)
java 37190 root 245u IPv6 56795180 0t0 TCP ODL1:39552->10.5.5.122:hbci (CLOSE_WAIT)
java 37190 root 246u IPv6 56795176 0t0 TCP ODL1:39549->10.5.5.122:hbci (CLOSE_WAIT)
java 37190 root 249u IPv6 56795195 0t0 TCP ODL1:39562->10.5.5.122:hbci (CLOSE_WAIT)
java 37190 root 250u IPv6 56795175 0t0 TCP ODL1:39550->10.5.5.122:hbci (CLOSE_WAIT)
java 37190 root 252u IPv6 56795196 0t0 TCP ODL1:39563->10.5.5.122:hbci (CLOSE_WAIT)
java 37190 root 253u IPv6 56795197 0t0 TCP ODL1:39564->10.5.5.122:hbci (CLOSE_WAIT)
java 37190 root 260u IPv6 34704541 0t0 TCP ODL1:60169->10.5.5.123:hbci (ESTABLISHED)
java 37190 root 262u IPv6 56795250 0t0 TCP ODL1:39570->10.5.5.122:hbci (CLOSE_WAIT)
java 37190 root 265u IPv6 56795306 0t0 TCP ODL1:39572->10.5.5.122:hbci (CLOSE_WAIT)
[root@ODL1 ~]# aql // AQL DOESN’T WORK ON THE BAD MACHINE
2016-02-15 19:02:00 WARN AEROSPIKE_ERR_TIMEOUT Error -1: Failed to seed cluster
[root@ODL1 ~]# asinfo // ASINFO ALSO CAN'T CONNECT TO PORT 3000
request to 127.0.0.1 : 3000 returned error
This is the result of `grep -E "starting|finished" aerospike.log` (the alternation needs extended regex).
You can see that at 15:55:58 it stopped completing scans: jobs keep starting, but no more "finished" lines follow.
At 16:04:54 it stopped receiving scan requests entirely, probably because all clients had already disconnected from it and can't reconnect.
Feb 15 2016 15:55:54 GMT: INFO (scan): (scan.c::648) starting basic scan job 372530086278853 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:54 GMT: INFO (scan): (scan.c::711) finished basic scan job 372530086278853 (0)
Feb 15 2016 15:55:54 GMT: INFO (scan): (scan.c::648) starting basic scan job 372529920336872 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:54 GMT: INFO (scan): (scan.c::711) finished basic scan job 372529920336872 (0)
Feb 15 2016 15:55:55 GMT: INFO (scan): (scan.c::648) starting basic scan job 372532812666049 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:55 GMT: INFO (scan): (scan.c::711) finished basic scan job 372532812666049 (0)
Feb 15 2016 15:55:56 GMT: INFO (scan): (scan.c::648) starting basic scan job 372533813054184 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:56 GMT: INFO (scan): (scan.c::711) finished basic scan job 372533813054184 (0)
Feb 15 2016 15:55:56 GMT: INFO (scan): (scan.c::648) starting basic scan job 372531286113207 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:56 GMT: INFO (scan): (scan.c::711) finished basic scan job 372531286113207 (0)
Feb 15 2016 15:55:56 GMT: INFO (scan): (scan.c::648) starting basic scan job 372531120144258 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:56 GMT: INFO (scan): (scan.c::711) finished basic scan job 372531120144258 (0)
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::648) starting basic scan job 372532286405445 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::711) finished basic scan job 372532286405445 (0)
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::648) starting basic scan job 372532120431155 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::711) finished basic scan job 372532120431155 (0)
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::648) starting basic scan job 372535012471394 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::711) finished basic scan job 372535012471394 (0)
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::648) starting basic scan job 372532486816726 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::711) finished basic scan job 372532486816726 (0)
Feb 15 2016 15:55:58 GMT: INFO (scan): (scan.c::648) starting basic scan job 372536012405702 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:58 GMT: INFO (scan): (scan.c::711) finished basic scan job 372536012405702 (0)
Feb 15 2016 15:55:58 GMT: INFO (scan): (scan.c::648) starting basic scan job 372533485961684 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:58 GMT: INFO (scan): (scan.c::711) finished basic scan job 372533485961684 (0)
Feb 15 2016 15:55:58 GMT: INFO (scan): (scan.c::648) starting basic scan job 372533320529258 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:56:16 GMT: INFO (scan): (scan.c::648) starting basic scan job 372551325234792 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:56:34 GMT: INFO (scan): (scan.c::648) starting basic scan job 372569340745032 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:56:52 GMT: INFO (scan): (scan.c::648) starting basic scan job 372587356267102 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:57:10 GMT: INFO (scan): (scan.c::648) starting basic scan job 372605371408710 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:57:28 GMT: INFO (scan): (scan.c::648) starting basic scan job 372623386762906 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:57:46 GMT: INFO (scan): (scan.c::648) starting basic scan job 372641402198622 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:58:04 GMT: INFO (scan): (scan.c::648) starting basic scan job 372659417658962 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:58:22 GMT: INFO (scan): (scan.c::648) starting basic scan job 372677433135577 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:59:32 GMT: INFO (scan): (scan.c::648) starting basic scan job 372756866416151 {global:topologymodel} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 16:04:54 GMT: INFO (scan): (scan.c::648) starting basic scan job 373070085515291 {global:topologymodel} priority 2, sample-pct 100, fail-on-cluster-change
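For reference, the filter above can be reproduced as below; note that plain `grep` treats the `|` alternation literally, so `-E` (or `egrep`) is needed. The sample lines are inlined here so the snippet is self-contained.

```shell
# Reproduce the scan-lifecycle filter on inlined sample lines.
# Plain grep would treat "|" as a literal character; -E enables alternation.
printf '%s\n' \
  'Feb 15 2016 15:55:54 GMT: INFO (scan): starting basic scan job 372530086278853' \
  'Feb 15 2016 15:55:54 GMT: INFO (scan): finished basic scan job 372530086278853' \
  'Feb 15 2016 16:06:12 GMT: INFO (nsup): {global} nsup start' \
  | grep -E "starting|finished"
# prints only the two scan lines; the nsup line is filtered out
```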
The client noticed the problem at 18:04:56 (which is 16:04:56 GMT):
[2016-02-15 18:04:56,012] INFO : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 18:04:57,264] INFO : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 18:04:59,517] INFO : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 18:05:00,770] INFO : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: Error Code 11: java.net.SocketTimeoutException: connect timed out
A different client noticed the problem at 17:58:15 (15:58:15 GMT):
[2016-02-15 17:58:15,117] INFO : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:16,369] INFO : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:22,892] INFO : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:24,145] INFO : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:25,397] INFO : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:26,650] INFO : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:27,902] INFO : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:29,154] INFO : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:30,407] INFO : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:31,659] INFO : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
The worst part of this state is that the other Aerospike nodes still see the bad node as alive, so they don't remove it from the cluster. The Java client's tend thread won't drop it from the known cluster either, as long as the "good" nodes keep reporting it as up. Even if we catch this type of exception in the client and invoke node.close(), we are still in a bad state: no data migration has happened, so scans, for example, would return partial results.
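Since neither the cluster heartbeat nor the client's tend thread flags the node, one workaround is to probe the service port from outside the client altogether, mimicking the failed `nc` handshake above. This is only a sketch under assumptions: the node list and the 3-second timeout are made up, and it relies on bash's `/dev/tcp` redirection plus coreutils `timeout`.

```shell
#!/bin/bash
# Sketch: independently probe each node's service port with a short
# connect timeout. A node that looks "alive" to its peers but never
# completes the TCP handshake (as in the tcpdump above) fails this check.
# The node list and the 3-second timeout are assumptions, not from the report.
NODES="10.0.57.237 10.0.57.238"
for n in $NODES; do
  if timeout 3 bash -c "exec 3<>/dev/tcp/$n/3000" 2>/dev/null; then
    echo "$n:3000 is accepting connections"
  else
    echo "$n:3000 is NOT accepting -- node may be wedged despite healthy heartbeats"
  fi
done
```

A cron job or monitoring hook could run something like this and alert (or restart asd) when a node stops accepting connections, independently of what the cluster itself reports.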
This is what the bad Aerospike node has logged since the last scan request was received. It keeps writing these lines repeatedly until I restart the node:
Feb 15 2016 16:05:07 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /var/log/contextream/xl/aerospike_data/regional.dat: used 3016064, contig-free 8187M (65503 wblocks), swb-free 2, w-q 0 w-tot 29959 (0.0/s), defrag-q 0 defrag-tot 29236 (0.0/s) defrag-w-tot 13721 (0.0/s)
Feb 15 2016 16:05:08 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /var/log/contextream/xl/aerospike_data/local.dat: used 768, contig-free 498M (3990 wblocks), swb-free 1, w-q 0 w-tot 2189 (0.0/s), defrag-q 0 defrag-tot 2182 (0.0/s) defrag-w-tot 8 (0.0/s)
Feb 15 2016 16:05:08 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /var/log/contextream/xl/aerospike_data/global.dat: used 724224, contig-free 497M (3982 wblocks), swb-free 2, w-q 0 w-tot 18037 (0.2/s), defrag-q 0 defrag-tot 18022 (0.2/s) defrag-w-tot 1737 (0.1/s)
Feb 15 2016 16:06:12 GMT: INFO (nsup): (thr_nsup.c::1206) {global} nsup start
Feb 15 2016 16:06:12 GMT: INFO (nsup): (thr_nsup.c::1130) {global} Records: 353, 352 0-vt, 0(0) expired, 0(0) evicted, 0(0) set deletes. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 2 ms
Feb 15 2016 16:06:12 GMT: INFO (nsup): (thr_nsup.c::1206) {regional} nsup start
Feb 15 2016 16:06:12 GMT: INFO (nsup): (thr_nsup.c::1130) {regional} Records: 1306, 0 0-vt, 0(0) expired, 0(0) evicted, 0(0) set deletes. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 3 ms
Feb 15 2016 16:06:12 GMT: INFO (nsup): (thr_nsup.c::1206) {local} nsup start
Feb 15 2016 16:06:12 GMT: INFO (nsup): (thr_nsup.c::1130) {local} Records: 0, 0 0-vt, 0(0) expired, 0(0) evicted, 0(0) set deletes. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 1 ms