Aerospike Node suddenly can't receive client connections


#1

Aerospike Server 3.7.2 Enterprise Java Client 3.1.8

I have 3 aerospike nodes.

After 3 days of up time, I get to this state:

All 3 nodes see each other as up and valid.

Aerospike node on 10.0.57.237 disconnected from all of its clients, and the clients can’t reconnect to it, even though they keep trying.

Aerospike node on 10.0.57.238 is a valid node, I am adding logs regarding this node as a comparison.

[root@centos-devel-rnd-1 ~]# nc 10.0.57.237 3000 // WE SEE ONLY OUTGOING TRAFFIC AND NO INCOMING TRAFFIC

Tcpdump output:

18:46:44.774801 IP centos-devel-rnd-1.contextream.local.47550 > 10.0.57.237.hbci: Flags [S], seq 2211837179, win 14600, options [mss 1460,sackOK,TS val 2315045493 ecr 0,nop,wscale 7], length 0
18:46:45.774099 IP centos-devel-rnd-1.contextream.local.47550 > 10.0.57.237.hbci: Flags [S], seq 2211837179, win 14600, options [mss 1460,sackOK,TS val 2315046493 ecr 0,nop,wscale 7], length 0
18:46:47.774093 IP centos-devel-rnd-1.contextream.local.47550 > 10.0.57.237.hbci: Flags [S], seq 2211837179, win 14600, options [mss 1460,sackOK,TS val 2315048493 ecr 0,nop,wscale 7], length 0
18:46:51.774087 IP centos-devel-rnd-1.contextream.local.47550 > 10.0.57.237.hbci: Flags [S], seq 2211837179, win 14600, options [mss 1460,sackOK,TS val 2315052493 ecr 0,nop,wscale 7], length 0
18:46:59.774124 IP centos-devel-rnd-1.contextream.local.47550 > 10.0.57.237.hbci: Flags [S], seq 2211837179, win 14600, options [mss 1460,sackOK,TS val 2315060493 ecr 0,nop,wscale 7], length 0
18:47:15.774112 IP centos-devel-rnd-1.contextream.local.47550 > 10.0.57.237.hbci: Flags [S], seq 2211837179, win 14600, options [mss 1460,sackOK,TS val 2315076493 ecr 0,nop,wscale 7], length 0

[root@centos-devel-rnd-1 ~]# nc 10.0.57.237 22 // SSH TRAFFIC IS TWO DIRECTIONAL, SHOWING THE MACHINE ITSELF CAN ACCEPT CONNECTIONS

Tcpdump output:

18:47:56.518008 IP centos-devel-rnd-1.contextream.local.52800 > 10.0.57.237.ssh: Flags [S], seq 1260813098, win 14600, options [mss 1460,sackOK,TS val 2315117236 ecr 0,nop,wscale 7], length 0
18:47:56.518610 IP 10.0.57.237.ssh > centos-devel-rnd-1.contextream.local.52800: Flags [S.], seq 1423742651, ack 1260813099, win 13480, options [mss 1360,sackOK,TS val 375354279 ecr 2315117236,nop,wscale 9], length 0
18:47:56.518650 IP centos-devel-rnd-1.contextream.local.52800 > 10.0.57.237.ssh: Flags [.], ack 1, win 115, options [nop,nop,TS val 2315117237 ecr 375354279], length 0
18:47:56.524533 IP 10.0.57.237.ssh > centos-devel-rnd-1.contextream.local.52800: Flags [P.], seq 1:22, ack 1, win 27, options [nop,nop,TS val 375354285 ecr 2315117237], length 21
18:47:56.524557 IP centos-devel-rnd-1.contextream.local.52800 > 10.0.57.237.ssh: Flags [.], ack 22, win 115, options [nop,nop,TS val 2315117243 ecr 375354285], length 0

[root@centos-devel-rnd-1 ~]# nc 10.0.57.238 3000 // PORT 3000 ON THE VALID NODE SHOW TWO-DIRECTIONAL TRAFFIC

Tcpdump output:

18:48:50.356771 IP centos-devel-rnd-1.contextream.local.35639 > 10.0.57.238.hbci: Flags [S], seq 1277805552, win 14600, options [mss 1460,sackOK,TS val 2315171075 ecr 0,nop,wscale 7], length 0
18:48:50.357411 IP 10.0.57.238.hbci > centos-devel-rnd-1.contextream.local.35639: Flags [S.], seq 4081678047, ack 1277805553, win 13480, options [mss 1360,sackOK,TS val 375405293 ecr 2315171075,nop,wscale 9], length 0
18:48:50.357478 IP centos-devel-rnd-1.contextream.local.35639 > 10.0.57.238.hbci: Flags [.], ack 1, win 115, options [nop,nop,TS val 2315171076 ecr 375405293], length 0
18:49:50.955167 IP 10.0.57.238.hbci > centos-devel-rnd-1.contextream.local.35639: Flags [F.], seq 1, ack 1, win 27, options [nop,nop,TS val 375465890 ecr 2315171076], length 0
18:49:50.955531 IP centos-devel-rnd-1.contextream.local.35639 > 10.0.57.238.hbci: Flags [F.], seq 1, ack 2, win 115, options [nop,nop,TS val 2315231674 ecr 375465890], length 0
18:49:50.956120 IP 10.0.57.238.hbci > centos-devel-rnd-1.contextream.local.35639: Flags [.], ack 2, win 27, options [nop,nop,TS val 375465891 ecr 2315231674], length 0

[root@ODL1 ~]# lsof -i :3000 // OPEN FILES ON PORT 3000 IN THE BAD MACHINE

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME

asd     32744 root   62u  IPv4    59748      0t0  TCP *:hbci (LISTEN)
asd     32744 root   68u  IPv4  4981013      0t0  TCP ODL1:hbci->10.5.5.123:46382 (CLOSE_WAIT)
asd     32744 root   98u  IPv4 11191866      0t0  TCP ODL1:hbci->10.5.5.122:41916 (CLOSE_WAIT)
asd     32744 root  101u  IPv4 57014482      0t0  TCP ODL1:hbci->10.5.5.122:47405 (CLOSE_WAIT)
asd     32744 root  118u  IPv4 57018263      0t0  TCP ODL1:hbci->10.5.5.102:60454 (CLOSE_WAIT)
asd     32744 root  119u  IPv4 57018304      0t0  TCP ODL1:hbci->10.5.5.122:47406 (CLOSE_WAIT)
asd     32744 root  120u  IPv4 57020234      0t0  TCP ODL1:hbci->10.5.5.122:47407 (CLOSE_WAIT)
asd     32744 root  158u  IPv4 35908302      0t0  TCP ODL1:hbci->10.5.5.122:44478 (CLOSE_WAIT)
asd     32744 root  162u  IPv4  4981761      0t0  TCP ODL1:hbci->10.5.5.122:41281 (CLOSE_WAIT)
asd     32744 root  163u  IPv4  4981715      0t0  TCP ODL1:hbci->10.5.5.122:41278 (CLOSE_WAIT)
asd     32744 root  164u  IPv4  4981763      0t0  TCP ODL1:hbci->10.5.5.122:41282 (CLOSE_WAIT)
asd     32744 root  165u  IPv4  4981764      0t0  TCP ODL1:hbci->10.5.5.122:41283 (CLOSE_WAIT)
java    37190 root   94u  IPv6 56971026      0t0  TCP ODL1:49841->ODL1:hbci (CLOSE_WAIT)
java    37190 root  102u  IPv6 57525905      0t0  TCP ODL1:53218->ODL1:hbci (SYN_SENT)
java    37190 root  108u  IPv6 56973294      0t0  TCP ODL1:35037->10.5.5.123:hbci (CLOSE_WAIT)
java    37190 root  193u  IPv6 57523588      0t0  TCP ODL1:53205->ODL1:hbci (SYN_SENT)
java    37190 root  194u  IPv6 15186760      0t0  TCP ODL1:34457->10.5.5.122:hbci (ESTABLISHED)
java    37190 root  195u  IPv6 15186761      0t0  TCP ODL1:58143->10.5.5.123:hbci (ESTABLISHED)
java    37190 root  196u  IPv6 15186762      0t0  TCP ODL1:34459->10.5.5.122:hbci (ESTABLISHED)
java    37190 root  197u  IPv6 15186763      0t0  TCP ODL1:34460->10.5.5.122:hbci (ESTABLISHED)
java    37190 root  198u  IPv6 15186764      0t0  TCP ODL1:34461->10.5.5.122:hbci (ESTABLISHED)
java    37190 root  202u  IPv6 15186771      0t0  TCP ODL1:58150->10.5.5.123:hbci (ESTABLISHED)
java    37190 root  203u  IPv6 15186772      0t0  TCP ODL1:58151->10.5.5.123:hbci (ESTABLISHED)
java    37190 root  204u  IPv6 15186773      0t0  TCP ODL1:58152->10.5.5.123:hbci (ESTABLISHED)
java    37190 root  208u  IPv6 57524362      0t0  TCP ODL1:53214->ODL1:hbci (ESTABLISHED)
java    37190 root  219u  IPv6 57525967      0t0  TCP ODL1:53220->ODL1:hbci (ESTABLISHED)
java    37190 root  230u  IPv6 34672088      0t0  TCP ODL1:36450->10.5.5.122:hbci (ESTABLISHED)
java    37190 root  239u  IPv6 56795184      0t0  TCP ODL1:39555->10.5.5.122:hbci (CLOSE_WAIT)
java    37190 root  243u  IPv6 56795194      0t0  TCP ODL1:39561->10.5.5.122:hbci (CLOSE_WAIT)
java    37190 root  245u  IPv6 56795180      0t0  TCP ODL1:39552->10.5.5.122:hbci (CLOSE_WAIT)
java    37190 root  246u  IPv6 56795176      0t0  TCP ODL1:39549->10.5.5.122:hbci (CLOSE_WAIT)
java    37190 root  249u  IPv6 56795195      0t0  TCP ODL1:39562->10.5.5.122:hbci (CLOSE_WAIT)
java    37190 root  250u  IPv6 56795175      0t0  TCP ODL1:39550->10.5.5.122:hbci (CLOSE_WAIT)
java    37190 root  252u  IPv6 56795196      0t0  TCP ODL1:39563->10.5.5.122:hbci (CLOSE_WAIT)
java    37190 root  253u  IPv6 56795197      0t0  TCP ODL1:39564->10.5.5.122:hbci (CLOSE_WAIT)
java    37190 root  260u  IPv6 34704541      0t0  TCP ODL1:60169->10.5.5.123:hbci (ESTABLISHED)
java    37190 root  262u  IPv6 56795250      0t0  TCP ODL1:39570->10.5.5.122:hbci (CLOSE_WAIT)
java    37190 root  265u  IPv6 56795306      0t0  TCP ODL1:39572->10.5.5.122:hbci (CLOSE_WAIT)

[root@ODL1 ~]# aql // AQL DOESN’T WORK ON THE BAD MACHINE

2016-02-15 19:02:00 WARN AEROSPIKE_ERR_TIMEOUT Error -1: Failed to seed cluster

[root@ODL1 ~]# asinfo // ASINFO ALSO CANT CONNECT TO PORT 3000

request to 127.0.0.1 : 3000 returned error

This is the result of ‘grep “starting|finished” aerospike.log’.

You can can see that in 15:55:58 it stopped processing the scans.

In 16:04:54 it stopped receiving any scans, probably because all clients already disconnected from it, and can’t reconnect.

Feb 15 2016 15:55:54 GMT: INFO (scan): (scan.c::648) starting basic scan job 372530086278853 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:54 GMT: INFO (scan): (scan.c::711) finished basic scan job 372530086278853 (0)
Feb 15 2016 15:55:54 GMT: INFO (scan): (scan.c::648) starting basic scan job 372529920336872 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:54 GMT: INFO (scan): (scan.c::711) finished basic scan job 372529920336872 (0)
Feb 15 2016 15:55:55 GMT: INFO (scan): (scan.c::648) starting basic scan job 372532812666049 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:55 GMT: INFO (scan): (scan.c::711) finished basic scan job 372532812666049 (0)
Feb 15 2016 15:55:56 GMT: INFO (scan): (scan.c::648) starting basic scan job 372533813054184 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:56 GMT: INFO (scan): (scan.c::711) finished basic scan job 372533813054184 (0)
Feb 15 2016 15:55:56 GMT: INFO (scan): (scan.c::648) starting basic scan job 372531286113207 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:56 GMT: INFO (scan): (scan.c::711) finished basic scan job 372531286113207 (0)
Feb 15 2016 15:55:56 GMT: INFO (scan): (scan.c::648) starting basic scan job 372531120144258 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:56 GMT: INFO (scan): (scan.c::711) finished basic scan job 372531120144258 (0)
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::648) starting basic scan job 372532286405445 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::711) finished basic scan job 372532286405445 (0)
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::648) starting basic scan job 372532120431155 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::711) finished basic scan job 372532120431155 (0)
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::648) starting basic scan job 372535012471394 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::711) finished basic scan job 372535012471394 (0)
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::648) starting basic scan job 372532486816726 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:57 GMT: INFO (scan): (scan.c::711) finished basic scan job 372532486816726 (0)
Feb 15 2016 15:55:58 GMT: INFO (scan): (scan.c::648) starting basic scan job 372536012405702 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:58 GMT: INFO (scan): (scan.c::711) finished basic scan job 372536012405702 (0)
Feb 15 2016 15:55:58 GMT: INFO (scan): (scan.c::648) starting basic scan job 372533485961684 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:55:58 GMT: INFO (scan): (scan.c::711) finished basic scan job 372533485961684 (0)
Feb 15 2016 15:55:58 GMT: INFO (scan): (scan.c::648) starting basic scan job 372533320529258 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:56:16 GMT: INFO (scan): (scan.c::648) starting basic scan job 372551325234792 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:56:34 GMT: INFO (scan): (scan.c::648) starting basic scan job 372569340745032 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:56:52 GMT: INFO (scan): (scan.c::648) starting basic scan job 372587356267102 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:57:10 GMT: INFO (scan): (scan.c::648) starting basic scan job 372605371408710 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:57:28 GMT: INFO (scan): (scan.c::648) starting basic scan job 372623386762906 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:57:46 GMT: INFO (scan): (scan.c::648) starting basic scan job 372641402198622 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:58:04 GMT: INFO (scan): (scan.c::648) starting basic scan job 372659417658962 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:58:22 GMT: INFO (scan): (scan.c::648) starting basic scan job 372677433135577 {global:tap} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 15:59:32 GMT: INFO (scan): (scan.c::648) starting basic scan job 372756866416151 {global:topologymodel} priority 2, sample-pct 100, fail-on-cluster-change
Feb 15 2016 16:04:54 GMT: INFO (scan): (scan.c::648) starting basic scan job 373070085515291 {global:topologymodel} priority 2, sample-pct 100, fail-on-cluster-change

The client notice the problem on 18:04:56 (Which is 16:04:56 GMT):

[2016-02-15 18:04:56,012] INFO  : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 18:04:57,264] INFO  : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 18:04:59,517] INFO  : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 18:05:00,770] INFO  : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: Error Code 11: java.net.SocketTimeoutException: connect timed out

A different client which notices the problem on 17:58:15 (15:58:15 GMT):

[2016-02-15 17:58:15,117] INFO  : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:16,369] INFO  : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:22,892] INFO  : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:24,145] INFO  : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:25,397] INFO  : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:26,650] INFO  : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:27,902] INFO  : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:29,154] INFO  : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:30,407] INFO  : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out
[2016-02-15 17:58:31,659] INFO  : (com.cxtrm.contexmap.aerospike.impl.AerospikeConnector:110) - [tend] Node BB96809463E16FA 10.5.5.121:3000 refresh failed: java.net.SocketTimeoutException: Read timed out

The worst thing about this state is that the other aerospike nodes still see this aerospike node as alive, so they don’t take it out of the cluster. The aerospike client’s tend thread won’t remove it from the known cluster as long as the “good” aerospike nodes tell him it’s up. Even if we catch this type of Exception in the client, and invoke node.close(), we will still be in a bad state, because a data migration didn’t happen, and scans for example would give us partial results.

This log output is what we see in the bad aerospike node since the last scan request received. It keeps writing this data repetitively until I restart the node:

Feb 15 2016 16:05:07 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /var/log/contextream/xl/aerospike_data/regional.dat: used 3016064, contig-free 8187M (65503 wblocks), swb-free 2, w-q 0 w-tot 29959 (0.0/s), defrag-q 0 defrag-tot 292
36 (0.0/s) defrag-w-tot 13721 (0.0/s)
Feb 15 2016 16:05:08 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /var/log/contextream/xl/aerospike_data/local.dat: used 768, contig-free 498M (3990 wblocks), swb-free 1, w-q 0 w-tot 2189 (0.0/s), defrag-q 0 defrag-tot 2182 (0.0/s) 
defrag-w-tot 8 (0.0/s)
Feb 15 2016 16:05:08 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /var/log/contextream/xl/aerospike_data/global.dat: used 724224, contig-free 497M (3982 wblocks), swb-free 2, w-q 0 w-tot 18037 (0.2/s), defrag-q 0 defrag-tot 18022 (0
.2/s) defrag-w-tot 1737 (0.1/s)

Feb 15 2016 16:06:12 GMT: INFO (nsup): (thr_nsup.c::1206) {global} nsup start
Feb 15 2016 16:06:12 GMT: INFO (nsup): (thr_nsup.c::1130) {global} Records: 353, 352 0-vt, 0(0) expired, 0(0) evicted, 0(0) set deletes. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 2 ms
Feb 15 2016 16:06:12 GMT: INFO (nsup): (thr_nsup.c::1206) {regional} nsup start
Feb 15 2016 16:06:12 GMT: INFO (nsup): (thr_nsup.c::1130) {regional} Records: 1306, 0 0-vt, 0(0) expired, 0(0) evicted, 0(0) set deletes. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 3 ms
Feb 15 2016 16:06:12 GMT: INFO (nsup): (thr_nsup.c::1206) {local} nsup start
Feb 15 2016 16:06:12 GMT: INFO (nsup): (thr_nsup.c::1130) {local} Records: 0, 0 0-vt, 0(0) expired, 0(0) evicted, 0(0) set deletes. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 1 ms

#2

Appears that the ulimit -n for the asd procesd is 1024 and you have exceeded that limit. The proto-fd-max cannot be less strict that the systems ulimit. Normally the init scripts increase the ulimit -n for asd to 100,000, if you have chamged which user runs asd you will need to modify that userd ulimit.

See /etc/security/limits.conf


#3

I updated my post. The mismatch was because I ran the ‘asd’ in the debugging process and it ran a new aerospike process without initiating the ulimit to 100,000. It is not related to the problem itself that happened hours prior to that.


#4

There is a log line containing the word “proto” which would be interesting to see.

  1. How many client apps are you running?
  2. Do the client apps often reinstantiate the areospike connection?
  3. Are you using the async or sync java APIs?

#5

This is the last log of the data around ‘proto’:

Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5039)  system memory: free 41984368kb ( 84 percent free ) 
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5045)  ClusterSize 3 ::: objects 3756 ::: sub_objects 0
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5054)  rec refs 3756 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5059)  replica errs :: null 0 non-null 0 ::: sync copy errs :: master 0 
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5069)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: iq 0 ::: dq 0 : fds - proto (18, 4368, 4350) : hb (3, 15, 12) : fab (43, 116, 73)
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5071)    heartbeat_received: self 0 : foreign 10804233
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5072)    heartbeat_stats: bt 0 bf 7192761 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 9 efd 8 efa 1 um 0 mcf 0 rc 12 
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5084)    tree_counts: nsup 0 scan 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5101) {global} disk bytes used 724224 : avail pct 99
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5103) {global} memory bytes used 439657 (index 72576 : sindex 0 : data 367081) : used pct 0.42
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5151) {global} migrations - complete
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5101) {regional} disk bytes used 3016064 : avail pct 99
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5103) {regional} memory bytes used 2499599 (index 167616 : sindex 118120 : data 2213863) : used pct 0.03
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5151) {regional} migrations - complete
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5101) {local} disk bytes used 768 : avail pct 99
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5103) {local} memory bytes used 333 (index 192 : sindex 0 : data 141) : used pct 0.00
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5151) {local} migrations - complete
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5158)    partitions: actual 4107 sync 6816 desync 0 zombie 0 absent 1365
Feb 15 2016 15:56:52 GMT: INFO (info): (hist.c::137) histogram dump: reads (567134485 total) msec
Feb 15 2016 15:56:52 GMT: INFO (info): (hist.c::154)  (00: 0567130293) (01: 0000003815) (02: 0000000289) (03: 0000000028)
Feb 15 2016 15:56:52 GMT: INFO (info): (hist.c::163)  (04: 0000000059) (05: 0000000001)
Feb 15 2016 15:56:52 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (1601579 total) msec
Feb 15 2016 15:56:52 GMT: INFO (info): (hist.c::154)  (00: 0001583790) (01: 0000014288) (02: 0000001793) (03: 0000000723)
Feb 15 2016 15:56:52 GMT: INFO (info): (hist.c::154)  (04: 0000000624) (05: 0000000355) (06: 0000000005) (13: 0000000001)
Feb 15 2016 15:56:52 GMT: INFO (info): (hist.c::137) histogram dump: proxy (2359 total) msec
Feb 15 2016 15:56:52 GMT: INFO (info): (hist.c::154)  (00: 0000002339) (01: 0000000015) (04: 0000000001) (05: 0000000002)
Feb 15 2016 15:56:52 GMT: INFO (info): (hist.c::163)  (06: 0000000001) (07: 0000000001)
Feb 15 2016 15:56:52 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Feb 15 2016 15:56:52 GMT: INFO (info): (hist.c::137) histogram dump: query (26880 total) msec
Feb 15 2016 15:56:52 GMT: INFO (info): (hist.c::163)  (00: 0000023381) (01: 0000003496) (02: 0000000003)
Feb 15 2016 15:56:52 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (26880 total) count
Feb 15 2016 15:56:52 GMT: INFO (info): (hist.c::163)  (06: 0000025760) (07: 0000001120)
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5340) node id bb96809463e16fa
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5344) reads 47002149,520132336 : writes 1601578,0
Feb 15 2016 15:56:52 GMT: INFO (info): (thr_info.c::5348) udf reads 0,0 : udf writes 0,0 : udf deletes 0,0 : lua errors 0

Note that this log is printed at 15:56:52, and is not printed anymore. This is around the time when the scans stopped responding.

  1. We have 3 Java clients of our own, and 3 Spark clients too.

  2. Our Java clients don’t. But I think that the Spark slaves do reinstantiate the connections oftenly.

  3. We use both Sync and Async java APIs. We connect from the Java clients with AsyncClient.

This is an example of this output from a different valid node in the cluster while the bad node doesn’t accept client connections:

Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5039)  system memory: free 41852944kb ( 84 percent free ) 
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5045)  ClusterSize 3 ::: objects 3834 ::: sub_objects 0
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5054)  rec refs 3834 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5059)  replica errs :: null 0 non-null 0 ::: sync copy errs :: master 0 
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5069)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: iq 0 ::: dq 0 : fds - proto (18, 5452, 5434) : hb (3, 19, 16) : fab (43, 139, 96)
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5071)    heartbeat_received: self 0 : foreign 10819760
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5072)    heartbeat_stats: bt 0 bf 7198510 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 12 efd 11 efa 1 um 0 mcf 0 rc 16 
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5084)    tree_counts: nsup 0 scan 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5101) {global} disk bytes used 724224 : avail pct 99
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5103) {global} memory bytes used 439657 (index 72576 : sindex 0 : data 367081) : used pct 0.42
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5151) {global} migrations - complete
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5101) {regional} disk bytes used 3107712 : avail pct 99
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5103) {regional} memory bytes used 2574958 (index 172608 : sindex 120696 : data 2281654) : used pct 0.03
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5151) {regional} migrations - complete
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5101) {local} disk bytes used 768 : avail pct 99
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5103) {local} memory bytes used 333 (index 192 : sindex 0 : data 141) : used pct 0.00
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5151) {local} migrations - complete
Feb 15 2016 16:05:20 GMT: INFO (info): (thr_info.c::5158)    partitions: actual 3963 sync 6899 desync 0 zombie 0 absent 1426
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::137) histogram dump: reads (510687775 total) msec
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::154)  (00: 0510682272) (01: 0000004694) (02: 0000000624) (03: 0000000127)
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::154)  (04: 0000000050) (05: 0000000004) (06: 0000000001) (07: 0000000002)
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::163)  (13: 0000000001)
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (3003333 total) msec
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::154)  (00: 0002960232) (01: 0000033818) (02: 0000003900) (03: 0000001860)
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::154)  (04: 0000002233) (05: 0000001272) (06: 0000000011) (07: 0000000001)
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::163)  (08: 0000000002) (09: 0000000003) (13: 0000000001)
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::137) histogram dump: proxy (1334 total) msec
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::154)  (00: 0000001310) (01: 0000000018) (02: 0000000004) (03: 0000000001)
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::163)  (04: 0000000001)
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::137) histogram dump: query (26880 total) msec
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::154)  (00: 0000019261) (01: 0000007598) (02: 0000000006) (03: 0000000006)
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::163)  (04: 0000000002) (05: 0000000007)
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (26880 total) count
Feb 15 2016 16:05:20 GMT: INFO (info): (hist.c::163)  (06: 0000023520) (07: 0000003360)

#6

Hello Guy,

Just in case, what kind of Spark jobs are connecting to Aerospike?

Batches or streaming?

If you’re doing streaming, what is the batch streaming interval, and what is the level of parallelism (executors * cores) of your jobs?

Are those jobs written in Scala?

I ask you those questions because in Spark streaming you have to be very careful to avoid stacking connections.

The most common pattern (@transient lazy val connection = … ) is ok for batches jobs, but not adapted at all for spark streaming jobs : it will create (executors * cores) connections that are probably not closed properly at each batch interva ("@transient lazy val" are not reused over batches)…

If you think your problem may be related to what I described, I can give you some hints on how to solve it


#7

I will have someone respond to those questions.

Why do you even suspect open files?

We work with the default of ulimit 100k that Aerospike changes on startup, and this was validated in the /limits in the machine.

We also use the default config of 15k file descriptors for client connections.

So we are way way way under the limits.


#8

Hi, @nep

We’re not using Spark Streaming because we’re not dashboard or connecting to remote DB. Our client/driver code is with Java and we are connecting to Slaves which write in Scala. We’re using the AeroSpark connector, https://github.com/sasha-polev/aerospark I checked the code on aerospark and I saw that it calls AerospikeClient.

Sasi


#9

Hello,

Ok, so it feels it should not be the problem, sorry I’ve got no other ideas


#10

Hi nep,

I’m having this exact issue. I have a Spark streaming job written in Scala which persists data to Aerospike using the aerospike-spark client. It appears that client connections are not properly closed and connections are stacking across microbatch intervals. Will you please share your insights as to how to address this?