Aerospike Cluster Automatically Errors


#1

Hi,

I have been using aerospike replicated and clustered on 2 machines that are sitting right next to each other. Here is the error that I get:

Nov 26 2015 08:02:50 GMT: ERROR(11563) [connect.cc:69] [Connect] - Connecting to Cluster Failed
api-7001-0 Connection to Aerospike cluster failed!
api-7001-0 { code: -1,
api-7001-0   message: 'Failed to seed cluster',
api-7001-0   func: 'as_cluster_seed_nodes',
api-7001-0   file: 'src/main/aerospike/as_cluster.c',
api-7001-0   line: 319 }

Now this happens automatically out of nowhere when the cluster was working perfectly. During this time, when I run the AQL command, I get the following:

root@v145:/usr/local/web/api# aql
2015-11-26 07:59:20 WARN AEROSPIKE_ERR_CLIENT Socket write error: 111
Error -1: Failed to seed cluster

Now this completely chokes my application on startup as the first thing I do is establish connections with all my DBs. Aeropsike connection fails and from there the app becomes completely inaccessible until i restart aerospike and the node processes. Could someone please help me with the cause of this issue and what can be done to solve this issue.

Restarting the cluster has its disadvantages and not to mention the time delay. My aerospike configuration is as follows:

# Aerospike database configuration file for deployments using mesh heartbeats.

service {
	user root
	group root
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	pidfile /var/run/aerospike/asd.pid
	service-threads 4
	transaction-queues 4
	transaction-threads-per-queue 4
	proto-fd-max 15000
}

logging {
	# Log file must be an absolute path.
	file /var/log/aerospike/aerospike.log {
		context any warning
		context udf debug
	}
}

network {
	service {
		address any
		port 3000
	}

	heartbeat {
		mode mesh
		port 3002 # Heartbeat port for this node.

		# List one or more other nodes, one ip-address & port per line:
		mesh-seed-address-port [ip_address1] 3002
		mesh-seed-address-port [ip_address2] 3002

		interval 250
		timeout 10
	}

	fabric {
		port 3001
	}

	info {
		port 3003
	}
}

namespace tlkn {
	replication-factor 2
	memory-size 2G
	default-ttl 0 # 30 days, use 0 to never expire/evict.

	ldt-enabled true

	storage-engine memory

	# To use file storage backing, comment out the line above and use the
	# following lines instead.
	storage-engine device {
		file /opt/aerospike/data/bar.dat
		filesize 8G
		data-in-memory true # Store data in memory in addition to file.
	}
}

#2

Though I doubt this is related, you should only have a single storage engine defined per namespace.

Are there any warnings in Aerospike’s log?


#3

Hi @srinivasiyer,

  • Are you on any cloud service (EC2, Google Cloud, Azure)?
  • What is the IP address you are using in the client to connect to aerospike server?
  • Can you access your server machine from the client machine independently? (you can verify this with a ping command).

Please provide us with above information for further debugging.

Thanks

Gayathri


#4

Hi @gayathri,

Sorry for the very delayed response. But the problem I have at hand is grave.

  1. I am not on any cloud service. The servers are our own and are co-located in our Data Center in Bombay and New York.
  2. I use the public IP address to connect to the aerospike server. The public IP address is configured on the switch and hence, the client does not have to go through the outside network to connect. I tried to connect using private IP but the unicast mesh heartbeat failed to create a cluster. Clustering for some strange reason was only possible on the public IP.
  3. The two servers are capable of pinging each other. I also run a mysql replication between the 2 servers and it has been seamless so far.

This time around I restarted both the servers and it is still not forming a cluster.

error":{
    "code":11,
    "message":"AEROSPIKE_ERR_CLUSTER",
    "func":"as_command_parse_result",
    "file":"src/main/aerospike/as_command.c",
    "line":1082
}

Also, this happens very randomly out of the blue. It is not an effect of any code changes. Also I lost all of my data around this time when I restarted the server. That is a bit concerning. While we are in development now, it would be a huge blow if something like this happened in production.

My configuration is the same as earlier except I upgraded aerospike to the latest version ‘aerospike-server-community-3.7.1-ubuntu12.04’

Here are the logs from my 2 servers. Let me know if you can identify the cause of this:

Server 1:

Jan 18 2016 08:07:06 GMT: INFO (paxos): (paxos.c::2744) as_paxos_retransmit_check: principal bb98e8317005452 retransmitting sync messages to nodes that have not responded yet ...
Jan 18 2016 08:07:06 GMT: INFO (paxos): (paxos.c::1497) sending sync message to bb985322b005452
Jan 18 2016 08:07:06 GMT: INFO (paxos): (paxos.c::1506) SUCCESSION [1]@bb98e8317005452: bb98e8317005452 bb985322b005452
Jan 18 2016 08:07:09 GMT: INFO (info): (thr_info.c::5084)  system memory: free 6825464kb ( 83 percent free )
Jan 18 2016 08:07:09 GMT: INFO (info): (thr_info.c::5090)  ClusterSize 1 ::: objects 0 ::: sub_objects 0
Jan 18 2016 08:07:09 GMT: INFO (info): (thr_info.c::5099)  rec refs 0 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Jan 18 2016 08:07:09 GMT: INFO (info): (thr_info.c::5104)  replica errs :: null 0 non-null 0 ::: sync copy errs :: master 0
Jan 18 2016 08:07:09 GMT: INFO (info): (thr_info.c::5114)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: iq 0 ::: dq 0 : fds - proto (8, 85, 77) : hb (2, 2, 0) : fab (20, 20, 0)
Jan 18 2016 08:07:09 GMT: INFO (info): (thr_info.c::5116)    heartbeat_received: self 0 : foreign 16113
Jan 18 2016 08:07:09 GMT: INFO (info): (thr_info.c::5117)    heartbeat_stats: bt 0 bf 16074 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 3 efd 0 efa 3 um 0 mcf 0 rc 0
Jan 18 2016 08:07:09 GMT: INFO (info): (thr_info.c::5129)    tree_counts: nsup 0 scan 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Jan 18 2016 08:07:09 GMT: INFO (info): (thr_info.c::5146) {tlkn} disk bytes used 0 : avail pct 99
Jan 18 2016 08:07:09 GMT: INFO (info): (thr_info.c::5148) {tlkn} memory bytes used 0 (index 0 : sindex 0 : data 0) : used pct 0.00
Jan 18 2016 08:07:09 GMT: INFO (info): (thr_info.c::5171) {tlkn} ldt_gc: cnt 0 io 0 gc 0 (0, 0, 0)
Jan 18 2016 08:07:09 GMT: INFO (info): (thr_info.c::5196) {tlkn} migrations - complete
Jan 18 2016 08:07:09 GMT: INFO (info): (thr_info.c::5203)    partitions: actual 4096 sync 0 desync 0 zombie 0 absent 0
Jan 18 2016 08:07:09 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec
Jan 18 2016 08:07:09 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (0 total) msec
Jan 18 2016 08:07:09 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Jan 18 2016 08:07:09 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Jan 18 2016 08:07:09 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Jan 18 2016 08:07:09 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Jan 18 2016 08:07:11 GMT: INFO (paxos): (paxos.c::2744) as_paxos_retransmit_check: principal bb98e8317005452 retransmitting sync messages to nodes that have not responded yet ...
Jan 18 2016 08:07:11 GMT: INFO (paxos): (paxos.c::1497) sending sync message to bb985322b005452
Jan 18 2016 08:07:11 GMT: INFO (paxos): (paxos.c::1506) SUCCESSION [1]@bb98e8317005452: bb98e8317005452 bb985322b005452

Server 2:

Jan 18 2016 08:08:46 GMT: WARNING (paxos): (paxos.c::407) Sync message ignored from bb98e8317005452 - [1]@bb98e8317005452 is arriving after [3]@bb98e8317005452
Jan 18 2016 08:08:46 GMT: WARNING (paxos): (paxos.c::3187) unable to apply received state from sync msg
Jan 18 2016 08:08:51 GMT: INFO (info): (thr_info.c::5084)  system memory: free 6783648kb ( 82 percent free )
Jan 18 2016 08:08:51 GMT: INFO (info): (thr_info.c::5090)  ClusterSize 2 ::: objects 0 ::: sub_objects 0
Jan 18 2016 08:08:51 GMT: INFO (info): (thr_info.c::5099)  rec refs 0 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 1070 ::: mig rx 0
Jan 18 2016 08:08:51 GMT: INFO (info): (thr_info.c::5104)  replica errs :: null 0 non-null 0 ::: sync copy errs :: master 0
Jan 18 2016 08:08:51 GMT: INFO (info): (thr_info.c::5114)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: iq 0 ::: dq 0 : fds - proto (8, 195, 187) : hb (2, 4, 2) : fab (20, 25, 5)
Jan 18 2016 08:08:51 GMT: INFO (info): (thr_info.c::5116)    heartbeat_received: self 0 : foreign 16977
Jan 18 2016 08:08:51 GMT: INFO (info): (thr_info.c::5117)    heartbeat_stats: bt 0 bf 16927 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 3 efd 0 efa 3 um 0 mcf 0 rc 2
Jan 18 2016 08:08:51 GMT: INFO (info): (thr_info.c::5129)    tree_counts: nsup 0 scan 0 dup 0 wprocess 0 migrx 0 migtx 1070 ssdr 0 ssdw 0 rw 0
Jan 18 2016 08:08:51 GMT: INFO (info): (thr_info.c::5146) {tlkn} disk bytes used 0 : avail pct 99
Jan 18 2016 08:08:51 GMT: INFO (info): (thr_info.c::5148) {tlkn} memory bytes used 0 (index 0 : sindex 0 : data 0) : used pct 0.00
Jan 18 2016 08:08:51 GMT: INFO (info): (thr_info.c::5171) {tlkn} ldt_gc: cnt 0 io 0 gc 0 (0, 0, 0)
Jan 18 2016 08:08:51 GMT: INFO (info): (thr_info.c::5194) {tlkn} migrations - remaining (2121 tx, 3101 rx), active (1 tx, 0 rx), 36.25% complete
Jan 18 2016 08:08:51 GMT: INFO (info): (thr_info.c::5203)    partitions: actual 2046 sync 2050 desync 0 zombie 0 absent 0
Jan 18 2016 08:08:51 GMT: INFO (info): (hist.c::137) histogram dump: reads (14 total) msec
Jan 18 2016 08:08:51 GMT: INFO (info): (hist.c::163)  (00: 0000000014)
Jan 18 2016 08:08:51 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (0 total) msec
Jan 18 2016 08:08:51 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Jan 18 2016 08:08:51 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Jan 18 2016 08:08:51 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Jan 18 2016 08:08:51 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Jan 18 2016 08:08:51 GMT: WARNING (paxos): (paxos.c::407) Sync message ignored from bb98e8317005452 - [1]@bb98e8317005452 is arriving after [3]@bb98e8317005452
Jan 18 2016 08:08:51 GMT: WARNING (paxos): (paxos.c::3187) unable to apply received state from sync msg


Jan 18 2016 08:08:54 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /opt/aerospike/data/bar.dat: used 0, contig-free 8191M (8191 wblocks), swb-free 0, w-q 0 w-tot 0 (0.0/s), defrag-q 0 defrag-tot 0 (0.0/s) defrag-w-tot 0 (0.0/s)