Hello,
I am desperately trying to form an Aerospike 3.5.9 Community Edition cluster from two Docker containers running on two different physical hosts, whose IPs are 10.5.8.160 and 10.5.8.161. I would like to avoid the --net=host parameter on the docker run command line mentioned in this thread.
Here is the configuration template used by each Aerospike node:
service {
    user root
    group root
    paxos-single-replica-limit 1
    paxos-recovery-policy auto-dun-master
    pidfile /var/run/aerospike/asd.pid
    service-threads 4
    transaction-queues 4
    transaction-threads-per-queue 4
    proto-fd-max 15000
}

logging {
    file /storage/logs/aerospike.log {
        context any info
    }
    console {
        context any info
    }
}

network {
    service {
        address any
        port 3000
        access-address NODE_ADDR virtual
        network-interface-name eth0
    }
    fabric {
        port 3001
    }
    heartbeat {
        mode mesh
        port 3002
        mesh-address PEER_ADDR
        mesh-port 3002
        interval 150
        timeout 20
    }
    info {
        port 3003
    }
}

namespace testdata {
    replication-factor 2
    memory-size 1G
    default-ttl 5d
    storage-engine device {
        file /storage/data/testdata.dat
        filesize 4G
        data-in-memory true
    }
}
I am running each container as follows (a simple sed operation in a pre-run script replaces the template variables with the container's environment variables just before asd --foreground is started):
docker run -tid -e NODE_ADDR=10.5.8.160 -e PEER_ADDR=10.5.8.161 -v /opt/docker/aerospike/data:/storage/data -v /opt/docker/aerospike/logs:/storage/logs --name aerospike-node1 -p 3000:3000 -p 3001:3001 -p 3002:3002 -p 3003:3003 <myImage>
and:
docker run -tid -e NODE_ADDR=10.5.8.161 -e PEER_ADDR=10.5.8.160 -v /opt/docker/aerospike/data:/storage/data -v /opt/docker/aerospike/logs:/storage/logs --name aerospike-node2 -p 3000:3000 -p 3001:3001 -p 3002:3002 -p 3003:3003 <myImage>
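For reference, the pre-run substitution step can be sketched roughly as follows. This is illustrative only: the actual entrypoint script is not reproduced here, and the function name, template path, and asd options shown are assumptions.

```shell
#!/bin/sh
# Sketch of the pre-run substitution step (illustrative; the real entrypoint
# script, template path and asd options may differ).

# render_config TEMPLATE OUTPUT
# Replaces the NODE_ADDR and PEER_ADDR placeholders in the configuration
# template with the values of the same-named environment variables.
render_config() {
    sed -e "s/NODE_ADDR/${NODE_ADDR}/g" \
        -e "s/PEER_ADDR/${PEER_ADDR}/g" \
        "$1" > "$2"
}

# In the container entrypoint one would then run something like:
#   render_config /etc/aerospike/aerospike.conf.template /etc/aerospike/aerospike.conf
#   exec asd --config-file /etc/aerospike/aerospike.conf --foreground
```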
Each node starts cleanly and heartbeats appear to be exchanged normally, but the cluster never forms, despite the addition of the "paxos-recovery-policy auto-dun-master" setting in the service stanza.
Here is the log from the container running on 10.5.8.160 (the second container is started a few seconds later, so the first heartbeat connection failure is expected):
May 20 2015 13:22:00 GMT: INFO (cf:misc): (id.c::119) Node ip: 172.17.0.49
May 20 2015 13:22:00 GMT: INFO (cf:misc): (id.c::265) Heartbeat address for mesh: 172.17.0.49
May 20 2015 13:22:00 GMT: INFO (config): (cfg.c::3099) Rack Aware mode not enabled
May 20 2015 13:22:00 GMT: INFO (config): (cfg.c::3102) Node id bb9310011ac4202
May 20 2015 13:22:00 GMT: INFO (namespace): (namespace_cold.c::101) ns testdata beginning COLD start
May 20 2015 13:22:00 GMT: INFO (drv_ssd): (drv_ssd.c::3980) opened file /storage/data/testdata.dat: usable size 4294967296
May 20 2015 13:22:00 GMT: INFO (drv_ssd): (drv_ssd.c::1070) /storage/data/testdata.dat has 4096 wblocks of size 1048576
May 20 2015 13:22:00 GMT: INFO (drv_ssd): (drv_ssd.c::3438) device /storage/data/testdata.dat: reading device to load index
May 20 2015 13:22:00 GMT: INFO (drv_ssd): (drv_ssd.c::3443) In TID 16: Using arena #150 for loading data for namespace "testdata"
May 20 2015 13:22:00 GMT: INFO (drv_ssd): (drv_ssd.c::3464) device /storage/data/testdata.dat: read complete: UNIQUE 0 (REPLACED 0) (GEN 0) (EXPIRED 0) (MAX-TTL 0) records
May 20 2015 13:22:00 GMT: INFO (drv_ssd): (drv_ssd.c::1035) ns testdata loading free & defrag queues
May 20 2015 13:22:00 GMT: INFO (drv_ssd): (drv_ssd.c::969) /storage/data/testdata.dat init defrag profile: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
May 20 2015 13:22:00 GMT: INFO (drv_ssd): (drv_ssd.c::1059) /storage/data/testdata.dat init wblock free-q 4095, defrag-q 0
May 20 2015 13:22:00 GMT: INFO (drv_ssd): (drv_ssd.c::2663) ns testdata starting device maintenance threads
May 20 2015 13:22:00 GMT: INFO (as): (as.c::410) initializing services...
May 20 2015 13:22:00 GMT: INFO (drv_ssd): (drv_ssd.c::1741) ns testdata starting write worker threads
May 20 2015 13:22:00 GMT: INFO (drv_ssd): (drv_ssd.c::886) ns testdata starting defrag threads
May 20 2015 13:22:00 GMT: INFO (tsvc): (thr_tsvc.c::1021) shared queues: 4 queues with 4 threads each
May 20 2015 13:22:00 GMT: INFO (hb): (hb.c::2459) heartbeat socket initialization
May 20 2015 13:22:00 GMT: INFO (hb): (hb.c::2473) initializing mesh heartbeat socket : 172.17.0.49:3002
May 20 2015 13:22:00 GMT: INFO (info): (thr_info.c::5224) static external network definition
May 20 2015 13:22:00 GMT: INFO (paxos): (paxos.c::3127) partitions from storage: total 4096 found 4096 lost(set) 0 lost(unset) 0
May 20 2015 13:22:00 GMT: INFO (partition): (partition.c::4011) {testdata} 4096 partitions: found 0 absent, 4096 stored
May 20 2015 13:22:00 GMT: INFO (paxos): (paxos.c::3131) Paxos service ignited: bb9310011ac4202
May 20 2015 13:22:01 GMT: INFO (scan): (thr_tscan.c::2075) started 32 threads
May 20 2015 13:22:01 GMT: INFO (batch): (thr_batch.c::342) Initialize 4 batch worker threads.
May 20 2015 13:22:01 GMT: INFO (drv_ssd): (drv_ssd.c::4334) {testdata} floor set at 45 wblocks per device
May 20 2015 13:22:06 GMT: INFO (paxos): (paxos.c::3212) listening for other nodes (max 6000 milliseconds) ...
May 20 2015 13:22:06 GMT: INFO (hb): (hb.c::1971) connecting to remote heartbeat service at 10.5.8.161:3002
May 20 2015 13:22:12 GMT: INFO (hb): (hb.c::1055) initiated connection to mesh host at 10.5.8.161:3002 (10.5.8.161:3002) via socket 58 from 172.17.0.49:40147
May 20 2015 13:22:12 GMT: INFO (paxos): (paxos.c::3229) ... no other nodes detected - node will operate as a single-node cluster
May 20 2015 13:22:12 GMT: INFO (partition): (partition.c::4066) {testdata} 0 absent partitions promoted to master
May 20 2015 13:22:12 GMT: INFO (paxos): (paxos.c::3189) paxos supervisor thread started
May 20 2015 13:22:12 GMT: INFO (nsup): (thr_nsup.c::1289) namespace supervisor started
May 20 2015 13:22:12 GMT: INFO (ldt): (thr_nsup.c::1246) LDT supervisor started
May 20 2015 13:22:12 GMT: INFO (demarshal): (thr_demarshal.c::221) Saved original JEMalloc arena #8 for thr_demarshal()
May 20 2015 13:22:12 GMT: INFO (demarshal): (thr_demarshal.c::249) Service started: socket 3000
May 20 2015 13:22:13 GMT: INFO (demarshal): (thr_demarshal.c::221) Saved original JEMalloc arena #9 for thr_demarshal()
May 20 2015 13:22:13 GMT: INFO (demarshal): (thr_demarshal.c::221) Saved original JEMalloc arena #10 for thr_demarshal()
May 20 2015 13:22:13 GMT: INFO (demarshal): (thr_demarshal.c::221) Saved original JEMalloc arena #11 for thr_demarshal()
May 20 2015 13:22:14 GMT: INFO (demarshal): (thr_demarshal.c::726) Waiting to spawn demarshal threads ...
May 20 2015 13:22:14 GMT: INFO (demarshal): (thr_demarshal.c::729) Started 4 Demarshal Threads
May 20 2015 13:22:14 GMT: INFO (as): (as.c::450) service ready: soon there will be cake!
May 20 2015 13:22:18 GMT: INFO (hb): (hb.c::2388) new heartbeat received: bb9120011ac4202 principal node is bb9120011ac4202
May 20 2015 13:22:18 GMT: INFO (fabric): (fabric.c::1780) fabric: node bb9120011ac4202 arrived
May 20 2015 13:22:18 GMT: INFO (paxos): (paxos.c::1933) as_paxos_spark establishing transaction [1.0]@bb9310011ac4202 ClSz = 1 ; # change = 1 : ADD bb9120011ac4202;
May 20 2015 13:22:18 GMT: INFO (partition): (partition.c::392) DISALLOW MIGRATIONS
May 20 2015 13:22:18 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb9120011ac4202
May 20 2015 13:22:18 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [1.0]@bb9310011ac4202: bb9310011ac4202 bb9120011ac4202
May 20 2015 13:22:20 GMT: INFO (drv_ssd): (drv_ssd.c::2436) device /storage/data/testdata.dat: used 0, contig-free 4095M (4095 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 0 (0.0/s), defrag-q 0 defrag-tot 0 (0.0/s)
May 20 2015 13:22:22 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9120011ac4202 and self bb9310011ac4202
May 20 2015 13:22:22 GMT: INFO (paxos): (paxos.c::2516) as_paxos_retransmit_check: principal bb9310011ac4202 retransmitting sync messages to nodes that have not responded yet ...
May 20 2015 13:22:22 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb9120011ac4202
May 20 2015 13:22:22 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [1.0]@bb9310011ac4202: bb9310011ac4202 bb9120011ac4202
May 20 2015 13:22:24 GMT: INFO (info): (thr_info.c::4744) system memory: free 5398612kb ( 88 percent free )
May 20 2015 13:22:24 GMT: INFO (info): (thr_info.c::4752) migrates in progress ( 0 , 0 ) ::: ClusterSize 1 ::: objects 0 ::: sub_objects 0
May 20 2015 13:22:24 GMT: INFO (info): (thr_info.c::4760) rec refs 0 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
May 20 2015 13:22:24 GMT: INFO (info): (thr_info.c::4766) replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
May 20 2015 13:22:24 GMT: INFO (info): (thr_info.c::4776) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (0, 0, 0) : hb (2, 4, 2) : fab (17, 18, 1)
May 20 2015 13:22:24 GMT: INFO (info): (thr_info.c::4778) heartbeat_received: self 0 : foreign 116
May 20 2015 13:22:24 GMT: INFO (info): (thr_info.c::4779) heartbeat_stats: bt 0 bf 103 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 2 rc 2
May 20 2015 13:22:24 GMT: INFO (info): (thr_info.c::4792) tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
May 20 2015 13:22:24 GMT: INFO (info): (thr_info.c::4808) namespace testdata: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
May 20 2015 13:22:24 GMT: INFO (info): (thr_info.c::4853) partitions: actual 4096 sync 0 desync 0 zombie 0 wait 0 absent 0
May 20 2015 13:22:24 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec
May 20 2015 13:22:24 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (0 total) msec
May 20 2015 13:22:24 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
May 20 2015 13:22:24 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (0 total) msec
May 20 2015 13:22:24 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
May 20 2015 13:22:24 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
May 20 2015 13:22:24 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
May 20 2015 13:22:27 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9120011ac4202 and self bb9310011ac4202
May 20 2015 13:22:27 GMT: INFO (paxos): (paxos.c::2516) as_paxos_retransmit_check: principal bb9310011ac4202 retransmitting sync messages to nodes that have not responded yet ...
May 20 2015 13:22:27 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb9120011ac4202
May 20 2015 13:22:27 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [1.0]@bb9310011ac4202: bb9310011ac4202 bb9120011ac4202
May 20 2015 13:22:32 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9120011ac4202 and self bb9310011ac4202
May 20 2015 13:22:32 GMT: INFO (paxos): (paxos.c::2516) as_paxos_retransmit_check: principal bb9310011ac4202 retransmitting sync messages to nodes that have not responded yet ...
May 20 2015 13:22:32 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb9120011ac4202
May 20 2015 13:22:32 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [1.0]@bb9310011ac4202: bb9310011ac4202 bb9120011ac4202
May 20 2015 13:22:34 GMT: INFO (info): (thr_info.c::4744) system memory: free 5418928kb ( 88 percent free )
May 20 2015 13:22:34 GMT: INFO (info): (thr_info.c::4752) migrates in progress ( 0 , 0 ) ::: ClusterSize 1 ::: objects 0 ::: sub_objects 0
May 20 2015 13:22:34 GMT: INFO (info): (thr_info.c::4760) rec refs 0 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
May 20 2015 13:22:34 GMT: INFO (info): (thr_info.c::4766) replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
May 20 2015 13:22:34 GMT: INFO (info): (thr_info.c::4776) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (0, 0, 0) : hb (2, 4, 2) : fab (17, 20, 3)
May 20 2015 13:22:34 GMT: INFO (info): (thr_info.c::4778) heartbeat_received: self 0 : foreign 248
May 20 2015 13:22:34 GMT: INFO (info): (thr_info.c::4779) heartbeat_stats: bt 0 bf 235 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 2 rc 2
May 20 2015 13:22:34 GMT: INFO (info): (thr_info.c::4792) tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
May 20 2015 13:22:34 GMT: INFO (info): (thr_info.c::4808) namespace testdata: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
May 20 2015 13:22:34 GMT: INFO (info): (thr_info.c::4853) partitions: actual 4096 sync 0 desync 0 zombie 0 wait 0 absent 0
May 20 2015 13:22:34 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec
May 20 2015 13:22:34 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (0 total) msec
May 20 2015 13:22:34 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
May 20 2015 13:22:34 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (0 total) msec
May 20 2015 13:22:34 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
May 20 2015 13:22:34 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
May 20 2015 13:22:34 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
May 20 2015 13:22:37 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9120011ac4202 and self bb9310011ac4202
May 20 2015 13:22:37 GMT: INFO (paxos): (paxos.c::2516) as_paxos_retransmit_check: principal bb9310011ac4202 retransmitting sync messages to nodes that have not responded yet ...
May 20 2015 13:22:37 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb9120011ac4202
May 20 2015 13:22:37 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [1.0]@bb9310011ac4202: bb9310011ac4202 bb9120011ac4202
May 20 2015 13:22:40 GMT: INFO (drv_ssd): (drv_ssd.c::2436) device /storage/data/testdata.dat: used 0, contig-free 4095M (4095 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 0 (0.0/s), defrag-q 0 defrag-tot 0 (0.0/s)
May 20 2015 13:22:42 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9120011ac4202 and self bb9310011ac4202
May 20 2015 13:22:42 GMT: INFO (paxos): (paxos.c::2516) as_paxos_retransmit_check: principal bb9310011ac4202 retransmitting sync messages to nodes that have not responded yet ...
May 20 2015 13:22:42 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb9120011ac4202
May 20 2015 13:22:42 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [1.0]@bb9310011ac4202: bb9310011ac4202 bb9120011ac4202
May 20 2015 13:22:44 GMT: INFO (info): (thr_info.c::4744) system memory: free 5382248kb ( 88 percent free )
May 20 2015 13:22:44 GMT: INFO (info): (thr_info.c::4752) migrates in progress ( 0 , 0 ) ::: ClusterSize 1 ::: objects 0 ::: sub_objects 0
May 20 2015 13:22:44 GMT: INFO (info): (thr_info.c::4760) rec refs 0 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
May 20 2015 13:22:44 GMT: INFO (info): (thr_info.c::4766) replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
May 20 2015 13:22:44 GMT: INFO (info): (thr_info.c::4776) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (0, 0, 0) : hb (2, 4, 2) : fab (16, 22, 6)
May 20 2015 13:22:44 GMT: INFO (info): (thr_info.c::4778) heartbeat_received: self 0 : foreign 382
May 20 2015 13:22:44 GMT: INFO (info): (thr_info.c::4779) heartbeat_stats: bt 0 bf 369 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 2 rc 2
On the second container I am getting the following log:
May 20 2015 13:19:36 GMT: INFO (cf:misc): (id.c::119) Node ip: 172.17.0.18
May 20 2015 13:19:36 GMT: INFO (cf:misc): (id.c::265) Heartbeat address for mesh: 172.17.0.18
May 20 2015 13:19:36 GMT: INFO (config): (cfg.c::3099) Rack Aware mode not enabled
May 20 2015 13:19:36 GMT: INFO (config): (cfg.c::3102) Node id bb9120011ac4202
May 20 2015 13:19:36 GMT: INFO (namespace): (namespace_cold.c::101) ns testdata beginning COLD start
May 20 2015 13:19:36 GMT: INFO (drv_ssd): (drv_ssd.c::3980) opened file /storage/data/testdata.dat: usable size 4294967296
May 20 2015 13:19:36 GMT: INFO (drv_ssd): (drv_ssd.c::1070) /storage/data/testdata.dat has 4096 wblocks of size 1048576
May 20 2015 13:19:36 GMT: INFO (drv_ssd): (drv_ssd.c::3438) device /storage/data/testdata.dat: reading device to load index
May 20 2015 13:19:36 GMT: INFO (drv_ssd): (drv_ssd.c::3443) In TID 12: Using arena #150 for loading data for namespace "testdata"
May 20 2015 13:19:36 GMT: INFO (drv_ssd): (drv_ssd.c::3464) device /storage/data/testdata.dat: read complete: UNIQUE 0 (REPLACED 0) (GEN 0) (EXPIRED 0) (MAX-TTL 0) records
May 20 2015 13:19:36 GMT: INFO (drv_ssd): (drv_ssd.c::1035) ns testdata loading free & defrag queues
May 20 2015 13:19:36 GMT: INFO (drv_ssd): (drv_ssd.c::969) /storage/data/testdata.dat init defrag profile: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
May 20 2015 13:19:36 GMT: INFO (drv_ssd): (drv_ssd.c::1059) /storage/data/testdata.dat init wblock free-q 4095, defrag-q 0
May 20 2015 13:19:36 GMT: INFO (drv_ssd): (drv_ssd.c::2663) ns testdata starting device maintenance threads
May 20 2015 13:19:36 GMT: INFO (as): (as.c::410) initializing services...
May 20 2015 13:19:36 GMT: INFO (drv_ssd): (drv_ssd.c::1741) ns testdata starting write worker threads
May 20 2015 13:19:36 GMT: INFO (drv_ssd): (drv_ssd.c::886) ns testdata starting defrag threads
May 20 2015 13:19:36 GMT: INFO (tsvc): (thr_tsvc.c::1021) shared queues: 4 queues with 4 threads each
May 20 2015 13:19:36 GMT: INFO (hb): (hb.c::2459) heartbeat socket initialization
May 20 2015 13:19:36 GMT: INFO (hb): (hb.c::2473) initializing mesh heartbeat socket : 172.17.0.18:3002
May 20 2015 13:19:36 GMT: INFO (info): (thr_info.c::5224) static external network definition
May 20 2015 13:19:36 GMT: INFO (paxos): (paxos.c::3127) partitions from storage: total 4096 found 4096 lost(set) 0 lost(unset) 0
May 20 2015 13:19:36 GMT: INFO (partition): (partition.c::4011) {testdata} 4096 partitions: found 0 absent, 4096 stored
May 20 2015 13:19:36 GMT: INFO (paxos): (paxos.c::3131) Paxos service ignited: bb9120011ac4202
May 20 2015 13:19:36 GMT: INFO (scan): (thr_tscan.c::2075) started 32 threads
May 20 2015 13:19:36 GMT: INFO (batch): (thr_batch.c::342) Initialize 4 batch worker threads.
May 20 2015 13:19:36 GMT: INFO (drv_ssd): (drv_ssd.c::4334) {testdata} floor set at 45 wblocks per device
May 20 2015 13:19:42 GMT: INFO (paxos): (paxos.c::3212) listening for other nodes (max 6000 milliseconds) ...
May 20 2015 13:19:42 GMT: INFO (hb): (hb.c::1971) connecting to remote heartbeat service at 10.5.8.160:3002
May 20 2015 13:19:42 GMT: INFO (hb): (hb.c::1055) initiated connection to mesh host at 10.5.8.160:3002 (10.5.8.160:3002) via socket 59 from 172.17.0.18:49048
May 20 2015 13:19:42 GMT: INFO (paxos): (paxos.c::3220) ... other node(s) detected - node will operate in a multi-node cluster
May 20 2015 13:19:42 GMT: INFO (paxos): (paxos.c::3189) paxos supervisor thread started
May 20 2015 13:19:42 GMT: INFO (nsup): (thr_nsup.c::1289) namespace supervisor started
May 20 2015 13:19:42 GMT: INFO (ldt): (thr_nsup.c::1246) LDT supervisor started
May 20 2015 13:19:42 GMT: INFO (demarshal): (thr_demarshal.c::221) Saved original JEMalloc arena #8 for thr_demarshal()
May 20 2015 13:19:42 GMT: INFO (hb): (hb.c::2388) new heartbeat received: bb9310011ac4202 principal node is bb9310011ac4202
May 20 2015 13:19:42 GMT: INFO (fabric): (fabric.c::1780) fabric: node bb9310011ac4202 arrived
May 20 2015 13:19:42 GMT: INFO (paxos): (paxos.c::2151) Skip node arrival bb9310011ac4202 cluster principal bb9120011ac4202 pulse principal bb9310011ac4202
May 20 2015 13:19:42 GMT: INFO (demarshal): (thr_demarshal.c::249) Service started: socket 3000
May 20 2015 13:19:43 GMT: INFO (demarshal): (thr_demarshal.c::221) Saved original JEMalloc arena #10 for thr_demarshal()
May 20 2015 13:19:43 GMT: INFO (demarshal): (thr_demarshal.c::221) Saved original JEMalloc arena #11 for thr_demarshal()
May 20 2015 13:19:43 GMT: INFO (demarshal): (thr_demarshal.c::221) Saved original JEMalloc arena #12 for thr_demarshal()
May 20 2015 13:19:44 GMT: INFO (demarshal): (thr_demarshal.c::726) Waiting to spawn demarshal threads ...
May 20 2015 13:19:44 GMT: INFO (demarshal): (thr_demarshal.c::729) Started 4 Demarshal Threads
May 20 2015 13:19:44 GMT: INFO (as): (as.c::450) service ready: soon there will be cake!
May 20 2015 13:19:47 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9310011ac4202 and self bb9120011ac4202
May 20 2015 13:19:52 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9310011ac4202 and self bb9120011ac4202
May 20 2015 13:19:54 GMT: INFO (info): (thr_info.c::4744) system memory: free 5432856kb ( 88 percent free )
May 20 2015 13:19:54 GMT: INFO (info): (thr_info.c::4752) migrates in progress ( 0 , 0 ) ::: ClusterSize 1 ::: objects 0 ::: sub_objects 0
May 20 2015 13:19:54 GMT: INFO (info): (thr_info.c::4760) rec refs 0 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
May 20 2015 13:19:54 GMT: INFO (info): (thr_info.c::4766) replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
May 20 2015 13:19:54 GMT: INFO (info): (thr_info.c::4776) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (0, 0, 0) : hb (2, 2, 0) : fab (16, 16, 0)
May 20 2015 13:19:54 GMT: INFO (info): (thr_info.c::4778) heartbeat_received: self 0 : foreign 200
May 20 2015 13:19:54 GMT: INFO (info): (thr_info.c::4779) heartbeat_stats: bt 0 bf 159 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0
May 20 2015 13:19:54 GMT: INFO (info): (thr_info.c::4792) tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
May 20 2015 13:19:54 GMT: INFO (info): (thr_info.c::4808) namespace testdata: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
May 20 2015 13:19:54 GMT: INFO (info): (thr_info.c::4853) partitions: actual 4096 sync 0 desync 0 zombie 0 wait 0 absent 0
May 20 2015 13:19:54 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec
May 20 2015 13:19:54 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (0 total) msec
May 20 2015 13:19:54 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
May 20 2015 13:19:54 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (0 total) msec
May 20 2015 13:19:54 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
May 20 2015 13:19:54 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
May 20 2015 13:19:54 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
May 20 2015 13:19:56 GMT: INFO (drv_ssd): (drv_ssd.c::2436) device /storage/data/testdata.dat: used 0, contig-free 4095M (4095 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 0 (0.0/s), defrag-q 0 defrag-tot 0 (0.0/s)
May 20 2015 13:19:57 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9310011ac4202 and self bb9120011ac4202
May 20 2015 13:20:02 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9310011ac4202 and self bb9120011ac4202
May 20 2015 13:20:04 GMT: INFO (info): (thr_info.c::4744) system memory: free 5409236kb ( 88 percent free )
May 20 2015 13:20:04 GMT: INFO (info): (thr_info.c::4752) migrates in progress ( 0 , 0 ) ::: ClusterSize 1 ::: objects 0 ::: sub_objects 0
May 20 2015 13:20:04 GMT: INFO (info): (thr_info.c::4760) rec refs 0 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
May 20 2015 13:20:04 GMT: INFO (info): (thr_info.c::4766) replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
May 20 2015 13:20:04 GMT: INFO (info): (thr_info.c::4776) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (0, 0, 0) : hb (2, 2, 0) : fab (16, 16, 0)
May 20 2015 13:20:04 GMT: INFO (info): (thr_info.c::4778) heartbeat_received: self 0 : foreign 334
May 20 2015 13:20:04 GMT: INFO (info): (thr_info.c::4779) heartbeat_stats: bt 0 bf 293 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0
May 20 2015 13:20:04 GMT: INFO (info): (thr_info.c::4792) tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
May 20 2015 13:20:04 GMT: INFO (info): (thr_info.c::4808) namespace testdata: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
May 20 2015 13:20:04 GMT: INFO (info): (thr_info.c::4853) partitions: actual 4096 sync 0 desync 0 zombie 0 wait 0 absent 0
May 20 2015 13:20:04 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec
May 20 2015 13:20:04 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (0 total) msec
May 20 2015 13:20:04 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
May 20 2015 13:20:04 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (0 total) msec
May 20 2015 13:20:04 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
May 20 2015 13:20:04 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
May 20 2015 13:20:04 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
May 20 2015 13:20:07 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9310011ac4202 and self bb9120011ac4202
May 20 2015 13:20:12 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9310011ac4202 and self bb9120011ac4202
May 20 2015 13:20:12 GMT: INFO (hb): (hb.c::2319) HB node bb9310011ac4202 in different cluster - succession lists don't match
May 20 2015 13:20:14 GMT: INFO (info): (thr_info.c::4744) system memory: free 5422816kb ( 88 percent free )
May 20 2015 13:20:14 GMT: INFO (info): (thr_info.c::4752) migrates in progress ( 0 , 0 ) ::: ClusterSize 1 ::: objects 0 ::: sub_objects 0
May 20 2015 13:20:14 GMT: INFO (info): (thr_info.c::4760) rec refs 0 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
May 20 2015 13:20:14 GMT: INFO (info): (thr_info.c::4766) replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
May 20 2015 13:20:14 GMT: INFO (info): (thr_info.c::4776) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (0, 0, 0) : hb (2, 2, 0) : fab (16, 16, 0)
May 20 2015 13:20:14 GMT: INFO (info): (thr_info.c::4778) heartbeat_received: self 0 : foreign 466
May 20 2015 13:20:14 GMT: INFO (info): (thr_info.c::4779) heartbeat_stats: bt 0 bf 425 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0
May 20 2015 13:20:14 GMT: INFO (info): (thr_info.c::4792) tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
May 20 2015 13:20:14 GMT: INFO (info): (thr_info.c::4808) namespace testdata: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
The two nodes clearly see each other and keep trying to synchronize, but the sync never completes and each node stays in its own single-node cluster. Can you help me troubleshoot the issue so the cluster can form successfully?
Thanks in advance,
Julien