Mesh heartbeat - "unable to parse heartbeat message" [Released] [Resolved]


#1

The issue discussed in this topic was resolved in server release 3.3.26.

by alyssa@nodeprime.com » Fri Aug 01, 2014 2:09 pm

Hi,

I’m having trouble getting two nodes to cluster in mesh mode. One node is 192.168.0.147 and the second is 192.168.0.146. After I bring up the second node, I get the log output pasted below my signature.

Any ideas?

Thanks, Alyssa

Aug 01 2014 21:01:25 GMT: INFO (hb): (hb.c::1391) connecting to remote heartbeat service at 192.168.0.147:3002
Aug 01 2014 21:01:25 GMT: INFO (demarshal): (thr_demarshal.c::208) Saved original JEMalloc arena #4 for thr_demarshal()
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::194) msg_parse: but not enough data! will get called again. buf 0x7ff048bc6580 len 2048 need 6841094
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::194) msg_parse: but not enough data! will get called again. buf 0x7ff048bc6580 len 2048 need 132102
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::194) msg_parse: but not enough data! will get called again. buf 0x7ff048bc6580 len 2048 need 134218504
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::194) msg_parse: but not enough data! will get called again. buf 0x7ff048bc6580 len 2048 need 33554438
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::194) msg_parse: but not enough data! will get called again. buf 0x7ff048bc6580 len 2048 need 16779270
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::201) msg_parse: trying to parse incoming type 0 into msg type 1, bad bad
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::201) msg_parse: trying to parse incoming type 0 into msg type 1, bad bad
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::201) msg_parse: trying to parse incoming type 0 into msg type 1, bad bad
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::201) msg_parse: trying to parse incoming type 0 into msg type 1, bad bad
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::201) msg_parse: trying to parse incoming type 0 into msg type 1, bad bad
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::201) msg_parse: trying to parse incoming type 0 into msg type 1, bad bad
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::201) msg_parse: trying to parse incoming type 0 into msg type 1, bad bad
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::201) msg_parse: trying to parse incoming type 0 into msg type 1, bad bad
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::201) msg_parse: trying to parse incoming type 0 into msg type 1, bad bad
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::201) msg_parse: trying to parse incoming type 0 into msg type 1, bad bad
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (demarshal): (thr_demarshal.c::236) Service started: socket 3000
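
As a rough aid for reading the log excerpt above, here is a minimal Python sketch that pulls the `len` and `need` values out of the `msg_parse` lines and counts the heartbeat warnings. The regular expressions are written against the lines quoted in this post only, the function name `summarize_heartbeat_log` is hypothetical, and this is not an Aerospike tool.

```python
import re

# Patterns written against the log lines quoted in this thread (assumption:
# other server versions may word these messages differently).
NEED_RE = re.compile(r"msg_parse: but not enough data!.*\blen (\d+) need (\d+)")
TYPE_RE = re.compile(r"msg_parse: trying to parse incoming type (\d+) into msg type (\d+)")
WARN_RE = re.compile(r"WARNING \(hb\):.*unable to parse heartbeat message")


def summarize_heartbeat_log(text):
    """Count warnings and type mismatches, and collect (len, need) pairs."""
    summary = {"warnings": 0, "type_mismatches": 0, "short_reads": []}
    for line in text.splitlines():
        if WARN_RE.search(line):
            summary["warnings"] += 1
            continue
        match = NEED_RE.search(line)
        if match:
            summary["short_reads"].append((int(match.group(1)), int(match.group(2))))
            continue
        if TYPE_RE.search(line):
            summary["type_mismatches"] += 1
    return summary


# Four lines copied from the excerpt above, used as sample input.
SAMPLE = """\
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::194) msg_parse: but not enough data! will get called again. buf 0x7ff048bc6580 len 2048 need 6841094
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
Aug 01 2014 21:01:25 GMT: INFO (cf:msg): (msg.c::201) msg_parse: trying to parse incoming type 0 into msg type 1, bad bad
Aug 01 2014 21:01:25 GMT: WARNING (hb): (hb.c::1450) unable to parse heartbeat message
"""

if __name__ == "__main__":
    result = summarize_heartbeat_log(SAMPLE)
    print(result)
    for have, need in result["short_reads"]:
        # Report how much larger the requested size is than the buffer.
        print(f"buffer {have} bytes, parser wants {need} bytes ({need / have:.0f}x)")
```

In the full excerpt, the five `need` values range from 132102 to 134218504 bytes against a 2048-byte buffer. One plausible reading is that the parser is taking its length field from data that is not a well-formed heartbeat message, though the log alone does not prove that.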

#2

by devops02 » Fri Aug 01, 2014 3:41 pm

Hi Alyssa,

Could you paste your configuration here so I can take a look?

Thanks,

Jerry


#3

by devops02 » Fri Aug 01, 2014 3:45 pm

I also forgot to ask about the environment you're testing on: is it bare metal or in the cloud?

Jerry


#4

by alyssa@nodeprime.com » Fri Aug 01, 2014 4:43 pm

Here’s the 192.168.0.147 configuration:

service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        service-threads 4
        transaction-queues 4
        transaction-threads-per-queue 4
        proto-fd-max 15000
}

logging {
        # Log file must be an absolute path.
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address any
                access-address 192.168.0.147
                port 3000
        }

        heartbeat {
                # mode multicast
                # address 239.1.99.222
                # port 9918

                # To use unicast-mesh heartbeats, comment out the 3 lines above and
                # use the following 4 lines instead.
                mode mesh
                address 192.168.0.147
                port 3002
                mesh-address 192.168.0.146
                mesh-port 3002

                interval 250
                timeout 50
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}

Here's the configuration for 192.168.0.146:

service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        service-threads 4
        transaction-queues 4
        transaction-threads-per-queue 4
        proto-fd-max 15000
}

logging {
        # Log file must be an absolute path.
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address any
                access-address 192.168.0.146
                port 3000
        }

        heartbeat {
                # mode multicast
                # address 239.1.99.222
                # port 9918

                # To use unicast-mesh heartbeats, comment out the 3 lines above and
                # use the following 4 lines instead.
                mode mesh
                address 192.168.0.146
                port 3002
                mesh-address 192.168.0.147
                mesh-port 3002

                interval 250
                timeout 50
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}
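
The two heartbeat stanzas above can be cross-checked mechanically. The sketch below is a two-node sanity check only, not a rule for mesh clusters in general: it assumes each node's `mesh-address`/`mesh-port` should match the address and port the other node's heartbeat listens on, as in the configs posted here. The parser is deliberately naive, and the names `parse_heartbeat` and `check_mesh_pair` are hypothetical.

```python
def parse_heartbeat(conf_text):
    """Extract key/value pairs from the first heartbeat { ... } stanza."""
    settings = {}
    inside = False
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not inside:
            if line.startswith("heartbeat") and line.endswith("{"):
                inside = True
            continue
        if line == "}":
            break  # end of the heartbeat stanza
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1]
    return settings


def check_mesh_pair(conf_a, conf_b):
    """Return a list of problems; an empty list means the pair agrees."""
    a, b = parse_heartbeat(conf_a), parse_heartbeat(conf_b)
    problems = []
    for name, hb in (("A", a), ("B", b)):
        if hb.get("mode") != "mesh":
            problems.append(f"node {name}: mode is {hb.get('mode')!r}, expected 'mesh'")
    if a.get("mesh-address") != b.get("address"):
        problems.append("A's mesh-address does not match B's heartbeat address")
    if b.get("mesh-address") != a.get("address"):
        problems.append("B's mesh-address does not match A's heartbeat address")
    if a.get("mesh-port") != b.get("port"):
        problems.append("A's mesh-port does not match B's heartbeat port")
    if b.get("mesh-port") != a.get("port"):
        problems.append("B's mesh-port does not match A's heartbeat port")
    return problems


# Heartbeat stanzas copied from the two configurations in this post.
CONF_147 = """
network {
        heartbeat {
                # mode multicast
                mode mesh
                address 192.168.0.147
                port 3002
                mesh-address 192.168.0.146
                mesh-port 3002
                interval 250
                timeout 50
        }
}
"""
CONF_146 = CONF_147.replace("192.168.0.147", "TMP").replace(
    "192.168.0.146", "192.168.0.147").replace("TMP", "192.168.0.146")

if __name__ == "__main__":
    print(check_mesh_pair(CONF_147, CONF_146))
```

For the two stanzas as posted, the function returns an empty list: the addresses and ports agree, so this particular check does not explain the warnings.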

Also, I’m testing on my laptop in two VirtualBox VMs running Ubuntu, managed through Vagrant. A bridged network connects the VMs, and these IPs are assigned on it. The bridge is interface eth1 on both.

eth0      Link encap:Ethernet  HWaddr 08:00:27:88:0c:a6
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe88:ca6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:34485 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21907 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:27899653 (27.8 MB)  TX bytes:1871583 (1.8 MB)

eth1      Link encap:Ethernet  HWaddr 08:00:27:1e:21:fd
          inet addr:192.168.0.146  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:189513 errors:0 dropped:0 overruns:0 frame:0
          TX packets:83717 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:29037468 (29.0 MB)  TX bytes:19406316 (19.4 MB)

#5

by devops01 » Fri Aug 01, 2014 5:58 pm

Hi Alyssa,

This is a known issue on large mesh clusters. We are expecting a build by the end of the month to resolve this error.

best, Lucien


#6

by devops01 » Mon Aug 04, 2014 4:34 pm

Since you have two NICs, you may need to specify the interface, similar to this:

service {
        address any
        port 3000
        access-address 192.168.0.147
        network-interface-name eth1
}
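
To decide which name to put in `network-interface-name`, it helps to confirm which interface actually carries the access address. The Python sketch below does that by scanning `ifconfig` output in the older net-tools format pasted earlier in this thread. The function name `interface_for_ip` is hypothetical, and the parsing assumes that format (`Link encap:` and `inet addr:`); output from `ip addr` or newer `ifconfig` versions looks different.

```python
import re

# Matches the first line of an interface block, e.g. "eth1  Link encap:Ethernet ..."
IFACE_RE = re.compile(r"^(\S+)\s+Link encap:")
# Matches the IPv4 line, e.g. "inet addr:192.168.0.146  Bcast:..."
INET_RE = re.compile(r"\binet addr:(\S+)")


def interface_for_ip(ifconfig_text, ip):
    """Return the interface name that holds `ip`, or None if not found."""
    current = None
    for line in ifconfig_text.splitlines():
        iface = IFACE_RE.match(line)
        if iface:
            current = iface.group(1)
            continue
        inet = INET_RE.search(line)
        if inet and inet.group(1) == ip:
            return current
    return None


# Abbreviated copy of the ifconfig output posted for the 192.168.0.146 node.
IFCONFIG_SAMPLE = """\
eth0      Link encap:Ethernet  HWaddr 08:00:27:88:0c:a6
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe88:ca6/64 Scope:Link

eth1      Link encap:Ethernet  HWaddr 08:00:27:1e:21:fd
          inet addr:192.168.0.146  Bcast:192.168.0.255  Mask:255.255.255.0
"""

if __name__ == "__main__":
    print(interface_for_ip(IFCONFIG_SAMPLE, "192.168.0.146"))
```

On the posted output this reports eth1 for 192.168.0.146, which matches the interface named in the example above.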
