Originally posted by srknc on Mon Aug 04, 2014 10:07 am
Hi,
I’m discovering aerospike cluster on amazon ec2 and having some trouble about durability. Because aws ec2 doesn’t support multicast protocol, i’ve configured instances using unicast(mest). As i understand from limited documentation, there are two methods to configure unicast
- single seed configuration
- ring seed configuration.
To avoid possible problem when removing first seeding node, i’ve decided to use ring seed method. Configuration below (network stanza)
[server1]
heartbeat {
    mode mesh
    port 3002
    mesh-address [server2 address]
    mesh-port 3002
    interval 150
    timeout 50
}

[server2]
heartbeat {
    mode mesh
    port 3002
    mesh-address [server3 address]
    mesh-port 3002
    interval 150
    timeout 50
}

[server3]
heartbeat {
    mode mesh
    port 3002
    mesh-address [server4 address]
    mesh-port 3002
    interval 150
    timeout 50
}

[server4]
heartbeat {
    mode mesh
    port 3002
    mesh-address [server1 address]
    mesh-port 3002
    interval 150
    timeout 50
}
With this configuration, I started the services one by one (beginning with server1), and everything appeared to work as expected.
I then configured the PHP driver and wrote a basic script that sends a constant stream of write requests to server3.
When I stop the Aerospike service on server2, the Aerospike service on server3 (where the PHP code is running) refuses connections for ~5 seconds. All servers then detect the new cluster structure (I saw a CLUSTER SIZE = 3 line in the log file) and server3 starts accepting connections again. So I lose data for about 5 seconds.
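One way I could work around that ~5 second window on the client side is to retry failed writes with a short backoff until the cluster re-forms. Here is a minimal sketch in Python (my actual script is PHP, and `FlakyServer`/`write_with_retry` are stand-ins I made up to simulate a node that refuses connections during re-clustering, not a real Aerospike API):

```python
import time

def write_with_retry(write_fn, key, value, retries=5, backoff=0.5):
    """Retry a write while the cluster re-forms; backoff grows linearly.
    write_fn stands in for the real client call (e.g. a put)."""
    last_err = None
    for attempt in range(retries):
        try:
            return write_fn(key, value)
        except ConnectionError as err:  # node refuses connections while re-clustering
            last_err = err
            time.sleep(backoff * (attempt + 1))
    raise last_err

class FlakyServer:
    """Simulates a node that rejects the first few writes, then recovers."""
    def __init__(self, failures):
        self.failures = failures
        self.store = {}
    def put(self, key, value):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("connection refused")
        self.store[key] = value
        return "ok"

server = FlakyServer(failures=2)
result = write_with_retry(server.put, "user:1", {"name": "srknc"}, backoff=0.01)
print(result)        # ok
print(server.store)  # {'user:1': {'name': 'srknc'}}
```

This only masks short outages, of course; it doesn’t help with records lost inside the cluster itself.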
I decided to run another test and simply restarted the Aerospike service on server4. The server2 Aerospike log entries are below:
Aug 04 2014 16:33:20 GMT: INFO (partition): (partition.c::2834) CLUSTER SIZE = 4
Aug 04 2014 16:33:35 GMT: INFO (paxos): (paxos.c::2598) SINGLE NODE CLUSTER!!!
Aug 04 2014 16:33:35 GMT: INFO (partition): (partition.c::2834) CLUSTER SIZE = 1
Aug 04 2014 16:33:49 GMT: INFO (partition): (partition.c::2834) CLUSTER SIZE = 2
Aug 04 2014 16:33:51 GMT: INFO (partition): (partition.c::2834) CLUSTER SIZE = 3
I didn’t check data consistency, but I’m sure I lost lots of records.
Now, with no read or write requests running, I restarted the server4 asd service 3 minutes ago; server2 is logging the error below:
WARNING (hb): (hb.c::1500) cf_socket_sendto() failed 2
INFO (cf:socket): (socket.c::176) sendto() failed: 11 Resource temporarily unavailable
To resolve the problem I restarted the instance hosting server2, and now the nodes report an integrity error: Aug 04 2014 16:50:54 GMT: INFO (paxos): (paxos.c::2207) CLUSTER INTEGRITY FAULT.
Network usage during the problem:
server1: 21 Mbps out, 52 Mbps in
server2: 155 Mbps out, 26 Mbps in
server3: 17 Mbps out, 65 Mbps in
server4: 20 Mbps out, 58 Mbps in
In summary: I was just testing durability with a basic configuration, and the only thing I did was restart services while giving the cluster time to repair itself. In my opinion, the mesh (unicast) heartbeat method is not production-ready. If I’m missing something, or if you have suggestions to make the cluster more durable, please share them, because I really would like to use this product.
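For reference, one alternative I’m considering instead of the ring (assuming my server version supports multiple `mesh-seed-address-port` entries; the addresses below are placeholders): have every node list all of its peers as seeds, so that restarting any single node never breaks another node’s seeding path.

```
heartbeat {
    mode mesh
    port 3002
    # list every other node; any one reachable peer is enough to join
    mesh-seed-address-port 10.0.0.1 3002
    mesh-seed-address-port 10.0.0.2 3002
    mesh-seed-address-port 10.0.0.3 3002
    interval 150
    timeout 50
}
```

I have not verified whether the version I’m running accepts this syntax, so please correct me if this is wrong.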
Thank you.