We use the community version RPM for CentOS 7 and are on 3.10.0.
It appears that Aerospike changed how it enumerates network links, IPs, and routes in 3.10.0, specifically in the enumerate_inter function in socket.c: https://github.com/aerospike/aerospike-server/blob/3.10.0/cf/src/socket.c#L1960
There’s a check that compares the old way of getting interfaces, using getifaddrs(), against the new way, using rtnetlink, and ensures that the sort order and contents of the interfaces are the same. That check is in a section of the code marked as “BEGIN PARANOIA”: https://github.com/aerospike/aerospike-server/blob/3.10.0/cf/src/socket.c#L2009
The code appears to work under normal conditions. However, the rtnetlink calls happen first, over 3 separate netlink message send/receive calls, and the getifaddrs() call is made later. That appears to leave room for a race condition: if an interface is added or removed between the two enumerations, the 2 sets of network interfaces end up out of sync with each other, even though each was correct at the point in time it was collected.
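To illustrate the kind of race I mean, here is a minimal Python sketch. It is not Aerospike's code: the function names (enumerate_twice, strict_check), the simulated interface set, and the veth name are all made up for illustration. It only shows how two snapshots that are each correct can disagree when an interface appears between them.

```python
def enumerate_twice(read_interfaces, between=lambda: None):
    """Take two snapshots of interface names, with an optional event in between.

    The first snapshot stands in for the rtnetlink enumeration, the second
    for the later getifaddrs() call (hypothetical stand-ins, not the real calls).
    """
    first = sorted(read_interfaces())
    between()  # e.g. Docker creating or destroying a veth pair
    second = sorted(read_interfaces())
    return first, second


def strict_check(first, second):
    """Return (unexpected, missing) names when comparing snapshot 2 to snapshot 1."""
    unexpected = sorted(set(second) - set(first))
    missing = sorted(set(first) - set(second))
    return unexpected, missing


# Simulated kernel state (assumed names, for illustration only).
interfaces = {"lo", "eth0", "docker0"}


def add_veth():
    # Simulates a container starting between the two enumerations.
    interfaces.add("veth1a2b3c")


first, second = enumerate_twice(lambda: set(interfaces), between=add_veth)
unexpected, missing = strict_check(first, second)
print("unexpected in second snapshot:", unexpected)
print("missing from second snapshot:", missing)
```

A strict check that treats any entry in either list as fatal would abort here, even though neither snapshot was wrong when it was taken.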
We currently run Aerospike on bare metal using the Docker engine that comes with CentOS 7: docker-1.10.3-44.el7.centos.x86_64
We use host networking in docker for the aerospike container, so the container sees all the host’s network interfaces, including those created for other docker containers that use bridge networking.
Due to various internal reasons, we have lots of other docker containers on that server that may restart rapidly. This causes many veth* interfaces to be created and destroyed in a short amount of time. Seemingly at random, during some container shutdowns there’s a pause where it takes a few seconds for the docker container to be destroyed. I noticed this when manually creating and destroying containers by starting a new container and then logging out of it. The pause would happen after logging out but before I was returned to the host bash shell. I’m not sure of the exact cause, but when this happens, our aerospike server crashes. If I just manually create and destroy containers for a while, whenever I get that random pause, aerospike dies. It outputs this message:
Oct 12 2016 05:50:04 GMT: CRITICAL (cf:socket): (socket.c:2051) Unexpected legacy-enumerated interface vethXXXXXX
Note that it may not be a veth device, it could also be a br-XXXX device or some other network device that causes the error.
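For anyone who wants to try reproducing this without logging in and out by hand, here is a rough Python sketch of the same churn. It is an assumption-laden stand-in for what I did manually: the function name churn_containers, the busybox image, and the idea of timing each run are mine, and it assumes the docker CLI is on the PATH and that the containers use the default bridge networking (so a veth pair is created and destroyed each time).

```python
import subprocess
import time


def churn_containers(count, image="busybox", run=subprocess.run):
    """Start and immediately remove `count` short-lived containers.

    Returns the wall-clock duration of each create/destroy cycle so that
    unusually slow teardowns stand out. `image` is an assumed example image;
    `run` is injectable so the loop can be exercised without Docker.
    """
    durations = []
    for _ in range(count):
        started = time.monotonic()
        # `docker run --rm` removes the container as soon as the command exits.
        run(["docker", "run", "--rm", image, "true"], check=False)
        durations.append(time.monotonic() - started)
    return durations


# Example use (requires Docker, so left commented out):
# for i, seconds in enumerate(churn_containers(50)):
#     print(f"container {i}: {seconds:.2f}s")
```

Running this while watching the aerospike log should show whether the crash lines up with the slower cycles, though I have only confirmed the behaviour with the manual steps described above.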
We haven’t had this problem on our other servers where we run aerospike in production under docker, but those other servers do not have docker containers being rapidly created and destroyed. Note that when I say rapidly, I mean up to 25 may be created or destroyed at the same time in batches depending on how things time out. In addition, the enumerate_inter() function appears to be called about every 30 seconds on our server instance, so the chance of it being called during a docker container going up or down is relatively high.
I removed the PARANOIA code, recompiled 3.10.0, built the RPM, and put it on the affected server. So far we’ve had no problems. I’m not sure what effects that might have though. The interface we use for aerospike is statically configured via the host’s network setup, so its interface and IP shouldn’t ever change in any way after host startup unless a manual change is made.
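One alternative to removing the check outright might be to re-enumerate until two consecutive snapshots agree, and only then compare. The Python sketch below is purely illustrative and is not how Aerospike works today: stable_snapshot, max_attempts, and the simulated states are hypothetical names and values I made up.

```python
def stable_snapshot(read_interfaces, max_attempts=5):
    """Re-read the interface list until two consecutive reads match.

    Returns the sorted stable list, or None if the list kept changing
    for `max_attempts` consecutive re-reads.
    """
    previous = sorted(read_interfaces())
    for _ in range(max_attempts):
        current = sorted(read_interfaces())
        if current == previous:
            return current
        previous = current
    return None


# Simulated sequence of kernel states: a veth appears after the first read.
demo_states = iter([
    {"lo", "eth0"},
    {"lo", "eth0", "veth1a2b3c"},
    {"lo", "eth0", "veth1a2b3c"},
])
print(stable_snapshot(lambda: next(demo_states)))
```

This only narrows the window rather than closing it, since an interface could still change between the last read and the comparison, so I'm not claiming it is a complete fix.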
Any thoughts?