New enumerate_inter code in 3.10.0 causes fatal error

Jason_Cross · October 13, 2016, 3:30am

We use the community version RPM for CentOS 7 and are on 3.10.0.

It appears that aerospike changed its way of enumerating network links, IPs, and routes with 3.10.0. Specifically, the enumerate_inter function in socket.c: https://github.com/aerospike/aerospike-server/blob/3.10.0/cf/src/socket.c#L1960

There’s some sort of check that compares the old way of getting interfaces using getifaddrs() to the new way using rtnetlink and ensures that the sort order and contents of the interfaces are the same. That check is in a section of the code marked as “BEGIN PARANOIA”: https://github.com/aerospike/aerospike-server/blob/3.10.0/cf/src/socket.c#L2009

While the code appears to normally work, the rtnetlink calls happen first, over 3 different netlink msg send/receive calls, and only later is the getifaddrs() call made. So it appears there is a race condition that can cause the 2 sets of network interfaces to be out of sync with each other, even though each is correct for the point in time at which it was collected.
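As a rough illustration of that race, and of one way to tolerate it, here is a minimal Python sketch. It is not the socket.c logic: `snapshot_interfaces` and `stable_snapshot` are hypothetical names, and the retry-until-stable approach is an assumption about how two enumerations taken at different times could be reconciled rather than treated as a fatal mismatch.

```python
import socket
from typing import Callable, FrozenSet


def snapshot_interfaces() -> FrozenSet[str]:
    """Return the interface names visible right now (Unix only)."""
    return frozenset(name for _index, name in socket.if_nameindex())


def stable_snapshot(
    take: Callable[[], FrozenSet[str]] = snapshot_interfaces,
    max_attempts: int = 5,
) -> FrozenSet[str]:
    """Take snapshots until two consecutive ones agree, then return that set.

    A veth* device created or destroyed between two snapshots makes them
    differ; instead of aborting, the comparison is simply retried.
    """
    previous = take()
    for _ in range(max_attempts):
        current = take()
        if current == previous:
            return current
        previous = current
    raise RuntimeError("interface list kept changing between snapshots")
```

Calling `stable_snapshot()` with no arguments uses the live interface list; passing a custom `take` callable makes it possible to replay a recorded sequence of snapshots.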

We currently run aerospike on bare metal using the docker engine that comes with CentOS 7: docker-1.10.3-44.el7.centos.x86_64

We use host networking in docker for the aerospike container, so the container sees all the host’s network interfaces, including those created for other docker containers that use bridge networking.

Due to various internal reasons, we have lots of other docker containers on that server that may restart rapidly. This causes many veth* interfaces to be created and destroyed in a short amount of time. Seemingly at random, during some container shutdowns there’s a pause where it takes a few seconds for the docker container to be destroyed. I noticed this when manually creating and destroying containers by starting up a new container and logging out of it. The pause would happen after logging out but before I was returned to the host bash shell. I’m not sure of the exact cause, but when this happens, our aerospike server crashes. If I just manually create and destroy containers for a while, whenever I get that random pause, aerospike dies. It outputs this message:

Oct 12 2016 05:50:04 GMT: CRITICAL (cf:socket): (socket.c:2051) Unexpected legacy-enumerated interface vethXXXXXX

Note that it may not be a veth device, it could also be a br-XXXX device or some other network device that causes the error.

We haven’t had this problem on our other servers where we run aerospike in production under docker, but those other servers do not have docker containers being rapidly created and destroyed. Note that when I say rapidly, I mean up to 25 may be created or destroyed at the same time in batches depending on how things time out. In addition, the enumerate_inter() function appears to be called about every 30 seconds on our server instance, so the chance of it being called during a docker container going up or down is relatively high.

I removed the PARANOIA code, recompiled 3.10.0, built the rpm, and put it on the affected server. So far we’ve had no problems. I’m not sure what effects that might have though. The interface we use for aerospike is statically configured via the host’s network setup. So, its interface and IP shouldn’t ever change in any way after host startup unless a manual change is made.

Any thoughts?

tlo · October 13, 2016, 5:24am

Thanks for the detailed analysis and report, Jason. Yours is indeed a use case that we didn’t think of when we put the paranoia code in there. I’m sorry to hear that this is causing you trouble.

TLDR: Removing the code is OK, you should still be fine.

Here’s how the paranoia code came about: Aerospike 3.10 supports IPv6 and we redid the whole networking API to abstract away the differences between IPv4 and IPv6. In this context, we revisited a few other past decisions and ended up moving interface enumeration from glibc’s getifaddrs() to netlink.

Here’s the paranoia part: The MAC address of one of the enumerated interfaces is included in the node ID. If anything changes about the enumeration, then the node IDs of a cluster could change. So, you’d upgrade to 3.10 (new enumeration) from an earlier version (old enumeration) and suddenly your node IDs would all be different.

That’s why, for now, we have the paranoia code in place that compares the new netlink enumeration to the old getifaddrs() enumeration.

We did look at the glibc source code to ascertain that what we’re doing is equivalent to what glibc’s doing. But still. Who knows what all older versions of glibc do? Or glibc versions patched by the individual Linux distributions? That’s why we decided to be paranoid.

All in all, removing the paranoia code is fine. It’ll likely go away in 3.11 anyway.

Jason_Cross · October 17, 2016, 9:45pm

Awesome, thanks for the quick and detailed reply!

blonkel · October 30, 2016, 7:04pm

We are also having trouble with the new network code:

Oct 24 2016 18:03:41 GMT: CRITICAL (cf:socket): (socket.c:2059) Extraneous interface veth6d4d1e1
Oct 30 2016 16:11:41 GMT: CRITICAL (fabric): (fabric.c:1839) Could not bind note server name : 98 Address already in use
Oct 30 2016 16:12:13 GMT: CRITICAL (fabric): (fabric.c:1839) Could not bind note server name : 98 Address already in use
Oct 30 2016 16:12:46 GMT: CRITICAL (fabric): (fabric.c:1839) Could not bind note server name : 98 Address already in use
Oct 30 2016 16:14:27 GMT: CRITICAL (fabric): (fabric.c:1839) Could not bind note server name : 98 Address already in use
Oct 30 2016 16:15:02 GMT: CRITICAL (fabric): (fabric.c:1839) Could not bind note server name : 98 Address already in use
Oct 30 2016 18:47:40 GMT: CRITICAL (cf:socket): (socket.c:2051) Unexpected legacy-enumerated interface vethd7e2fb9
Oct 30 2016 18:48:22 GMT: CRITICAL (fabric): (fabric.c:1839) Could not bind note server name : 98 Address already in use
Oct 30 2016 18:49:02 GMT: CRITICAL (fabric): (fabric.c:1839) Could not bind note server name : 98 Address already in use

The reason is more or less the same as above. We are building docker containers on our production server, and the build spawns interfaces that are removed again later on.

We didn’t have any issues before the network upgrade. Is there an ETA for a proper fix?

moon · April 27, 2017, 2:10am

This error still exists in 3.11 (docker container, net host):

Apr 27 2017 02:01:05 GMT: FAILED ASSERTION (cf:socket): (socket.c:1812) Too many interfaces
Apr 27 2017 02:01:05 GMT: WARNING (as): (signal.c:210) SIGUSR1 received, aborting

Aerospike Community Edition build 3.11.1.1, OS ubuntu16.04

tlo · April 27, 2017, 9:20am

Thank you for reporting this. This is very odd. It means that, when running inside the Docker container, Aerospike detects more than 50 (!) network interfaces.

50 is an arbitrary limit on the number of supported network interfaces that we picked, because it seemed “high enough for everyone.” It can easily be increased (MAX_INTERS in socket.c). But let’s first make sure that this is not caused by something else, i.e., a bug in our interface enumeration code.

Let’s dig a little deeper, please. Two things that I’d like to ask:

1. Can you run “ip link show” inside the container and share the output? This would list the interfaces inside the container. Let’s see how many there are and what they look like.
2. Can you enable socket debug logging in Aerospike and share the log file? Simply add “context cf:socket detail” to the log file configuration section of your aerospike.conf file:
```
logging {
    file [...]/aerospike.log {
        context cf:socket detail
        [...]
    }
}
```
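As a quick cross-check alongside “ip link show”, a small Python sketch (not part of Aerospike) can count the interfaces a process sees and compare that against the limit of 50 mentioned above. `MAX_INTERS` here only mirrors the name from socket.c, the helper names are hypothetical, and the exact comparison the server performs is an assumption based on the “more than 50” description.

```python
import socket

# Mirrors the limit quoted in this thread; the value compiled into a given
# server build may differ.
MAX_INTERS = 50


def count_interfaces() -> int:
    """Return the number of network interfaces visible to this process (Unix only)."""
    return len(socket.if_nameindex())


def exceeds_limit(n_interfaces: int, limit: int = MAX_INTERS) -> bool:
    """Assumed check: True when the interface count is above the limit."""
    return n_interfaces > limit
```

Running `exceeds_limit(count_interfaces())` inside the container (with host networking, that means all host interfaces, including veth* and br-* devices) gives a rough idea of whether the “Too many interfaces” assertion is plausible on that machine.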
