Aerospike CE 3.14.1.1 crashes with "Error while enumerating network links"

crash
docker

#1

Currently we see the following error messages on our servers:

Jul 14 2017 19:17:02 GMT: WARNING (socket): (socket.c:1701) Received netlink error message
Jul 14 2017 19:17:02 GMT: FAILED ASSERTION (socket): (socket.c:2040) Error while enumerating network links
Jul 14 2017 19:17:02 GMT: WARNING (as): (signal.c:210) SIGUSR1 received, aborting Aerospike Community Edition build 3.14.1.1 os ubuntu16.04
Jul 14 2017 19:17:02 GMT: FAILED ASSERTION (socket): (socket.c:1903) Invalid interface index: 1
Jul 14 2017 19:17:02 GMT: WARNING (as): (signal.c:210) SIGUSR1 received, aborting Aerospike Community Edition build 3.14.1.1 os ubuntu16.04
Jul 14 2017 19:17:02 GMT: WARNING (as): (signal.c:79) could not register default signal handler for 6

The error is quite rare (about once per week) and leads to a node restart.

Aerospike runs inside a Docker container (managed by Kubernetes).

uname -a
4.4.61-1.el7.elrepo.x86_64 #1 SMP Wed Apr 12 11:53:28 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

docker info
Containers: 18
 Running: 14
 Paused: 0
 Stopped: 4
Images: 25
Server Version: 1.12.6
Storage Driver: devicemapper
 Pool Name: vg_system-thinpool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 4.631 GB
 Data Space Total: 536.9 GB
 Data Space Available: 532.2 GB
 Metadata Space Used: 3.432 MB
 Metadata Space Total: 10.74 GB
 Metadata Space Available: 10.73 GB
 Thin Pool Minimum Free Space: 53.69 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2016-06-09)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null bridge overlay host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 4.4.61-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 62.66 GiB
Name: -----
ID: ----
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:

Aerospike Conf:

service {
  user root
  group root
  paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
  pidfile /var/run/aerospike/asd.pid
  service-threads 4
  transaction-queues 4
  transaction-threads-per-queue 4
  migrate-threads 4
  migrate-max-num-incoming 4
  proto-fd-max 15000
  scan-threads 8
  sindex-builder-threads 8
}

network {
  service {
    address any
    port 3000
  }

  heartbeat {
    mode mesh
    port 3002

    mesh-seed-address-port aerospike-1 3002
    mesh-seed-address-port aerospike-2 3002

    interval 150
    timeout 10
  }

  fabric {
    port 3001
  }

  info {
    port 3003
  }
}

#2

@tlo Is this related to:


#3

Hi Ivan,

Thanks for bringing this to our attention. Aerospike periodically queries the network interfaces on your machine to find out whether any IP addresses have changed. For this, it communicates with the Linux kernel via Linux’s netlink mechanism.

It seems that every now and then, the kernel returns an error when Aerospike asks it to enumerate the network interfaces in the system. This is very strange and unexpected and I’d really like to figure out what it is.
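To make the mechanism concrete, here is a minimal sketch of a netlink link enumeration in Python. It is not Aerospike's code (the server does this in C in socket.c, which is not shown in this thread); the function names `build_getlink_request`, `parse_links`, `find_ifname`, and `enumerate_links` are made up for illustration. The constants and struct layouts are the ones from `<linux/netlink.h>` and `<linux/rtnetlink.h>`.

```python
import socket
import struct

# Constants from <linux/netlink.h> and <linux/rtnetlink.h>.
NLMSG_ERROR = 2
NLMSG_DONE = 3
RTM_NEWLINK = 16
RTM_GETLINK = 18
NLM_F_REQUEST = 0x01
NLM_F_DUMP = 0x300  # NLM_F_ROOT | NLM_F_MATCH
IFLA_IFNAME = 3

NLMSGHDR = struct.Struct("=IHHII")    # len, type, flags, seq, pid
IFINFOMSG = struct.Struct("=BxHiII")  # family, pad, type, index, flags, change


def align4(n):
    """Netlink messages and attributes are padded to 4-byte boundaries."""
    return (n + 3) & ~3


def build_getlink_request(seq=1):
    """Build an RTM_GETLINK dump request that asks for all interfaces."""
    body = IFINFOMSG.pack(socket.AF_UNSPEC, 0, 0, 0, 0)
    header = NLMSGHDR.pack(NLMSGHDR.size + len(body), RTM_GETLINK,
                           NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return header + body


def find_ifname(attrs):
    """Walk the rtattr list and return the IFLA_IFNAME value, if present."""
    offset = 0
    while offset + 4 <= len(attrs):
        rta_len, rta_type = struct.unpack_from("=HH", attrs, offset)
        if rta_len < 4:
            break
        if rta_type == IFLA_IFNAME:
            return attrs[offset + 4: offset + rta_len].rstrip(b"\0").decode()
        offset += align4(rta_len)
    return None


def parse_links(data):
    """Parse one receive buffer; return ({index: name}, dump_finished).

    Raises OSError if the kernel answered with a nonzero NLMSG_ERROR.
    """
    links = {}
    offset = 0
    while offset + NLMSGHDR.size <= len(data):
        length, mtype, _flags, _seq, _pid = NLMSGHDR.unpack_from(data, offset)
        if length < NLMSGHDR.size or offset + length > len(data):
            raise ValueError("truncated netlink message")
        payload = data[offset + NLMSGHDR.size: offset + length]
        if mtype == NLMSG_DONE:
            return links, True
        if mtype == NLMSG_ERROR:
            (code,) = struct.unpack_from("=i", payload)
            if code != 0:
                raise OSError(-code, "netlink error")
        elif mtype == RTM_NEWLINK:
            _fam, _type, index, _fl, _ch = IFINFOMSG.unpack_from(payload)
            links[index] = find_ifname(payload[IFINFOMSG.size:])
        offset += align4(length)
    return links, False


def enumerate_links():
    """Ask the kernel for all links (Linux only)."""
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                       socket.NETLINK_ROUTE) as sock:
        sock.bind((0, 0))
        sock.send(build_getlink_request())
        links = {}
        while True:
            chunk, done = parse_links(sock.recv(65536))
            links.update(chunk)
            if done:
                return links
```

Calling `enumerate_links()` inside the container returns a mapping from interface index to interface name. The `NLMSG_ERROR` branch is the case that matters here: it is the kind of reply that would correspond to a "Received netlink error message" warning.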

May I ask for two things?

  1. In order for me to understand your network configuration a little better, can you run “ip link show” inside the container and give us the output of the command?

  2. The kernel may have logged an error message to the kernel log. I’m not familiar with how Docker handles this. Normally, without Docker, you can access the kernel log via the dmesg command or via /var/log/kern.log.

My guess would be, though, that kernel messages would show up in the kernel log outside the container. After all, with Docker, everything shares the same kernel.

Could you thus check /var/log/kern.log on your host (i.e., outside the container) for any messages around the time when the issue last occurred?
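One way to narrow the kernel log down to the relevant window is a small filter like the sketch below. It assumes the log lines carry human-readable timestamps in the `[Fri Jul 14 19:15:18 2017]` form that `dmesg -T` prints; the helper name `kernel_lines_near` and the five-minute default window are arbitrary choices for illustration. Keep in mind that these timestamps may be in the host's local time zone, whereas the Aerospike log is in GMT.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "[Fri Jul 14 19:17:42 2017] SLUB: Unable to ..."
LINE = re.compile(
    r"^\[(?P<ts>[A-Z][a-z]{2} [A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d \d{4})\] (?P<msg>.*)$"
)


def kernel_lines_near(lines, crash_time, window=timedelta(minutes=5)):
    """Return (timestamp, message) pairs logged within `window` of crash_time."""
    hits = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # skip lines without a readable timestamp
        # Collapse the double space used for single-digit days before parsing.
        stamp = " ".join(m.group("ts").split())
        ts = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        if abs(ts - crash_time) <= window:
            hits.append((ts, m.group("msg")))
    return hits
```

For example, the lines could come from `subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout.splitlines()` on the host, with `crash_time` set to the time of the failed assertion in the Aerospike log.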

Thanks for your help, Thomas


#4

Yes! Definitely possible!

Thomas


#5

Hi Thomas,

Thanks for your reply.

root@aerospike-stable-1-x6qwx:/# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth0@if48: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
    link/ether 0a:58:0a:d5:04:74 brd ff:ff:ff:ff:ff:ff link-netnsid 0

dmesg on the host machine shows no relevant records for this time, only failed memory allocations (after the crash), and I cannot get the kernel log from the long-dead Docker container.

[Fri Jul 14 19:15:18 2017] XFS (dm-16): Mounting V4 Filesystem
[Fri Jul 14 19:15:18 2017] XFS (dm-16): Ending clean mount
[Fri Jul 14 19:15:18 2017] XFS (dm-16): Unmounting Filesystem
[Fri Jul 14 19:15:18 2017] XFS (dm-16): Mounting V4 Filesystem
[Fri Jul 14 19:15:18 2017] XFS (dm-16): Ending clean mount
[Fri Jul 14 19:17:42 2017] SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
[Fri Jul 14 19:17:42 2017]   cache: blkdev_ioc(16671:4e070909d7658d4310f1822b2596ed4d17d79b4a325b0878e765b4c0c04ed336), object size: 104, buffer size: 104, default order: 0, min order: 0
[Fri Jul 14 19:17:42 2017]   node 0: slabs: 5, objs: 195, free: 0
[Fri Jul 14 19:17:42 2017]   node 1: slabs: 7, objs: 273, free: 0

Ivan


#6

We could try modifying the source to add logging of the returned error, build from source, and see what gets logged when it happens next time.
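As a rough illustration of what such extra logging could report, the sketch below decodes a raw `NLMSG_ERROR` message into a readable line. It is written in Python only to show the layout (`struct nlmsgerr` is an `int error` followed by the header of the request that failed); the real change would have to be made in C in the server's socket.c, which is not shown here, and `describe_netlink_error` is a hypothetical helper name.

```python
import errno
import os
import struct

NLMSGHDR = struct.Struct("=IHHII")  # len, type, flags, seq, pid
NLMSG_ERROR = 2


def describe_netlink_error(message):
    """Turn a raw NLMSG_ERROR message into a log-friendly string."""
    _length, mtype, _flags, _seq, _pid = NLMSGHDR.unpack_from(message)
    if mtype != NLMSG_ERROR:
        raise ValueError("not an NLMSG_ERROR message")
    # The kernel reports the error as a negative errno value.
    (code,) = struct.unpack_from("=i", message, NLMSGHDR.size)
    # The header of the offending request follows the error code.
    _rlen, req_type, _rflags, req_seq, _rpid = NLMSGHDR.unpack_from(
        message, NLMSGHDR.size + 4)
    err = -code
    name = errno.errorcode.get(err, "UNKNOWN")
    return (f"netlink error {err} ({name}: {os.strerror(err)}) "
            f"for request type {req_type}, seq {req_seq}")
```

Logging the errno and the request type next to the existing warning would show which request the kernel rejected and why.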