GLibC Memory Corruption


#1

Synopsis

Aerospike 2.x servers running a GLibC version prior to 2.12-1.149 may crash due to memory corruption.

Solution:

We recommend upgrading to the latest kernel for your distribution and to GLibC version 2.12-1.149 or later.

Background:

Linux servers using GLibC versions 2.12 and later may encounter an issue where a node goes down and the file /var/log/aerospike-console.[SERVER PID] contains the error *** glibc detected *** /usr/bin/cld: invalid fastbin entry (free): [ADDRESS] ***.

This may be due to a bug in GLibC. The Red Hat advisories provide additional details. See https://rhn.redhat.com/errata/RHBA-2014-0480.html and https://rhn.redhat.com/errata/RHSA-2014-1391.html.
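To check a console log for the error quoted above, something like the following Python sketch could be used. The function name `find_fastbin_errors` is hypothetical, and the pattern assumes the log line matches the message format shown in this article exactly; adjust it if your logs differ.

```python
import re

# Pattern for the abort message quoted in this article. The program path and
# the address are captured; nothing is assumed about the address format
# beyond "contains no whitespace".
FASTBIN_RE = re.compile(
    r"\*\*\* glibc detected \*\*\* (\S+): invalid fastbin entry \(free\): (\S+) \*\*\*"
)


def find_fastbin_errors(lines):
    """Return a (program, address) pair for every line containing the error."""
    hits = []
    for line in lines:
        match = FASTBIN_RE.search(line)
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits
```

Passing an open file object works, since the function only iterates over lines: `find_fastbin_errors(open(path, errors="replace"))`, where `path` is the console log of the node that went down.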

If you encounter this error, first determine if you are using an affected version of GLibC:

rpm -qa | grep glibc

If your GLibC version is 2.12 and its release suffix is lower than 1.149, your server may be affected by this issue.
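The version rule above can be sketched in Python for use across many hosts. This is a minimal sketch, not a vendor tool: the function names are hypothetical, and the package-name layout (`glibc-<version>-<release>...`) is an assumption based on typical `rpm -qa` output. It only applies the 2.12 / 1.149 rule stated in this article, so a `False` result for any other GLibC version means "not flagged by this rule", not "known safe".

```python
import re

# Thresholds taken from the article: GLibC 2.12 with release 1.149 or later.
AFFECTED_VERSION = (2, 12)
FIXED_RELEASE = (1, 149)

# Matches the base package only, e.g. "glibc-2.12-1.132.el6.x86_64".
# Sub-packages such as "glibc-common-..." do not match.
_PKG_RE = re.compile(r"^glibc-(\d+)\.(\d+)(?:\.\d+)*-(\d+)(?:\.(\d+))?")


def parse_glibc_package(line):
    """Return ((major, minor), (rel_a, rel_b)), or None if the line is not a base glibc package."""
    match = _PKG_RE.match(line.strip())
    if match is None:
        return None
    major, minor, rel_a, rel_b = match.groups()
    return (int(major), int(minor)), (int(rel_a), int(rel_b or 0))


def may_be_affected(line):
    """Apply the article's rule: version 2.12 with a release lower than 1.149."""
    parsed = parse_glibc_package(line)
    if parsed is None:
        raise ValueError(f"not a base glibc package line: {line!r}")
    version, release = parsed
    # Tuples compare numerically, so 1.80 sorts below 1.149.
    return version == AFFECTED_VERSION and release < FIXED_RELEASE
```

Feed it the base `glibc-...` line from the `rpm -qa | grep glibc` output; other matching packages (for example `glibc-common`) are rejected by `parse_glibc_package` and can be skipped.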

If so, apply the security fix as described by your vendor. For Red Hat systems, see: https://rhn.redhat.com/errata/RHSA-2014-1391.html

For other systems, please contact your Linux vendor to determine if there is an update available to fix the issue specified in https://rhn.redhat.com/errata/RHBA-2014-0480.html.


#2

To be broader:

If you are running CentOS 6 prior to 6.6, you are exposed to this bug. Extreme load on large multicore machines can lead to infrequent crashes.

Aerospike 3.x is not exposed, because that code line uses JEMalloc. In recent 2.x releases, you can also switch to JEMalloc (contact support for the JEMalloc version we recommend). Our switch to JEMalloc was partially driven by the desire to test against a stable, known memory allocator, instead of “whatever version the distro has”. JEMalloc is also better suited to Aerospike’s usage patterns: it is faster and fragments less, and the use of JEMalloc arenas has shown great benefits.

To be more complex:

This has been a long-standing bug in GLibC. It was first reported in January 2013 against GLibC 2.15 and patched in December 2013. That patch then had to ripple up and down the distributions, which took well into 2014. It seems the patch has now worked its way through all distributions. GLibC 2.19 and later are certainly free of this bug.

The bug was originally reported in Fedora 19, and there are repro recipes in various Ubuntu distros, so tracking down whether you are exposed might take some time. Generally, if you see the “invalid fastbin” error, it’s time to get up to date on your distro (you don’t need to switch to a more recent distro, just be fully patched).

Reading the GLibC bug report at sourceware.org instead of Red Hat’s can be instructive, as the Sourceware bug list is the “upstream” that Red Hat pulls from. If you are checking whether other distros have applied the patch, they should reference GLibC bug number 15073: https://sourceware.org/bugzilla/show_bug.cgi?id=15073

For those who write system code in C, it may be of further interest that we were put on the right track by the “invalid fastbin entry” line. That is a message that a naive double-free error won’t generate. Only a very particular hit to memory from application code would cause that error, and when we saw it on four servers in exactly the same way, we realized the symptom was more likely a GLibC problem than an application (Aerospike) problem.

Further complicating our root cause analysis, this GLibC corruption raises a SIGINT or a SIGTERM (I forget which). Just about any other signal would have been better, because both SIGINT and SIGTERM are generally user-generated signals. Because developers were confused when they pressed Ctrl-C on Aerospike in foreground mode and then saw a “callback stack” (why does my server crash when I Ctrl-C?), we had disabled logging callback stacks on these signals. No longer.

Don’t blame the GLibC devs, though. The Linux/Unix/SysV/etc. signal mechanism is long in the tooth and very hoary, and there isn’t a better signal to use. You can’t really use SIGSEGV because people patch that, and SIGFPE and SIGILL are very specific. You might argue for SIGSYS, but I’ve never tried the POSIX.1-2001 signals, and I wouldn’t expect a broadly ported library like GLibC to use more modern signals.