CRITICAL (migrate): (migrate.c:migrate_tree_reduce:1872) malloc


#1

Server version: 3.5.8

Hey guys,

We are currently running a 10-node cluster on Amazon EC2 i2.2xlarge instances. Recently we’ve been getting a malloc error on random nodes, which causes them to shut down. This is the error:

Apr 16 2016 00:38:22 GMT: CRITICAL (migrate): (migrate.c:migrate_tree_reduce:1872) malloc

Apr 16 2016 00:38:22 GMT: WARNING (as): (signal.c::93) SIGABRT received, aborting Aerospike Community Edition build 3.5.8

Apr 16 2016 00:38:22 GMT: WARNING (as): (signal.c::95) stacktrace: found 11 frames

Apr 16 2016 00:38:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_abort+0x59) [0x46d6c5]

Apr 16 2016 00:38:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 1: /lib64/libc.so.6(+0x35650) [0x7fb3b89b0650]

Apr 16 2016 00:38:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 2: /lib64/libc.so.6(gsignal+0x37) [0x7fb3b89b05d7]

Apr 16 2016 00:38:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 3: /lib64/libc.so.6(abort+0x148) [0x7fb3b89b1cc8]

Apr 16 2016 00:38:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 4: /usr/bin/asd(cf_fault_event+0x271) [0x4ff298]

Apr 16 2016 00:38:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 5: /usr/bin/asd(migrate_tree_reduce+0x4f0) [0x4d9581]

Apr 16 2016 00:38:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 6: /usr/bin/asd() [0x459654]

Apr 16 2016 00:38:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 7: /usr/bin/asd(as_migrate_tree+0x9d) [0x4d9620]

Apr 16 2016 00:38:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 8: /usr/bin/asd(migrate_xmit_fn+0x903) [0x4da646]

Apr 16 2016 00:38:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 9: /lib64/libpthread.so.0(+0x7df5) [0x7fb3b987edf5]

Apr 16 2016 00:38:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 10: /lib64/libc.so.6(clone+0x6d) [0x7fb3b8a71bfd]

It had been working very well for a couple of months, so we are not sure why this started happening. Do you guys have any insight into this issue?

Thanks, Phu


#2

It appears malloc failed to allocate, which may indicate the node was out of memory. Before 3.7.5, migration would load an entire partition into memory before shipping it; see AER-4667 in the release notes. A handful of other memory-related issues have also been fixed since your version.

This may be related to the thread “Cannot allocate memory but there is still memory available”.


#3

Hey kporter,

Thanks for your reply!

I’m monitoring memory usage, but I never see it get close to 55G, which is the disk-space we specified for our namespace. Usually it’s about 27~30GB, and then the error suddenly occurs. Should I see memory go up while the migration is happening?

Thanks, Phu


#4

The “CRITICAL (migrate): (migrate.c:migrate_tree_reduce:1872) malloc” message is logged only when malloc fails. Low memory is a likely culprit but not the only possibility. It could also be caused by memory fragmentation or corruption.

Can you describe your memory monitoring?


#5

Hey kporter,

I monitor using the free -g command. The issue didn’t happen over the weekend, and the migrations are almost complete. Do you think this could be some network issue?

Thanks, Phu


#6

I wouldn’t be able to determine that from this output. I would suspect memory fragmentation, since these are rather large allocations; the size of this allocation was partly why we moved away from this model in 3.7.5.
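
For anyone wanting finer-grained monitoring than free -g (which reports whole gibibytes, so a short spike during migration can be rounded away or fall between manual checks), here is a minimal sketch that samples /proc/meminfo and prints selected fields in MiB. It assumes a Linux host; the function names, the choice of fields, and the sampling interval are illustrative assumptions, not anything from Aerospike. Committed_AS and CommitLimit are included only because they may help when malloc fails while memory still looks free; whether they matter here depends on the host's overcommit settings.

```python
import time

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are in KiB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info

def summarize(info):
    """Return selected KiB fields converted to MiB; None if a field is absent."""
    fields = ("MemTotal", "MemFree", "MemAvailable", "Committed_AS", "CommitLimit")
    return {f: (info[f] // 1024 if f in info else None) for f in fields}

def sample(path="/proc/meminfo", interval_s=5, count=3):
    """Print a timestamped summary `count` times, `interval_s` seconds apart."""
    for _ in range(count):
        with open(path) as fh:
            print(time.strftime("%H:%M:%S"), summarize(parse_meminfo(fh.read())))
        time.sleep(interval_s)
```

Calling sample() on a node while migrations are running, with output redirected to a file, would give a timeline to compare against the timestamp of the CRITICAL log line.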