Segmentation fault in Aerospike Server while deleting record using BatchWrite

Hi

We recently upgraded from Aerospike Community Edition version 5 to Enterprise Edition version 6.1.0 for batch write support. While using it, we encountered a segmentation fault on the Aerospike server. On investigation, we found that we get the segmentation fault when we delete a record using new BatchWrite(new Key(namespace, set, key), Operation.array(Operation.delete())), but not when we delete it using new BatchDelete(new Key(namespace, set, key)).

Can someone please explain the difference between these two and why we get this error when deleting a record using BatchWrite?
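To make the comparison concrete, this is roughly how we build the two batch records (namespace, set, and key values are placeholders):

import com.aerospike.client.BatchDelete;
import com.aerospike.client.BatchRecord;
import com.aerospike.client.BatchWrite;
import com.aerospike.client.Key;
import com.aerospike.client.Operation;

public class DeleteVariants {
    // Variant that crashes the server for us: record delete expressed as an
    // operation inside a BatchWrite.
    static BatchRecord deleteViaBatchWrite(String namespace, String set, String key) {
        return new BatchWrite(new Key(namespace, set, key),
                Operation.array(Operation.delete()));
    }

    // Variant that works fine: a dedicated BatchDelete record.
    static BatchRecord deleteViaBatchDelete(String namespace, String set, String key) {
        return new BatchDelete(new Key(namespace, set, key));
    }
}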

Below are the Aerospike logs:

Feb 27 2023 07:49:35 GMT: WARNING (flat): (flat.c:400) extra rblocks follow flat bin
Feb 27 2023 07:49:35 GMT: WARNING (record): (record.c:493) {replicator} record replace: failed unpickle bin 3309a6d6d12e49995f0184c1b674498c44d0bf18
Feb 27 2023 07:49:36 GMT: WARNING (as): (signal.c:191) SIGSEGV received, aborting Aerospike Enterprise Edition build 6.2.0.3 os debian10
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:621) stacktrace: registers: rax 00007f542800c600 rbx fffffffffffffe60 rcx cf3cf3cf3cf3cf3d rdx 000000000000000c rsi 00007f54c96937c0 rdi 00007f54c9694a20 rbp 00007f54c49031c8 rsp 00007f54773f6480 r8 00007f542820bab0 r9 000055df2073b360 r10 000000000000003a r11 00007f5428200000 r12 000000000000000d r13 00007f54c96937c0 r14 00007f54c9694a20 r15 00007f54c96949c8 rip 000055df1529cbfb
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:633) stacktrace: found 21 frames: 0x300c79 0x103056 0x7f54cae2c730 0x32dbfb 0x3318fb 0x33478b 0x35ea6a 0x329e71 0x2ef081 0xcbfde 0x1a50e2 0x1a5ecb 0x1a622a 0x1a743c 0x1a7e48 0x1191bc 0xffcac 0x2f19f4 0x2f1808 0x7f54cae21fa3 0x7f54ca76f4cf offset 0x55df14f6f000
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 0: /usr/bin/asd(cf_log_stack_trace+0xe8) [0x55df1526fc79]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 1: /usr/bin/asd(as_sig_handle_segv+0x21) [0x55df15072056]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 2: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7f54cae2c730]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 3: /usr/bin/asd(+0x32dbfb) [0x55df1529cbfb]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 4: /usr/bin/asd(+0x3318fb) [0x55df152a08fb]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 5: /usr/bin/asd(+0x33478b) [0x55df152a378b]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 6: /usr/bin/asd(+0x35ea6a) [0x55df152cda6a]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 7: /usr/bin/asd(jem_mallocx+0xcf1) [0x55df15298e71]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 8: /usr/bin/asd(cf_alloc_malloc_arena+0x67) [0x55df1525e081]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 9: /usr/bin/asd(as_bin_particle_alloc_from_client+0x12c) [0x55df1503afde]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 10: /usr/bin/asd(write_master_bin_ops_loop+0x481) [0x55df151140e2]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 11: /usr/bin/asd(write_master_bin_ops+0xc9) [0x55df15114ecb]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 12: /usr/bin/asd(write_master_dim_single_bin+0x109) [0x55df1511522a]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 13: /usr/bin/asd(write_master+0x28b) [0x55df1511643c]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 14: /usr/bin/asd(as_write_start+0xe3) [0x55df15116e48]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 15: /usr/bin/asd(as_tsvc_process_transaction+0x58c) [0x55df150881bc]
Feb 27 2023 07:49:36 GMT: WARNING (as): (log.c:644) stacktrace: frame 16: /usr/bin/asd(+0xffcac) [0x55df1506ecac]

Could you run the following:

addr2line -fie /usr/bin/asd 0x300c79 0x103056 0x7f54cae2c730 0x32dbfb 0x3318fb 0x33478b 0x35ea6a 0x329e71 0x2ef081 0xcbfde 0x1a50e2 0x1a5ecb 0x1a622a 0x1a743c 0x1a7e48 0x1191bc 0xffcac 0x2f19f4 0x2f1808 0x7f54cae21fa3 0x7f54ca76f4cf

I recommend that you open a case with Aerospike Enterprise Support … https://support.aerospike.com/s/

We did try to quickly reproduce this and could not; we will add details once the case is opened. We are going to ask for your help in finding a minimal reproducible scenario.

Please find the output of the command you mentioned:

cf_log_stack_trace /work/source/cf/src/log.c:626 (discriminator 3)
reraise_signal /work/source/as/src/base/signal.c:94 (discriminator 3)
as_sig_handle_segv /work/source/as/src/base/signal.c:195 (discriminator 3)
?? ??:0
arena_run_heap_remove_first /work/source/modules/jemalloc/src/arena.c:114 (discriminator 146)
arena_bin_nonfull_run_tryget /work/source/modules/jemalloc/src/arena.c:2352
arena_bin_nonfull_run_get /work/source/modules/jemalloc/src/arena.c:2368
arena_bin_malloc_hard /work/source/modules/jemalloc/src/arena.c:2421
je_arena_tcache_fill_small /work/source/modules/jemalloc/src/arena.c:2481
je_tcache_alloc_easy /work/source/modules/jemalloc/include/jemalloc/internal/tcache.h:260
je_tcache_alloc_small_hard /work/source/modules/jemalloc/src/tcache.c:86
je_tcache_alloc_small /work/source/modules/jemalloc/include/jemalloc/internal/tcache.h:303
je_arena_malloc /work/source/modules/jemalloc/include/jemalloc/internal/arena.h:1346
je_iallocztm /work/source/modules/jemalloc/include/jemalloc/internal/jemalloc_internal.h:1067
imallocx_flags /work/source/modules/jemalloc/src/jemalloc.c:2157
imallocx_no_prof /work/source/modules/jemalloc/src/jemalloc.c:2228
imallocx_body /work/source/modules/jemalloc/src/jemalloc.c:2271
jem_mallocx /work/source/modules/jemalloc/src/jemalloc.c:2286
do_mallocx /work/source/cf/src/alloc.c:946
cf_alloc_malloc_arena /work/source/cf/src/alloc.c:979
as_bin_particle_alloc_from_client /work/source/as/src/base/particle.c:512
write_master_bin_ops_loop /work/source/as/src/transaction/write.c:1874
write_master_bin_ops /work/source/as/src/transaction/write.c:1758 (discriminator 4)
write_master_dim_single_bin /work/source/as/src/transaction/write.c:1281
write_master /work/source/as/src/transaction/write.c:846
as_write_start /work/source/as/src/transaction/write.c:302
as_tsvc_process_transaction /work/source/as/src/base/thr_tsvc.c:291
run_service /work/source/as/src/base/service.c:625 (discriminator 2)
run_pool /work/source/cf/src/cf_thread.c:336 (discriminator 1)
pool_shim_fn /work/source/cf/src/cf_thread.c:378
?? ??:0
?? ??:0

Also, we are using Aerospike Enterprise version 6.2.0.3, not 6.1.0, although we tried it on 6.1.0 and got the same error.

We will be creating a case for this soon. Meanwhile, if you want to reproduce this scenario, please use the code below:

import com.aerospike.client.*;
import com.aerospike.client.policy.BatchPolicy;
import com.aerospike.client.policy.ClientPolicy;
import com.aerospike.client.policy.CommitLevel;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class AerospikeTest {
    public static void main(String[] args) {
        ClientPolicy clientPolicy = new ClientPolicy();
        clientPolicy.maxConnsPerNode = 20;
        clientPolicy.writePolicyDefault.commitLevel = CommitLevel.COMMIT_MASTER;
        clientPolicy.batchWritePolicyDefault.commitLevel = CommitLevel.COMMIT_MASTER;
        clientPolicy.writePolicyDefault.setTimeouts(5000, 5000);
        clientPolicy.writePolicyDefault.connectTimeout = 5000;

        AerospikeClient client = new AerospikeClient(clientPolicy, new Host("<host>", 3000));

        String namespace = "test_namespace";
        String set = "test_set";
        List<BatchRecord> batchRecords = new ArrayList<>();

        // For each random key, delete the record and then write it back,
        // both expressed as BatchWrite operations.
        for (int i = 0; i < 500; i++) {
            Key key = new Key(namespace, set, rand());
            batchRecords.add(new BatchWrite(key, Operation.array(Operation.delete())));
            batchRecords.add(new BatchWrite(key, Operation.array(Operation.put(new Bin("", Value.get(rand()))))));
        }

        BatchPolicy policy = client.batchParentPolicyWriteDefault;
        policy.allowInlineSSD = true;
        policy.respondAllKeys = false;

        try {
            // Repeat the batch to increase the chance of hitting the crash.
            for (int i = 0; i < 100; i++) {
                client.operate(policy, batchRecords);
            }
        }
        catch (Exception e) {
            System.out.println(e);
        }
        client.close();
    }

    public static String rand() {
        byte[] array = new byte[7]; // length is bounded by 7 random bytes
        new Random().nextBytes(array);
        return new String(array, Charset.forName("UTF-8"));
    }
}

Hello Sunil,

Thanks for the code to reproduce. It took a while, but we eventually reproduced it. (Note - if you configure “debug-allocations true”, then it happens every time.)

It is a double free. It has nothing to do with batch writes in particular; any write that includes the delete-all-bins operation will double free the bin data. It is also only an issue if you have configured “single-bin true” and “data-in-memory true”, which of course you have. (Note: in 6.4, two releases from now, the single-bin configuration will no longer be supported, and you should consider moving away from it.)
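For example, if our understanding is right, even a single-record operate call that includes the delete-all-bins operation should exercise the same path on a single-bin, data-in-memory namespace. This is only an illustrative sketch (host, namespace, set, and key names are placeholders, not from the original report):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Operation;

public class SingleRecordDeleteOp {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("<host>", 3000);
        Key key = new Key("test_namespace", "test_set", "some-key");

        // Seed a record (empty bin name, as required for single-bin namespaces),
        // then delete it via the delete-all-bins operation outside of any batch.
        client.put(null, key, new Bin("", "value"));
        client.operate(null, key, Operation.delete());

        client.close();
    }
}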

We will fix this today, and it will be available in a hotfix next week.

Separately, I am a bit worried by the log lines shown in your first report:

Feb 27 2023 07:49:35 GMT: WARNING (flat): (flat.c:400) extra rblocks follow flat bin
Feb 27 2023 07:49:35 GMT: WARNING (record): (record.c:493) {replicator} record replace: failed unpickle bin 3309a6d6d12e49995f0184c1b674498c44d0bf18

The bug we are fixing does not explain these, as far as we can tell. One thing that might explain them is if you are running a cluster in which this node is “single-bin” but some other node is not.

Anyhow, we will definitely fix the crash, whether or not we can explain the above warnings.

Andy

Hello Andy

Thanks for responding quickly. One thing I still don’t understand is why deleting records with the BatchDelete function doesn’t cause this issue. Isn’t BatchDelete also a delete-all-bins operation?

BatchDelete is not the same thing. It generates a record delete transaction that goes down a different code path, as opposed to a record delete operation that can be included within an atomic set of operations. If all you want to do is delete a record, BatchDelete is fine. If you want to mix deleting the record (atomically) with other operations, like reading it and then deleting it, then you would use the delete-record operation together with other operations. You can of course use it as an operation by itself, and it will do the same thing as BatchDelete, but via a different code path, which is the one that had the bug.
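As an illustration (a sketch only; the key and bin names are placeholders), the two usages would look roughly like this:

import com.aerospike.client.BatchDelete;
import com.aerospike.client.BatchRecord;
import com.aerospike.client.BatchWrite;
import com.aerospike.client.Key;
import com.aerospike.client.Operation;

public class BatchDeleteVsDeleteOperation {
    public static void main(String[] args) {
        Key key = new Key("test_namespace", "test_set", "some-key");

        // Plain record delete: BatchDelete takes the record-delete transaction path.
        BatchRecord simpleDelete = new BatchDelete(key);

        // Atomic read-then-delete on one record: the delete-record operation
        // combined with a read inside a BatchWrite (the path that had the bug).
        BatchRecord readThenDelete = new BatchWrite(key, Operation.array(
                Operation.get(),     // read the record
                Operation.delete()   // then delete it, atomically with the read
        ));
    }
}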
