Any LDT causes server crash (CE 3.4.1.) [Released]


#1

I’m storing a list of thousands of integers per record as an LDT.

Originally tried using LSET and LLIST but both were too complicated with add methods and caused errors when writing multiple items in 1 add operation (could be a problem with the C# driver).

LSTACK works well for adding the collection of integers, but at 5k TPS, inserting a record and then inserting an LSTACK of 1-2k integers for each record almost guarantees a SIGSEGV fault and Aerospike process shutdown. I’ve removed all SSD devices and used pure in-memory namespaces (running on r3.2xlarge on EC2 with 60GB RAM), and this just delays the server crash.

Nov 30 2014 01:23:50 GMT: WARNING (as): (signal.c::150) SIGSEGV received, aborting Aerospike Enterprise Edition build 3.3.21
Nov 30 2014 01:23:50 GMT: WARNING (as): (signal.c::157) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x59) [0x46f52c]
Nov 30 2014 01:23:50 GMT: WARNING (as): (signal.c::157) stacktrace: frame 1: /lib64/libc.so.6(+0x33c60) [0x7f20a9975c60]
Nov 30 2014 01:23:50 GMT: WARNING (as): (signal.c::157) stacktrace: frame 2: /usr/bin/asd(msg_fillbuf+0x17) [0x4fd8a1]
Nov 30 2014 01:23:50 GMT: WARNING (as): (signal.c::157) stacktrace: frame 3: /usr/bin/asd(ldt_record_pickle+0x3fb) [0x462a2d]
Nov 30 2014 01:23:50 GMT: WARNING (as): (signal.c::157) stacktrace: frame 4: /usr/bin/asd(udf_rw_finish+0xfb) [0x4c156b]
Nov 30 2014 01:23:50 GMT: WARNING (as): (signal.c::157) stacktrace: frame 5: /usr/bin/asd(udf_rw_local+0x1b9) [0x4c24d6]
Nov 30 2014 01:23:50 GMT: WARNING (as): (signal.c::157) stacktrace: frame 6: /usr/bin/asd() [0x4ada85]
Nov 30 2014 01:23:50 GMT: WARNING (as): (signal.c::157) stacktrace: frame 7: /usr/bin/asd(as_rw_start+0x24f) [0x4aef05]
Nov 30 2014 01:23:50 GMT: WARNING (as): (signal.c::157) stacktrace: frame 8: /usr/bin/asd(process_transaction+0xd69) [0x4b9523]
Nov 30 2014 01:23:50 GMT: WARNING (as): (signal.c::157) stacktrace: frame 9: /usr/bin/asd(thr_tsvc_process_or_enqueue+0x3e) [0x4b9d79]
Nov 30 2014 01:23:50 GMT: WARNING (as): (signal.c::157) stacktrace: frame 10: /usr/bin/asd(thr_demarshal+0x389) [0x480384]
Nov 30 2014 01:23:50 GMT: WARNING (as): (signal.c::157) stacktrace: frame 11: /lib64/libpthread.so.0(+0x7f18) [0x7f20aa60ef18]
Nov 30 2014 01:23:50 GMT: WARNING (as): (signal.c::157) stacktrace: frame 12: /lib64/libc.so.6(clone+0x6d) [0x7f20a9a24b9d]

Is there any way to fix this crash with LDTs? Serializing the numbers as a string runs into oversize bin/record errors.


#2

Hi,

We have to ask the usual questions: (1) What version of the server are you running? (2) What version of the client (the C# client, I assume) are you running? (3) Can you show us the snippet of code that builds the data object and then does the LDT write call?

In our regular regression tests we write thousands of 100kb objects in the LDTs. So, a serialized list of a few thousand integers should not be the problem.

One more thing to check: are you writing single objects one at a time, or are you writing a LIST of objects with a multi-write call? There is a known bug in the LDT multi-write calls (lstack.push_all(), llist.add_all(), lmap.put_all(), lset.add_all()): they have a problem when the list of values is larger than approximately 1200 items. So, for now, if you’re doing multi-writes, keep each multi-value list under 1000 data elements.
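As a rough illustration of keeping each multi-write under that limit, here is a minimal C# sketch. `MultiWriteBatcher` and `Batch` are hypothetical names made up for this example and are not part of the Aerospike client; the helper only splits a sequence into ordered slices of at most a given size, and each slice would then be handed to one multi-write call.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the Aerospike client): splits a sequence
// into consecutive batches so that each multi-write call stays under a limit.
public static class MultiWriteBatcher
{
    // Yields slices of at most maxBatchSize items, preserving the input order.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int maxBatchSize)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (maxBatchSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxBatchSize));

        var batch = new List<T>(maxBatchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == maxBatchSize)
            {
                yield return batch;
                batch = new List<T>(maxBatchSize);
            }
        }

        // Emit the final, possibly shorter, batch.
        if (batch.Count > 0) yield return batch;
    }
}
```

With a list of 2100 values and a batch size of 1000, this yields batches of 1000, 1000 and 100 items, and each batch would go into its own multi-write call instead of one call carrying the whole list.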


#3

Hey Toby,

Yep, I think I came to that same conclusion with multi-writes.

  1. Enterprise server 3.3.21, 1 cluster across 3 c3.2xlarge instances, 1 cluster on 2 r3.2xlarge instances all on AWS. Use in-memory namespaces with 50GB of space to remove any disk IO issues.

  2. Latest C# driver ver 3.0.10

        // reads a json file and gets 2100 items as a list of integers
        var file = File.ReadAllText(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "test.json"));
        var segments = JObject.Parse(file)["Categories"].ToArray().Select(x => Convert.ToInt32(x["Id"])).Distinct().ToList();
    
        var clientPolicy = new ClientPolicy
        {
            writePolicyDefault = { sendKey = true, sleepBetweenRetries = 1500 },
            queryPolicyDefault = { sleepBetweenRetries = 2000, recordQueueSize = 10000000 }
        };
    
        var aero = new AerospikeClient(clientPolicy, new Host("ec2-54-172-57-252.compute-1.amazonaws.com", 3000));
    
        var queryStatement = new Statement();
        queryStatement.SetNamespace("cookies");
        queryStatement.SetSetName("main");
        queryStatement.SetFilters(Filter.Range("version", 0, 10));
    
        var recordSet = aero.Query(null, queryStatement);
    
        try
        {
            var i = 0;
            while (recordSet.Next())
            {
                var set = aero.GetLargeSet(null, recordSet.Key, "categories", null);
                set.Add(segments.ConvertAll(Value.Get).ToArray());
                i++;
            }
    
            Console.WriteLine("{0} total", i);
        }
        catch (Exception e)
        {
            // log the failure instead of silently swallowing it
            Console.WriteLine(e);
        }
        finally
        {
            recordSet.Close();
        }
    

I ran this as a simple console process on another plain server in the same availability zone, averaging about 4k write TPS. The set holds about 10 million records, each with just 2 small bins and the user key. None of the records had any LDT bins before the test loop started.

LSET would fail pretty quick with ASD shutdown with either multi-write, single write in a loop or just small multi-writes (10 items). Biggest issue was unique values (needs an update method for sets to avoid having to do an exists lookup). LLIST did better but still crashed. LSTACK was the quickest since I could just push values but again eventually failed within 10 mins with ASD crash. Crashes usually took down more than 1 server which led to a cluster-wide loss.

I’m still not sure why the ASD process had problems, but for now I solved the issue by chunking the list into smaller 1000-item lists and saving them as simple serialized strings in multiple bins on the top record. Write block size is 128k and there’s nothing else in these records, so there’s plenty of space.
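A minimal sketch of that chunk-and-serialize workaround, under assumptions of my own: the post does not say what the serialized format or bin names were, so the comma-separated encoding and the `seg0`, `seg1`, ... bin names here are hypothetical, as are the `ChunkedBins`, `Encode` and `Decode` names. The code only builds and parses the bin name/value pairs; writing them to the record is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: the bin naming ("seg0", "seg1", ...) and the
// comma-separated format are assumptions, not taken from the original post.
public static class ChunkedBins
{
    // Splits ids into chunks of chunkSize and serializes each chunk as one string.
    public static Dictionary<string, string> Encode(
        IReadOnlyList<int> ids, int chunkSize, string binPrefix = "seg")
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        var bins = new Dictionary<string, string>();
        for (int offset = 0, n = 0; offset < ids.Count; offset += chunkSize, n++)
        {
            // One bin per chunk: "seg0" holds the first chunkSize ids, and so on.
            bins[binPrefix + n] = string.Join(",", ids.Skip(offset).Take(chunkSize));
        }
        return bins;
    }

    // Reads bins back in order (seg0, seg1, ...) until a bin name is missing.
    public static List<int> Decode(IDictionary<string, string> bins, string binPrefix = "seg")
    {
        var ids = new List<int>();
        string text;
        for (int n = 0; bins.TryGetValue(binPrefix + n, out text); n++)
        {
            if (text.Length == 0) continue;
            ids.AddRange(text.Split(',').Select(s => int.Parse(s)));
        }
        return ids;
    }
}
```

For a 2100-item list and a chunk size of 1000, `Encode` produces three bins, and `Decode` returns the original list in the same order. Each name/value pair would then be written as an ordinary string bin on the record.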

I think the main problem comes down to the C# driver: even a multi-write with the serialized string bins hit byte array overflow problems when sending the command, and I had to fall back to doing individual puts for each bin. I’ll take another look at the driver code and do another pull request with anything I find.


#4

Hi. I’m back from being away on personal business – so let’s pick up where we left off.

What’s your current state? Still having problems? If so, we may need to address this directly via Aerospike Support. If it’s just a question on usage, we can address it in the forum. If it’s an actual crash you’re experiencing, then let’s address it via support.

Toby


#5

Hey Toby, thanks for following up.

Solved for now, it was a 2 part issue:

  1. The LDT types were being created in compact mode which failed when inserting lots of items as the first save. I tried using the custom user modules but this seems fixed with 3.3.26 so I just upgraded the servers.

  2. Driver still fails with large multi-operations. Not a big issue since this can just be broken out but I’ll try and find some time to dig through the C# driver and get more details on this.


#6

The reported issue and many other bugs have been fixed in release 3.4.1. Please try it out and let us know if it helps.

– R


#7

Thanks, rolling out 3.4.1 for our cluster but we stopped using LDT and just serialize our data for now.

There’s another feature we’re working on that will use LDTs for millions of objects so that should be a good test to see if everything is fixed, I’ll report back when we get to that.


#8

@manigandham,

Thank you for posting about LDTs in our forum. Please see the LDT Feature Guide for current LDT recommendations and best practices.


#9

@manigandham,

Effective immediately, we will no longer actively support the LDT feature and will eventually remove the API. The exact deprecation and removal timeline will depend on customer and community requirements. Instead of LDTs, we advise that you use our newer List and SortedMap APIs, which are now available in all Aerospike-supported clients at the General Availability level. Read our blog post for details.