aql segmentation fault when receiving a large result set

Hello,

I’m trying to run a rather large query on an index in my sample database. I have 40 million entries and a number of distinct values for the index I’m querying. My namespace is running in a memory-only configuration with no persistence. I’m on Aerospike version 3.3.21 on Red Hat 6.4 (x86_64).
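
For reference, a memory-only namespace of this kind is declared along these lines in aerospike.conf (an illustrative sketch; the size and replication factor are placeholders, not my exact settings):

    namespace test {
        replication-factor 1
        memory-size 8G
        storage-engine memory    # keep data in RAM only, no persistence
    }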

# gdb aql
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6)
(gdb) r
Starting program: /usr/bin/aql 
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff7de5700 (LWP 27497)]
Aerospike Query
Copyright 2013 Aerospike. All rights reserved.

aql> show indexes
+--------+--------------+---------+----------+-------+-------------+------------+--------------+
| ns     | bins         | set     | num_bins | state | indexname   | sync_state | type         |
+--------+--------------+---------+----------+-------+-------------+------------+--------------+
| "test" | "filesize"   | "demo2" | 1        | "RW"  | "fsize2"    | "synced"   | "INT SIGNED" |
| "test" | "originalId" | "demo2" | 1        | "RW"  | "original"  | "synced"   | "TEXT"       |
| "test" | "stamp"      | "demo2" | 1        | "RW"  | "timestamp" | "synced"   | "INT SIGNED" |
+--------+--------------+---------+----------+-------+-------------+------------+--------------+
3 rows in set (0.001 secs)

aql> select stamp from test.demo2  where stamp between 1414885153 and 1420000000;
         <snip large outputs>
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff0083700 (LWP 27535)]
0x0000003c532747fa in _IO_default_xsputn_internal () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install aerospike-tools-3.3.22-1.el6.x86_64
(gdb) bt
#0  0x0000003c532747fa in _IO_default_xsputn_internal () from /lib64/libc.so.6
#1  0x0000003c532443a9 in vfprintf () from /lib64/libc.so.6
#2  0x0000003c53269889 in vsprintf () from /lib64/libc.so.6
#3  0x0000003c5324f538 in sprintf () from /lib64/libc.so.6
#4  0x0000000000472277 in as_integer_val_tostring (v=<value optimized out>) at src/main/aerospike/as_integer.c:86
#5  0x0000000000461c4c in each_bin (name=0x7fffef684aa0 "stamp", val=0x7fffef684ab0, udata=0x82fb90) at src/main/renderer/table.c:149
#6  0x000000000046e758 in as_record_foreach (rec=0x7ffff0082c30, callback=0x461aaa <each_bin>, udata=0x82fb90) at src/main/aerospike/as_record.c:520
#7  0x0000000000461a6a in as_rec_foreach (rec=0x7ffff0082c30, callback=0x461aaa <each_bin>, udata=0x82fb90) at /home/citrusleaf/BUILD/aerospike-client-c/modules/common/src/include/aerospike/as_rec.h:632
#8  0x0000000000462558 in render (val=0x7ffff0082c30, view=0x82fb90) at src/main/renderer/table.c:313
#9  0x000000000046270a in citrusleaf_query_foreach_callback (v=<value optimized out>, udata=<value optimized out>) at src/main/citrusleaf/cl_query.c:1451
#10 0x00000000004632d8 in cl_query_worker_do (node=0x7f03e0, task=<value optimized out>) at src/main/citrusleaf/cl_query.c:940
#11 0x0000000000463360 in cl_query_worker (pv_asc=0x7f0150) at src/main/citrusleaf/cl_query.c:1023
#12 0x0000003c53607851 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003c532e890d in clone () from /lib64/libc.so.6

Is this normal? I have reproduced the same issue with the C client library as well, by customizing an example you ship with the library.

Does this indicate a limit on the number of items I can fetch at any given time, or is it a bug in the data type conversion from integer to string when the val is passed to sprintf in frame 3 (i.e. merely a display problem)?
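
For what it’s worth, I assume the conversion in frame 4 boils down to the usual pattern of formatting an int64 into a small heap buffer, something like the sketch below (illustrative only, not the actual client source):

    #include <inttypes.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative sketch of the kind of integer-to-string conversion in
     * frame 4. A 32-byte buffer comfortably holds any 64-bit value
     * (at most 19 digits plus sign and NUL), so if the real code looks
     * like this, a fault inside sprintf would point at heap or stack
     * corruption elsewhere rather than at the conversion itself. */
    static char* int64_tostring(int64_t value)
    {
        char* str = malloc(32);
        if (str != NULL) {
            sprintf(str, "%" PRId64, value);
        }
        return str;
    }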

Best Regards, Leonidas Tsampros

Hi Leonidas,

Thanks for bringing this to our attention.

I’m sorry to hear that you encountered this crash while running a range query via aql. As you have already observed, the issue is not with the tool but with the client. It looks like potential stack corruption, not just a display bug. I’ve filed a ticket for this problem, and we’ll let you know as soon as a fix is available.

Sincerely, Bhuvana


Hello Bhuvana,

Thanks for letting me know.

Looking forward to a fix.

Best Regards, Leonidas Tsampros

Hi Leonidas,

Which C client example did you tweak, and what was the specific tweak that caused the crash?

Thanks, Bhuvana

Hello Bhuvana,

I modified the example under “examples/query_examples/simple/src”. The modified file is here:

Mainly, I made the following changes (a sketch of the resulting example follows the list):

  1. Removed the parts handling index creation, record insertion, etc.
  2. Removed the cleanup function.
  3. Added as_query_select() to choose which bins to get from each record.
  4. Changed the predicate to integer_range().
  5. Removed the dumping of records from query_callback.
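
In outline, the modified example boils down to something like this (a reconstructed sketch rather than the exact file; the host, error handling, and record counting are placeholders, and as_config_add_host assumes a reasonably recent client):

    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdio.h>

    #include <aerospike/aerospike.h>
    #include <aerospike/aerospike_query.h>
    #include <aerospike/as_query.h>

    /* Per-record callback; the record dumping was removed, so just count.
     * val is NULL once the result stream is complete. */
    static bool query_cb(const as_val* val, void* udata)
    {
        if (val != NULL) {
            (*(uint64_t*)udata)++;
        }
        return true;
    }

    int main(void)
    {
        as_config cfg;
        as_config_init(&cfg);
        as_config_add_host(&cfg, "127.0.0.1", 3000);

        aerospike as;
        aerospike_init(&as, &cfg);

        as_error err;
        if (aerospike_connect(&as, &err) != AEROSPIKE_OK) {
            fprintf(stderr, "connect failed: %s\n", err.message);
            return 1;
        }

        /* Select only the "stamp" bin and apply an integer range predicate,
         * matching the aql query above. */
        as_query query;
        as_query_init(&query, "test", "demo2");
        as_query_select_inita(&query, 1);
        as_query_select(&query, "stamp");
        as_query_where_inita(&query, 1);
        as_query_where(&query, "stamp", as_integer_range(1414885153, 1420000000));

        uint64_t count = 0;
        if (aerospike_query_foreach(&as, &err, NULL, &query, query_cb, &count) != AEROSPIKE_OK) {
            fprintf(stderr, "query failed: %s\n", err.message);
        }
        printf("records received: %" PRIu64 "\n", count);

        as_query_destroy(&query);
        aerospike_close(&as, &err);
        aerospike_destroy(&as);
        return 0;
    }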

I noticed that the crash is a different one, though:

#0  0x000000000040c192 in as_record_bin_forupdate (rec=0x7ffff50d2c90, name=0x7ffff50cd990 "stamp") at src/main/aerospike/as_record.c:101
#1  0x000000000040c5b0 in as_record_set_int64 (rec=<optimized out>, name=0x7ffff50cd990 "stamp", value=1414136640) at src/main/aerospike/as_record.c:239
#2  0x0000000000424ed5 in clbin_to_asrecord (bin=0x7ffff50cd990, r=0x7ffff50d2c90) at src/main/aerospike/_shim.c:298
#3  0x0000000000424f28 in clbins_to_asrecord (bins=0x7ffff50cd990, nbins=<optimized out>, r=0x7ffff50d2c90) at src/main/aerospike/_shim.c:338
#4  0x000000000042019f in cl_query_worker_do (node=0x641390, task=0x7ffff50d2e60) at src/main/citrusleaf/cl_query.c:860
#5  0x0000000000420494 in cl_query_worker (pv_asc=0x641010) at src/main/citrusleaf/cl_query.c:1018
#6  0x00007ffff77e9e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007ffff6ecaccd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#8  0x0000000000000000 in ?? ()

Sorry for the confusion.

Best Regards, Leonidas Tsampros

Thanks for the update. I’ll follow up on this.

By the way, I was able to reproduce the same error using a custom stream UDF.

aql> aggregate query.my_stream_udf() on test.demo2 where stamp between 0 and 1411827150;
[New Thread 0x7ffff1485700 (LWP 7581)]
[New Thread 0x7ffff0a84700 (LWP 7582)]
[New Thread 0x7ffff0083700 (LWP 7583)]
[New Thread 0x7fffef682700 (LWP 7584)]
[New Thread 0x7fffeec81700 (LWP 7585)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff0083700 (LWP 7583)]
0x0000003c5327611c in malloc_consolidate () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003c5327611c in malloc_consolidate () from /lib64/libc.so.6
#1  0x0000003c532797cb in _int_malloc () from /lib64/libc.so.6
#2  0x0000003c5327a911 in malloc () from /lib64/libc.so.6
#3  0x0000000000472214 in as_integer_new (value=278929695) at src/main/aerospike/as_integer.c:59
#4  0x0000000000487de1 in as_unpack_integer (pk=<value optimized out>, val=0x7fffef684130) at src/main/aerospike/as_msgpack.c:407
#5  as_unpack_val (pk=<value optimized out>, val=0x7fffef684130) at src/main/aerospike/as_msgpack.c:520
#6  0x0000000000487fc5 in as_unpack_map (pk=0x7fffef684180, size=2, val=0x7fffef6841d8) at src/main/aerospike/as_msgpack.c:457
#7  0x0000000000472a9e in as_msgpack_serializer_deserialize (s=<value optimized out>, buff=<value optimized out>, v=<value optimized out>) at src/main/aerospike/as_msgpack_serializer.c:146
#8  0x00000000004659bf in as_serializer_deserialize (bin=0x7ffff0081970, r=0x7ffff0082c30) at /home/citrusleaf/BUILD/aerospike-client-c/modules/common/target/Linux-x86_64/include/aerospike/as_serializer.h:87
#9  clbin_to_asrecord (bin=0x7ffff0081970, r=0x7ffff0082c30) at src/main/aerospike/_shim.c:319
#10 0x0000000000465a62 in clbins_to_asrecord (bins=<value optimized out>, nbins=<value optimized out>, r=0x7ffff0082c30) at src/main/aerospike/_shim.c:339
#11 0x0000000000463075 in cl_query_worker_do (node=0x7f03e0, task=<value optimized out>) at src/main/citrusleaf/cl_query.c:865
#12 0x0000000000463360 in cl_query_worker (pv_asc=0x7f0150) at src/main/citrusleaf/cl_query.c:1023
#13 0x0000003c53607851 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003c532e890d in clone () from /lib64/libc.so.6

I think it’s related to the previous segfaults, since this problem only happens when I use large ranges. If I decrease the query range, the issue is not reproducible.
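
For completeness, driving the same aggregation from the C client only changes the query setup from my earlier sketch; roughly (again illustrative, with the module and function names matching the aql call above):

    #include <stddef.h>
    #include <aerospike/as_query.h>

    /* Sketch: build the aggregation variant of the earlier query.
     * Assumes the Lua module "query" exposing my_stream_udf is already
     * registered on the server. as_query_where_init (heap) is used here
     * instead of as_query_where_inita, because the inita macro's
     * stack allocation would not outlive this helper function. */
    static void build_aggregation_query(as_query* query)
    {
        as_query_init(query, "test", "demo2");
        as_query_where_init(query, 1);
        as_query_where(query, "stamp", as_integer_range(0, 1411827150));
        as_query_apply(query, "query", "my_stream_udf", NULL);
        /* Run with aerospike_query_foreach as before; the callback then
           receives the aggregated as_val instead of individual records. */
    }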

Hi, any updates on this?

Hi Leonidas,

I’m sorry about the delay. I’m not able to confirm that we have a fix for this issue, because I have not been able to reproduce it with the exact same backtrace.

Can you let me know the range of the large result set you are parsing? How many entries exist, and how many do you expect to get back from the range query?

In the meantime, we have released a newer C client with a potential fix for crashes when parsing large result sets.

Can you try this and let me know if you still see the crash?

Sincerely, Bhuvana