8X slower with LDT Bin


#1

from old forum Post by Hanson » Sat Aug 02, 2014 7:37 am

Aerospike 3.3.8 with In-Memory + Disk Persistence, on HP G6 server (24 CPU-threads, 72GB memory). Running local C Client with 4 threads, with/without LDT Bin (using lmap).

Inserting a record without an LDT Bin has an average latency of 0.05ms and can reach 80K TPS. Inserting a record with an LDT Bin has an average latency of 0.4ms and can only reach 10K TPS, which is 8X slower.
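The latency and TPS figures are consistent with each other if each client thread issues one synchronous request at a time, so that throughput is roughly the thread count divided by the average latency. A minimal sketch of that arithmetic follows; the function name is hypothetical, and it assumes the client is purely latency-bound:

```c
/* Hypothetical helper: estimated transactions per second for a client whose
 * threads each issue one synchronous request at a time.
 * Assumption: throughput is limited only by the request latency. */
double estimated_tps(int client_threads, double avg_latency_ms)
{
    double latency_s = avg_latency_ms / 1000.0;  /* milliseconds to seconds */
    return client_threads / latency_s;
}
```

With 4 threads this gives about 80K TPS at 0.05ms and about 10K TPS at 0.4ms, in line with the measurements.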

“top -H” shows CPU usage on server side (process asd) is much higher with LDT Bin:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28646 root      20   0 24.7g  20g 1592 R 88.9 29.6  40:22.83 asd
28647 root      20   0 24.7g  20g 1592 R 88.9 29.6  40:56.04 asd
28648 root      20   0 24.7g  20g 1592 R 88.5 29.6  40:52.90 asd
28644 root      20   0 24.7g  20g 1592 R 85.9 29.6  40:16.49 asd

Increasing the Client threads to 8 only reaches 12K TPS, so throughput does not scale linearly with the number of Client threads. “top -H” shows CPU usage on the server side is nearly 100% for each asd thread:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28648 root      20   0 24.8g  21g 1592 R 100 29.8  41:16.12 asd
28647 root      20   0 24.8g  21g 1592 R 100 29.8  41:19.23 asd
28644 root      20   0 24.8g  21g 1592 R 99.8 29.8  40:39.59 asd
28646 root      20   0 24.8g  21g 1592 R 99.8 29.8  40:46.02 asd

Oprofile result for Aerospike server with LDT Bin:

samples  %        image name               app name                 symbol name
1007984  18.1771  no-vmlinux               no-vmlinux               /no-vmlinux
937016   16.8973  asd                      asd                      propagatemark
882239   15.9095  asd                      asd                      sweeplist
421127    7.5942  asd                      asd                      singlestep
189619    3.4194  asd                      asd                      reallymarkobject
108067    1.9488  asd                      asd                      luaV_execute
106089    1.9131  libc-2.12.so             libc-2.12.so             __GI_memset
99792     1.7996  asd                      asd                      shash_reduce
83975     1.5143  libc-2.12.so             libc-2.12.so             vfprintf
76371     1.3772  libc-2.12.so             libc-2.12.so             _itoa_word
73682     1.3287  asd                      asd                      luaS_newlstr
72484     1.3071  asd                      asd                      luaC_fullgc
59107     1.0659  asd                      asd                      free
53147     0.9584  asd                      asd                      luaH_get
52318     0.9435  asd                      asd                      shash_create
50327     0.9076  asd                      asd                      luaD_precall
45976     0.8291  libpthread-2.12.so       libpthread-2.12.so       pthread_mutex_lock
41990     0.7572  asd                      asd                      malloc
40595     0.7321  asd                      asd                      as_val_val_destroy
39936     0.7202  asd                      asd                      as_index_get_vlock
32574     0.5874  libc-2.12.so             libc-2.12.so             _IO_default_xsputn
30221     0.5450  asd                      asd                      shash_destroy
29424     0.5306  asd                      asd                      as_index_reduce_traverse
27198     0.4905  libpthread-2.12.so       libpthread-2.12.so       pthread_mutex_unlock
21553     0.3887  libcrypto.so.1.0.1e      libcrypto.so.1.0.1e      /usr/lib64/libcrypto.so.1.0.1e
20476     0.3692  asd                      asd                      luaV_gettable
20299     0.3661  libc-2.12.so             libc-2.12.so             memcpy
19291     0.3479  libc-2.12.so             libc-2.12.so             __strlen_sse2
19018     0.3430  asd                      asd                      luaH_getstr
17028     0.3071  asd                      asd                      index2adr
14524     0.2619  asd                      asd                      shash_get
14016     0.2528  oprofiled                oprofiled                /usr/bin/oprofiled
13964     0.2518  asd                      asd                      as_pack_val
13727     0.2475  libc-2.12.so             libc-2.12.so             __strftime_internal
13611     0.2454  libc-2.12.so             libc-2.12.so             __strlen_sse42
13500     0.2434  asd                      asd                      cf_fault_event
12279     0.2214  asd                      asd                      luaC_separateudata
11363     0.2049  libc-2.12.so             libc-2.12.so             strchrnul
10090     0.1820  asd                      asd                      as_record_done
9658      0.1742  asd                      asd                      realloc
8995      0.1622  asd                      asd                      luaC_step
8955      0.1615  asd                      asd                      luaD_poscall
8930      0.1610  asd                      asd                      mod_lua_toval
8652      0.1560  libc-2.12.so             libc-2.12.so             __strncpy_ssse3
8522      0.1537  asd                      asd                      cf_malloc_at
8307      0.1498  asd                      asd                      as_index_get_insert_vlock
8283      0.1494  asd                      asd                      cf_vmapx_hash_fn
8109      0.1462  asd                      asd                      as_string_len
7314      0.1319  asd                      asd                      GCTM
7145      0.1288  asd                      asd                      cf_free_at
7002      0.1263  asd                      asd                      lua_getfield
6998      0.1262  dbtest                   dbtest                   do_the_full_monte
6977      0.1258  libc-2.12.so             libc-2.12.so             _int_malloc
6611      0.1192  asd                      asd                      udf_record_param_check.clone.0
6437      0.1161  asd                      asd                      find_sync_copy
6411      0.1156  asd                      asd                      lua_getmetatable
6305      0.1137  asd                      asd                      luaL_checkudata
6133      0.1106  asd                      asd                      linear_histogram_insert_data_point
6028      0.1087  asd                      asd                      luaT_gettmbyobj
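Several of the asd symbols in this profile (propagatemark, sweeplist, singlestep, reallymarkobject, luaC_fullgc, luaC_separateudata, luaC_step, GCTM) carry the names of garbage-collector functions in Lua 5.1. A small sketch that adds up their shares from the table above; the grouping is an interpretation of the symbol names, assuming asd embeds a stock Lua, and is not something the profiler reports:

```c
#include <stddef.h>

/* One row of the profile: symbol name and its share of samples in percent. */
struct profile_entry {
    const char *symbol;
    double percent;
};

/* Percentages copied from the oprofile table above.
 * Assumption: these symbols are the Lua 5.1 garbage-collector functions. */
static const struct profile_entry lua_gc_entries[] = {
    { "propagatemark",      16.8973 },
    { "sweeplist",          15.9095 },
    { "singlestep",          7.5942 },
    { "reallymarkobject",    3.4194 },
    { "luaC_fullgc",         1.3071 },
    { "luaC_separateudata",  0.2214 },
    { "luaC_step",           0.1622 },
    { "GCTM",                0.1319 },
};

/* Returns the summed share of all listed GC symbols, in percent. */
double lua_gc_share(void)
{
    double total = 0.0;
    size_t n = sizeof(lua_gc_entries) / sizeof(lua_gc_entries[0]);
    for (size_t i = 0; i < n; i++) {
        total += lua_gc_entries[i].percent;
    }
    return total;
}
```

The sum comes to roughly 45.6% of all samples, so under this reading close to half of the sampled time is spent in Lua garbage collection.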

#2

Post by Toby » Sun Aug 03, 2014 3:27 pm

Yes, we’re aware of the current LDT performance. We’re working on it.

LDTs are relatively young in their evolutionary cycle. That said, Large List (which can do many of the same things as Large Map) is currently a bit more evolved and has shown better performance than Large Map.

There are several improvements in the works. Stay tuned.


Post by Hanson » Sun Aug 03, 2014 8:37 pm

Good to hear that the improvement is in progress.

I tried an llist LDT Bin as suggested, but adding binary data (typically 1024 bytes per item in our project) failed:

err_code: 1300, err_msg: /opt/aerospike/sys/udf/lua/ldt/lib_llist.lua:1282: bad argument #2 to '?' (number expected, got string)

Here is the piece of C code:

    ...
    //Using llist LDT Bin with bytes
    as_ldt llist;
    as_ldt_init(&llist, "myllist", AS_LDT_LLIST, NULL);

    as_bytes bval;
    as_bytes_inita(&bval, 1024);
    as_bytes_append(&bval, (uint8_t*)data, 1024);

    as_error err;
    as_status status = aerospike_llist_add(connection, &err, NULL, &key, &llist, (as_val *)&bval);
    ...

Then I changed it to add integer instead:

    ...
    //Using llist LDT Bin with integer
    as_ldt llist;
    as_ldt_init(&llist, "myllist", AS_LDT_LLIST, NULL);

    as_integer ival;
    as_integer_init(&ival, 123);

    as_error err;
    as_status status = aerospike_llist_add(connection, &err, NULL, &key, &llist, (as_val *)&ival);
    ...

It works, but performance got worse: 9K TPS with 4 Client threads (compared to 10K TPS when using an lmap LDT Bin). The average latency is 0.45ms, also worse than the 0.4ms with an lmap LDT Bin.

The LDT is implemented as built-in UDFs. It looks like the Lua used in the UDFs is a performance killer. Is there any plan to migrate those built-in UDFs from Lua to C?


Post by Hanson » Wed Aug 06, 2014 1:40 am

I also observed that the TPS is unstable when the number of lmap items (sub-records, each with 1KB of bytes) in an LDT Bin exceeds 20. It oscillates between 4K TPS and 10K TPS: it stays at 4K TPS for ~3 seconds, then at 10K TPS for ~9 seconds, and the cycle repeats.


#3

Since my use case has a limited number of sub-records per PK (varying from 1 to 30), I switched to using multiple Bins (of type bytes) and use the Bin name as the sub-key. This gets rid of the LDT performance issue!
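A minimal sketch of how a sub-key could be turned into a Bin name for this approach. The helper name, the "sub_" prefix and the length limit are assumptions for illustration; the real Bin-name limit depends on the Aerospike version and should be checked:

```c
#include <stdio.h>
#include <stddef.h>

/* Assumption: maximum Bin-name length accepted by the server. Check the
 * actual limit for your Aerospike version before relying on this value. */
#define MAX_BIN_NAME_LEN 14

/* Hypothetical helper: derives the Bin name for a sub-key, e.g. 7 -> "sub_7".
 * Returns 0 on success, -1 if the name does not fit the buffer or the limit. */
int subkey_bin_name(char *out, size_t out_size, unsigned sub_key)
{
    int n = snprintf(out, out_size, "sub_%u", sub_key);
    if (n < 0 || (size_t)n >= out_size || n > MAX_BIN_NAME_LEN) {
        return -1;
    }
    return 0;
}
```

Each sub-record would then be stored as a bytes Bin under that name on the same record, for example with `as_record_set_raw()` followed by `aerospike_key_put()` in the C Client, so reading or writing one sub-record is an ordinary Bin operation with no UDF involved.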


#4

One of the tricks to getting the best performance out of LLIST (and LSET/LMAP for that matter) is to use a small value (a field of the object) for comparison (i.e. the key) and keep the rest of the object as the payload. That way, the entire object does not need to participate in the comparisons of the O(log(n)) search.

We have an example that uses this technique – I’ll post the details tomorrow. The quick overview of this trick is as follows: The data object is a map with two fields: One field named “key”, which is an integer field that can be used for ordering, and the other field named “data”, which is the large byte array. The “key” field is used for comparison and ordering, and the “data” field simply goes along for the ride. Let me know if this sort of approach would work for you.
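The idea can be illustrated in plain C, independent of the LDT code. This is only an analogue of the “key”/“data” layout described above, not the LDT implementation or the client API; the struct and function names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Analogue of the object layout: a small integer "key" used for ordering,
 * and a large "data" payload that is never compared. */
struct item {
    int64_t key;          /* small field used for comparison */
    const uint8_t *data;  /* large payload, e.g. 1024 bytes */
    size_t data_len;
};

/* Binary search over items sorted by key. Only the 8-byte keys are
 * compared; the payloads are not touched.
 * Returns the index of the matching item, or -1 if the key is absent. */
long find_by_key(const struct item *items, size_t count, int64_t key)
{
    size_t lo = 0, hi = count;  /* search window is [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (items[mid].key == key) {
            return (long)mid;
        }
        if (items[mid].key < key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return -1;
}
```

Each of the O(log(n)) steps costs one integer comparison, regardless of how large the payload is.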

Also, we’re in the middle of UDF performance enhancements, which will directly affect the LDT performance. We’re finding several improvements that we can make.

Stay tuned.

Toby


#5

Hi Toby,

I am using llist, so I’d like to know how the performance improvements are going. Are there any new benchmarks?

Thanks, Ming


#6

lming,

We are doing some performance improvements. The changes should show up in the next release or so.

– R


#7

Hi, any news on this? I am wondering whether this is caused by any locking involved in the process (like benchmarking usage of only 1 record’s LDT bin) or whether it’s indeed the LDT’s performance that is limiting this very important data type…

Could you provide a rough ratio of how much worse any LDT operation performs compared to simple bin updates (no LDT involved)? Also, wouldn’t it be better to keep the payload out of the LDT and just give it a sortable key and a reference to a corresponding record? That would also normalize the model, in case the record changes.

My use case is to use the LOL-LDT kinda like a secondary index, only that I need a persistent one that is able to grow to an infinite size and doesn’t pollute the memory. If my server is capable of doing 200k (simple) TPS, what can I expect from a LOL with let’s say 100k items in terms of performance? Both latency and throughput-wise. It would be good to have rough estimates before developing something and committing to use AS.

Cheers Manuel


#8

Manuel,

We have done some short term performance improvements in the way the LDT behaves in >=3.5.9.

That is how it is already organized: Large List is implemented as a B+ tree, i.e. the root and non-leaf nodes only contain sort keys. The data is only in the leaf nodes.
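A minimal sketch of that layout in plain C, assuming a single inner level and a tiny fan-out. It only illustrates the B+ tree shape (keys in the root, data in the leaves); the type and function names are hypothetical and it is not the Large List code:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_KEYS 4  /* tiny fan-out, for illustration only */

/* Leaf node: holds the sort keys together with the actual data. */
struct leaf {
    size_t count;
    int64_t keys[MAX_KEYS];
    const char *values[MAX_KEYS];
};

/* Root / inner node: holds only separator keys and child pointers.
 * children[i] covers keys below keys[i]; children[count] covers the rest.
 * This sketch has a single inner level, so all children are leaves. */
struct inner {
    size_t count;
    int64_t keys[MAX_KEYS];
    struct leaf *children[MAX_KEYS + 1];
};

/* Descends from the root to the one leaf that can hold the key, then scans
 * that leaf. Returns the stored value, or NULL if the key is absent. */
const char *bptree_find(const struct inner *root, int64_t key)
{
    size_t i = 0;
    while (i < root->count && key >= root->keys[i]) {
        i++;
    }
    const struct leaf *lf = root->children[i];
    for (size_t j = 0; j < lf->count; j++) {
        if (lf->keys[j] == key) {
            return lf->values[j];
        }
    }
    return NULL;
}
```

The descent through the root only reads sort keys; the data itself is only read once the right leaf has been reached.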

Large List in general is NOT meant for very high performance KVS-style workloads. The reasons are:

  • Llist allows storage of and operations on arbitrary documents as part of a collection.
  • It stores both the data and the access path organized in a single record, and hence has an overhead of a few extra IOs compared to a simple KVS operation.

It is not simple to extrapolate the performance characteristics of LLIST from KVS performance. It depends on a lot of factors:

  • How big is your item size / key size?
  • What is your LDT page size set to? It is 8k by default.
  • Are you doing add() or add_all()? In general, batching write operations should work much faster.
  • What specific hardware are you running on?
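To get a feel for how item size, page size and batching interact, here is a rough back-of-the-envelope sketch. It assumes items are packed into pages with no per-page or per-item overhead, which a real implementation will not achieve, and the function names are hypothetical:

```c
#include <stddef.h>

/* Ceiling division; the divisor must be positive. */
static size_t ceil_div(size_t a, size_t b)
{
    return (a + b - 1) / b;
}

/* Rough estimate of how many leaf pages a list needs.
 * Assumption: no per-page or per-item overhead.
 * Returns 0 if the sizes are invalid or one item does not fit in a page. */
size_t estimated_leaf_pages(size_t item_count, size_t item_size, size_t page_size)
{
    if (item_size == 0 || page_size < item_size) {
        return 0;
    }
    size_t items_per_page = page_size / item_size;
    return ceil_div(item_count, items_per_page);
}

/* Number of client calls needed to write item_count items when each call
 * carries up to batch_size items. A batch_size of 1 models add(), a larger
 * one models add_all(). Returns 0 if batch_size is 0. */
size_t write_calls(size_t item_count, size_t batch_size)
{
    if (batch_size == 0) {
        return 0;
    }
    return ceil_div(item_count, batch_size);
}
```

For example, 100,000 items of 1024 bytes with an 8192-byte page come to 12,500 leaf pages under these assumptions, and writing them in batches of 100 takes 1,000 calls instead of 100,000.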

That said, based on the answers to the above questions I can suggest tuning … then the best way to find out what you get is to try it out on a sample workload with candidate hardware.

– R


#9

@Hanson, @lming and @ManuelSchmidt:

Thank you for posting about LDTs in our forum. Please see the LDT Feature Guide for current LDT recommendations and best practices.


#10

@Hanson, @lming and @ManuelSchmidt ,

Effective immediately, we will no longer actively support the LDT feature and will eventually remove the API. The exact deprecation and removal timeline will depend on customer and community requirements. Instead of LDTs, we advise that you use our newer List and SortedMap APIs, which are now available in all Aerospike-supported clients at the General Availability level. Read our blog post for details.