UDF Timeout on LLIST update, Aerospike v3.5.2 and 3.5.4


#1

Hi, I am getting the exception below from the C# client when inserting a value into an LLIST. It was working fine at 5k+ TPS, then gradually slowed down to approximately 300 TPS, and then I started getting timeouts on list.Update(Value.GetAsMap(dict)). Aerospike server hardware: Azure VM, 4 cores, 7 GB RAM. Client: Azure VM, 4 cores, 7 GB RAM, in the same datacenter.

Database status: 1.3 million master objects, each with an LLIST bin holding 14-15 entries. The namespace is configured to write to disk and keep only indexes in memory. Allocated file size is 245 GB, of which only 10 GB is used by data so far.

Code:

    LargeList list = _asyncClient.GetLargeList(_writePolicy, key, "eod", null);
    var dict = new Dictionary<string, Value>();
    dict["key"] = Value.Get(obj.DataDate.Date.Ticks);
    dict["value"] = Value.Get(obj);
    list.Update(Value.GetAsMap(dict));

Exception:

    System.AggregateException: One or more errors occurred. ---> Aerospike.Client.AerospikeException: Error Code 100: /opt/aerospike/sys/udf/lua/ldt/ldt_common.lua:1264: UDF Execution Timeout
       at Aerospike.Client.SyncCommand.Execute()
       at Aerospike.Client.AerospikeClient.Execute(WritePolicy policy, Key key, String packageName, String functionName, Value[] args)
       at Aerospike.Client.LargeList.Update(Value value)
       at ProjectDelta.DataImporter.Managers.AerospikeClient.WriteOption(Option option)
       at ProjectDelta.DataImporter.Managers.DataManager.<>c__DisplayClass2.<UploadOptions>b__0(Option option)
       at System.Threading.Tasks.Parallel.<>c__DisplayClassf`1.<ForWorker>b__c()
       at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
       at System.Threading.Tasks.Task.<>c__DisplayClass11.<ExecuteSelfReplicating>b__10(Object param0)
       --- End of inner exception stack trace ---
       at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
       at System.Threading.Tasks.Task.Wait()
       at System.Threading.Tasks.Parallel.ForWorker[TLocal](Int32 fromInclusive, Int32 toExclusive, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Func`4 bodyWithLocal, Func`1 localInit, Action`1 localFinally)
       at System.Threading.Tasks.Parallel.ForEachWorker[TSource,TLocal](IEnumerable`1 source, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
       at System.Threading.Tasks.Parallel.ForEach[TSource](IEnumerable`1 source, Action`1 body)
       at ProjectDelta.DataImporter.Managers.DataManager.UploadOptions(IList`1 options)
    ---> (Inner Exception #0) Aerospike.Client.AerospikeException: Error Code 100: /opt/aerospike/sys/udf/lua/ldt/ldt_common.lua:1264: UDF Execution Timeout
       at Aerospike.Client.SyncCommand.Execute()
       at Aerospike.Client.AerospikeClient.Execute(WritePolicy policy, Key key, String packageName, String functionName, Value[] args)
       at Aerospike.Client.LargeList.Update(Value value)
       at ProjectDelta.DataImporter.Managers.AerospikeClient.WriteOption(Option option)
       at ProjectDelta.DataImporter.Managers.DataManager.<>c__DisplayClass2.<UploadOptions>b__0(Option option)
       at System.Threading.Tasks.Parallel.<>c__DisplayClassf`1.<ForWorker>b__c()
       at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
       at System.Threading.Tasks.Task.<>c__DisplayClass11.<ExecuteSelfReplicating>b__10(Object param0)<---

    ---> (Inner Exception #1) Aerospike.Client.AerospikeException: Error Code 100: /opt/aerospike/sys/udf/lua/ldt/lib_llist.lua:2987: UDF Execution Timeout
       at Aerospike.Client.SyncCommand.Execute()
       at Aerospike.Client.AerospikeClient.Execute(WritePolicy policy, Key key, String packageName, String functionName, Value[] args)
       at Aerospike.Client.LargeList.Update(Value value)
       at ProjectDelta.DataImporter.Managers.AerospikeClient.WriteOption(Option option)
       at ProjectDelta.DataImporter.Managers.DataManager.<>c__DisplayClass2.<UploadOptions>b__0(Option option)
       at System.Threading.Tasks.Parallel.<>c__DisplayClassf`1.<ForWorker>b__c()
       at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
       at System.Threading.Tasks.Task.<>c__DisplayClass11.<ExecuteSelfReplicating>b__10(Object param0)<---

UPDATE: The same issue is observed on bare-metal hardware (Ubuntu server, 6 cores, 8 GB RAM, modern motherboard) once the size exceeds 1 million objects, each with one LLIST bin holding 14 records. I suspect the slowdown is caused by re-indexing or sorting (if any, or the lack of it) of records in the LLIST when a new record is added to the list: if I re-run the same data set, writes hold steady at 3.5k TPS because some of the LLIST keys have already been written, then suddenly drop to 300 TPS when new objects with new keys are added. (Not sure if those details help.)
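To make the suspicion above concrete, here is a rough client-side analogy in Go. This is pure speculation about the server's behavior, not Aerospike's actual LLIST implementation (which stores entries in B+tree-style sub-records): the idea is simply that re-writing an existing key in a sorted container is an in-place overwrite, while a brand-new key forces the container to make room.

```go
package main

import (
	"fmt"
	"sort"
)

// upsert inserts or updates key in a sorted slice. Very loosely, this mirrors
// the asymmetry described above: an existing key is an in-place overwrite,
// while a brand-new key forces the sorted structure to shift data.
func upsert(keys []int, key int) ([]int, bool) {
	i := sort.SearchInts(keys, key) // binary search for the insertion point
	if i < len(keys) && keys[i] == key {
		return keys, false // existing key: in-place update, nothing moves
	}
	keys = append(keys, 0)
	copy(keys[i+1:], keys[i:]) // new key: shift the tail right to make room
	keys[i] = key
	return keys, true
}

func main() {
	keys := []int{10, 20, 30}
	keys, inserted := upsert(keys, 20) // re-running existing data
	fmt.Println(keys, inserted)        // [10 20 30] false
	keys, inserted = upsert(keys, 25) // inserting a new key
	fmt.Println(keys, inserted)       // [10 20 25 30] true
}
```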

Looking forward to your reply


#2

This is not expected behavior. The performance of an LLIST bin has no direct correlation with the number of records. Let me rephrase the problem statement:

You have Aerospike running with about 1 million records, each with an LLIST bin containing 14-15 entries. If you update the same data, you get 3.5k TPS, but when you insert new data it drops down to 300 TPS.

Is this right? If so, can you paste the output of:

$ iostat -xmt 5 10
$ top
$ asinfo -v "statistics" (on the server box)

Let's check what the box is really doing before theorizing about what may be happening.

– R


#3

I get hundreds of these per day (version 3.5.4).

Here’s some data:

$ asinfo -v "statistics"
cluster_size=2;cluster_key=9B3F0C57210DA60A;cluster_integrity=true;objects=25568841;sub-records=4185539;total-bytes-disk=3399995097088;used-bytes-disk=19847357312;free-pct-disk=99;total-bytes-memory=113816633344;used-bytes-memory=1904280320;data-used-bytes-memory=0;index-used-bytes-memory=1904280320;sindex-used-bytes-memory=0;free-pct-memory=98;stat_read_reqs=4035511;stat_read_reqs_xdr=0;stat_read_success=3983010;stat_read_errs_notfound=52501;stat_read_errs_other=0;stat_write_reqs=297648542;stat_write_reqs_xdr=0;stat_write_success=297648160;stat_write_errs=0;stat_xdr_pipe_writes=0;stat_xdr_pipe_miss=0;stat_delete_success=9909713;stat_rw_timeout=1015;udf_read_reqs=4035511;udf_read_success=3955020;udf_read_errs_other=80491;udf_write_reqs=272178566;udf_write_success=272178548;udf_write_err_others=18;udf_delete_reqs=0;udf_delete_success=0;udf_delete_err_others=0;udf_lua_errs=0;udf_scan_rec_reqs=0;udf_query_rec_reqs=0;udf_replica_writes=0;stat_proxy_reqs=28;stat_proxy_reqs_xdr=0;stat_proxy_success=28;stat_proxy_errs=0;stat_ldt_proxy=0;stat_cluster_key_trans_to_proxy_retry=0;stat_cluster_key_transaction_reenqueue=0;stat_slow_trans_queue_push=0;stat_slow_trans_queue_pop=0;stat_slow_trans_queue_batch_pop=0;stat_cluster_key_regular_processed=0;stat_cluster_key_prole_retry=0;stat_cluster_key_err_ack_dup_trans_reenqueue=0;stat_cluster_key_partition_transaction_queue_count=0;stat_cluster_key_err_ack_rw_trans_reenqueue=0;stat_expired_objects=0;stat_evicted_objects=0;stat_deleted_set_objects=0;stat_evicted_set_objects=0;stat_evicted_objects_time=0;stat_zero_bin_records=0;stat_nsup_deletes_not_shipped=0;err_tsvc_requests=20;err_out_of_space=0;err_duplicate_proxy_request=0;err_rw_request_not_found=193;err_rw_pending_limit=20;err_rw_cant_put_unique=0;fabric_msgs_sent=595419178;fabric_msgs_rcvd=595419170;paxos_principal=BB9C4F4E711E10A;migrate_msgs_sent=8192;migrate_msgs_recv=16386;migrate_progress_send=0;migrate_progress_recv=0;migrate_num_incoming_accepted=4096;migrate_num_incoming_refused=0;
queue=0;transactions=322236537;reaped_fds=465;tscan_initiate=5;tscan_pending=0;tscan_succeeded=5;tscan_aborted=0;batch_initiate=6401028;batch_queue=0;batch_tree_count=0;batch_timeout=0;batch_errors=0;info_queue=0;delete_queue=0;proxy_in_progress=0;proxy_initiate=28;proxy_action=0;proxy_retry=0;proxy_retry_q_full=0;proxy_unproxy=0;proxy_retry_same_dest=0;proxy_retry_new_dest=0;write_master=297648542;write_prole=110660803;read_dup_prole=0;rw_err_dup_internal=0;rw_err_dup_cluster_key=0;rw_err_dup_send=0;rw_err_write_internal=0;rw_err_write_cluster_key=0;rw_err_write_send=0;rw_err_ack_internal=0;rw_err_ack_nomatch=382;rw_err_ack_badnode=0;client_connections=66;waiting_transactions=0;tree_count=0;record_refs=29754400;record_locks=1;migrate_tx_objs=0;migrate_rx_objs=0;ongoing_write_reqs=0;err_storage_queue_full=0;partition_actual=2052;partition_replica=2044;partition_desync=0;partition_absent=0;partition_object_count=25568840;partition_ref_count=4097;system_free_mem_pct=93;sindex_ucgarbage_found=0;sindex_gc_locktimedout=0;sindex_gc_inactivity_dur=0;sindex_gc_activity_dur=0;sindex_gc_list_creation_time=0;sindex_gc_list_deletion_time=0;sindex_gc_objects_validated=0;sindex_gc_garbage_found=0;sindex_gc_garbage_cleaned=0;system_swapping=false;err_replica_null_node=0;err_replica_non_null_node=0;err_sync_copy_null_node=0;err_sync_copy_null_master=0;storage_defrag_corrupt_record=0;err_write_fail_prole_unknown=0;err_write_fail_prole_generation=0;err_write_fail_unknown=0;err_write_fail_key_exists=0;err_write_fail_generation=0;err_write_fail_generation_xdr=0;err_write_fail_bin_exists=0;err_write_fail_parameter=0;err_write_fail_incompatible_type=0;err_write_fail_noxdr=0;err_write_fail_prole_delete=0;err_write_fail_not_found=0;err_write_fail_key_mismatch=0;err_write_fail_record_too_big=0;err_write_fail_bin_name=0;err_write_fail_bin_not_found=0;err_write_fail_forbidden=0;stat_duplicate_operation=0;uptime=3554466;stat_write_errs_notfound=0;stat_write_errs_other=0;
heartbeat_received_self=0;heartbeat_received_foreign=27449108;query_reqs=0;query_success=0;query_fail=0;query_abort=0;query_avg_rec_count=0;query_short_queue_full=0;query_long_queue_full=0;query_short_running=0;query_long_running=0;query_tracked=0;query_agg=0;query_agg_success=0;query_agg_err=0;query_agg_abort=0;query_agg_avg_rec_count=0;query_lookups=0;query_lookup_success=0;query_lookup_err=0;query_lookup_abort=0;query_lookup_avg_rec_count=0

iostat

2015-05-19 16:14:15
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,52    0,00    0,15    1,25    0,04   98,05

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
xvda              0,00     0,56    0,00    0,62     0,00     0,01    22,98     0,00    0,99   0,36   0,02
xvdb              0,51     0,00  383,50    5,08     0,90     0,54     7,58     0,07    0,19   0,17   6,45
xvdc              0,50     0,00  374,04    5,00     0,88     0,54     7,69     0,07    0,19   0,17   6,29
xvdd              0,49     0,00  388,57    5,13     0,92     0,55     7,65     0,08    0,19   0,17   6,55
xvde              0,51     0,00  388,48    5,13     0,92     0,54     7,59     0,07    0,19   0,16   6,37
xvdf              0,00     0,00    0,00   17,40     0,00     2,18   255,98     0,05    2,90   0,31   0,55
bcache0           0,00     0,00  381,81    0,54     0,90     0,54     7,71     0,00    0,18   0,00   0,00
bcache1           0,00     0,00  372,53    0,54     0,88     0,54     7,81     0,00    0,17   0,00   0,00
bcache2           0,00     0,00  386,91    0,55     0,92     0,55     7,77     0,00    0,18   0,00   0,00
bcache3           0,00     0,00  386,81    0,54     0,92     0,54     7,71     0,00    0,17   0,00   0,00

#4

birkir,

Can you get the tree dump for one of the keys for which you see this problem? You can run it with a higher timeout value:

aql> set timeout 1000000
aql> execute llist.dump('<binname>') on <ns>.<set> where PK='primary key'

The tree will be dumped into the log file. Just copy that part of the log.

What timeout are you running with? Can you share what your top output looks like?

You may want to jump to >=3.5.9; there are a few performance improvements for LDTs in that build, along with simplified tunability.

– R


#5

Well, here are some interesting things:

I’m using the Go client, and I have it retry every time I get this error. It retries 10 times:

2015-05-21T12:06:45+00:00 UDF execution error: /opt/aerospike/sys/udf/lua/ldt/lib_llist.lua:2874: UDF Execution Timeout
2015-05-21T12:06:47+00:00 UDF execution error: /opt/aerospike/sys/udf/lua/ldt/lib_llist.lua:2660: UDF Execution Timeout
2015-05-21T12:06:49+00:00 UDF execution error: /opt/aerospike/sys/udf/lua/ldt/lib_llist.lua:2660: UDF Execution Timeout
2015-05-21T12:06:51+00:00 UDF execution error: /opt/aerospike/sys/udf/lua/ldt/lib_llist.lua:2660: UDF Execution Timeout
2015-05-21T12:06:54+00:00 UDF execution error: /opt/aerospike/sys/udf/lua/ldt/lib_llist.lua:2874: UDF Execution Timeout
2015-05-21T12:06:56+00:00 UDF execution error: /opt/aerospike/sys/udf/lua/ldt/lib_llist.lua:2874: UDF Execution Timeout
2015-05-21T12:06:58+00:00 UDF execution error: /opt/aerospike/sys/udf/lua/ldt/lib_llist.lua:2660: UDF Execution Timeout
2015-05-21T12:07:01+00:00 UDF execution error: /opt/aerospike/sys/udf/lua/ldt/lib_llist.lua:2874: UDF Execution Timeout
2015-05-21T12:07:03+00:00 UDF execution error: /opt/aerospike/sys/udf/lua/ldt/lib_llist.lua:2874: UDF Execution Timeout
2015-05-21T12:07:05+00:00 UDF execution error: /opt/aerospike/sys/udf/lua/ldt/lib_llist.lua:2874: UDF Execution Timeout
2015-05-21T12:07:06+00:00 log.go:97: fetchFromUserStack tries exceeded limit (users:activity_all:2867095): /opt/aerospike/sys/udf/lua/ldt/lib_llist.lua:2874: UDF Execution Timeout

Interestingly, the timeout occurs at different places in the Lua script.

I tried running the dump for this set/bin:

aql> SELECT * FROM api.users where PK=2867095
[
  {
    "LDTCONTROLBIN": null,
    "activity_all": null,
    "activity_own": null
  }
]

aql> execute llist.dump('activity_own') on api.users where PK=2867095
Error: (100) /opt/aerospike/sys/udf/lua/ldt/lib_llist.lua:7815: bad argument #1 to 'set_context' (Record expected, got nil)

I get this regardless of the user (even ones that I know, and can see, have data in the LLIST).

Top is not super interesting, but here it is:

top - 12:30:28 up 43 days, 26 min,  1 user,  load average: 8.34, 8.34, 8.36
Tasks: 183 total,   1 running, 182 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 98.5%id,  1.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  125915244k total,  6240816k used, 119674428k free,   163812k buffers
Swap:        0k total,        0k used,        0k free,  1000908k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                
17666 root      20   0 25.1g 3.7g 2152 S 11.3  3.1  12497:22 asd                                                                                                                                                    
 3871 root      20   0  4300   80    0 S  0.3  0.0 165:29.60 rngd                                                                                                                                                   
47497 birkirb   20   0 15252 1304  928 R  0.3  0.0   0:00.02 top                                                                                                                                                    
    1 root      20   0 19604 1616 1292 S  0.0  0.0   0:06.84 init                                                                                                                                                   
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.09 kthreadd                                                                                                                                               
    3 root      20   0     0    0    0 S  0.0  0.0   0:04.66 ksoftirqd/0                                                                                                                                            
    5 root       0 -20     0    0    0 S  0.0  0.0   0:00.00 kworker/0:0H  

The only timeout in the conf is for the heartbeat (timeout 25); everything else, including the Go client, must be using defaults.

As for upgrading to 3.5.9: if you can confirm this issue has been addressed there, I can explore it. But this is a production system, so it’s not a trivial thing. :frowning:

Appreciate any help you can offer.

Sincerely, Birkir


#6

@birkir,

I will wait for @raj to answer. In the meantime, feel free to bookmark our server release notes web page here so you are always aware of the latest server release enhancements, features and fixes.

Since you’re using the Go client, please be aware that we just released Go client v1.5.2 on May 15. You can view its release notes and download it here.

Regards,

Maud


#7

Birkir,

Oh! The error you see while trying to dump is caused by a bug; it was fixed a while back, but after 3.5.4. :slight_smile:

The default timeout is 1 second, which is picked up if you do not specify one. Looking at the line number, the timeout is hit while scanning the entire data set (which can involve many IOs). With data on disk, 1 second may not be enough to scan an entire LDT if it is huge. What is the size of your lists in terms of number of items, and how many items does your largest LLIST contain?

I would suggest bumping the timeout up to a larger value for the llist.scan calls where you fetch the entire list. You can set the write policy before making the request: http://www.aerospike.com/docs/client/go/usage/kvs/write.html

But whenever you get cycles, consider upgrading to 3.5.9; it is definitely a notch faster and, of course, has a lot of critical bug fixes.

– R


#8

Thank you Raj,

I’ll try bumping up the timeout a bit. Some of the lists can probably get quite large; I’d guess some are in the tens of thousands, though I’d be surprised if any exceed 100K items. Of course, I don’t need to fetch most of the lists: they are usually sorted, and most of the time I only need the first 1K or so. When we built this there was no way for the client to select the first N entries of a sorted LLIST, but I think something like that was in the works. If it’s ready, I’ll look into making use of it and, of course, attempt an upgrade.

Thanks for your help so far, I’ll let you know if the timeout increase doesn’t help.

Regards, Birkir


#9

Birkir,

Release >=3.5.9 has a new llist API:

  • llist.find_first(count)
  • llist.find_last(count) (returns values in reverse order)
  • llist.find_from(key, count) (inclusive of key)

You can get the first 1,000 records by using find_first.
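From aql, a call like the following should fetch the first entries (a hypothetical invocation, assumed by analogy with the llist.dump call earlier in this thread; check the 3.5.9 documentation for the exact argument order):

    aql> execute llist.find_first('activity_all', 1000) on api.users where PK=2867095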

– R


#10

Thanks Raj,

Increasing the timeout worked well (the call comes back in about 2 seconds).

We’ll try an upgrade, and hopefully the Go client will have support for the new LLIST API in its next release. Much appreciated.

Cheers, Birkir


#11

@okonovalenko, @birkir:

Thank you for posting about LDTs in our forum. Please see the LDT Feature Guide for current LDT recommendations and best practices.


#12

@okonovalenko, @birkir:

Effective immediately, we will no longer actively support the LDT feature and will eventually remove the API. The exact deprecation and removal timeline will depend on customer and community requirements. Instead of LDTs, we advise that you use our newer List and SortedMap APIs, which are now available in all Aerospike-supported clients at the General Availability level. Read our blog post for details.