AS server stops randomly (UDF warnings due to write operation record size above 128KB)


#1

Hi, I use Aerospike Community Edition build 3.5.3 on ubuntu 12.04 with 2 servers.
Last few weeks, it started to stop randomly. Here is some logs; (removing INFO level logs)

    Sep 12 2015 17:56:49 GMT: WARNING (udf): (udf_aerospike.c::178) udf__aerospike_get_particle_buf: Invalid Operation [Bin c data too big size=131201]... Fail
Sep 12 2015 17:56:49 GMT: WARNING (udf): (udf_aerospike.c::459) udf_aerospike_setbin: Allocation Error [Map-List: bin c data size too big: pbytes 131201]... Fail
Sep 12 2015 17:57:34 GMT: WARNING (udf): (udf_aerospike.c::178) udf__aerospike_get_particle_buf: Invalid Operation [Bin c data too big size=131201]... Fail
Sep 12 2015 17:57:34 GMT: WARNING (udf): (udf_aerospike.c::459) udf_aerospike_setbin: Allocation Error [Map-List: bin c data size too big: pbytes 131201]... Fail
Sep 12 2015 20:15:57 GMT: WARNING (paxos): (paxos.c::2119) quorum visibility lost! Continuing anyway ...
Sep 12 2015 20:15:57 GMT: WARNING (paxos): (paxos.c::1893) quorum visibility lost! Continuing anyway ...
Sep 13 2015 01:00:13 GMT: WARNING (udf): (udf_aerospike.c::178) udf__aerospike_get_particle_buf: Invalid Operation [Bin c data too big size=131192]... Fail
Sep 13 2015 01:00:13 GMT: WARNING (udf): (udf_aerospike.c::459) udf_aerospike_setbin: Allocation Error [Map-List: bin c data size too big: pbytes 131192]... Fail
Sep 13 2015 02:02:33 GMT: WARNING (udf): (udf_aerospike.c::178) udf__aerospike_get_particle_buf: Invalid Operation [Bin c data too big size=131201]... Fail
Sep 13 2015 02:02:33 GMT: WARNING (udf): (udf_aerospike.c::459) udf_aerospike_setbin: Allocation Error [Map-List: bin c data size too big: pbytes 131201]... Fail
Sep 13 2015 02:04:35 GMT: WARNING (udf): (udf_aerospike.c::178) udf__aerospike_get_particle_buf: Invalid Operation [Bin c data too big size=131201]... Fail

Just after 20:15 it stopped.
Any idea?
Configuration below;

service {
        user root
        group root
        paxos-single-replica-limit 2 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        service-threads 4
        transaction-queues 4
        transaction-threads-per-queue 4
        proto-fd-max 15000
}
logging {
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}
network {
        service {
                address any
                port 3000
        }
        heartbeat {
                mode mesh
                port 3002
                mesh-address 172.31.33.237
                mesh-port 3002
                interval 150
                timeout 20
        }
        fabric {
                port 3001
        }
        info {
                port 3003
        }
}
namespace rtb {
        replication-factor 2
        memory-size 1G
        default-ttl 30d # 30 days, use 0 to never expire/evict.
        storage-engine device {
                device /dev/xvdb
                write-block-size 128K
                defrag-lwm-pct 50
                defrag-startup-minimum 10
        }
}
namespace feed {
        replication-factor 2
        memory-size 3G
        default-ttl 30d # 30 days, use 0 to never expire/evict.
        storage-engine memory
}
namespace test {
        replication-factor 2
        memory-size 1G
        default-ttl 30d # 30 days, use 0 to never expire/evict.
        storage-engine memory
}
mod-lua {
  cache-enabled false
}



And here is asinfo output;

asinfo -l
1 :  node
     BB9BD70D0881F0A
2 :  statistics
     cluster_size=2;cluster_key=3EB65F27C90B3D71;cluster_integrity=true;objects=7926866;sub-records=0;total-bytes-disk=32211206144;used-bytes-disk=4435426816;free-pct-disk=86;total-bytes-memory=5368709120;used-bytes-memory=507319424;data-used-bytes-memory=0;index-used-bytes-memory=507319424;sindex-used-bytes-memory=0;free-pct-memory=90;stat_read_reqs=244133;stat_read_reqs_xdr=0;stat_read_success=244133;stat_read_errs_notfound=0;stat_read_errs_other=0;stat_write_reqs=1383509;stat_write_reqs_xdr=0;stat_write_success=1383509;stat_write_errs=0;stat_xdr_pipe_writes=0;stat_xdr_pipe_miss=0;stat_delete_success=316674;stat_rw_timeout=0;udf_read_reqs=244133;udf_read_success=244133;udf_read_errs_other=0;udf_write_reqs=1066835;udf_write_success=1066835;udf_write_err_others=0;udf_delete_reqs=0;udf_delete_success=0;udf_delete_err_others=0;udf_lua_errs=0;udf_scan_rec_reqs=0;udf_query_rec_reqs=0;udf_replica_writes=0;stat_proxy_reqs=13;stat_proxy_reqs_xdr=0;stat_proxy_success=13;stat_proxy_errs=0;stat_ldt_proxy=0;stat_cluster_key_trans_to_proxy_retry=0;stat_cluster_key_transaction_reenqueue=0;stat_slow_trans_queue_push=0;stat_slow_trans_queue_pop=0;stat_slow_trans_queue_batch_pop=0;stat_cluster_key_regular_processed=0;stat_cluster_key_prole_retry=0;stat_cluster_key_err_ack_dup_trans_reenqueue=0;stat_cluster_key_partition_transaction_queue_count=0;stat_cluster_key_err_ack_rw_trans_reenqueue=0;stat_expired_objects=316674;stat_evicted_objects=0;stat_deleted_set_objects=0;stat_evicted_set_objects=0;stat_evicted_objects_time=0;stat_zero_bin_records=0;stat_nsup_deletes_not_shipped=316589;err_tsvc_requests=0;err_out_of_space=0;err_duplicate_proxy_request=0;err_rw_request_not_found=1;err_rw_pending_limit=0;err_rw_cant_put_unique=0;fabric_msgs_sent=228660;fabric_msgs_rcvd=228644;paxos_principal=BB9BD70D0881F0A;migrate_msgs_sent=202329;migrate_msgs_recv=226909;migrate_progress_send=0;migrate_progress_recv=0;migrate_num_incoming_accepted=12084;migrate_num_incoming_refused=0;queue=0;transactions=2655284;reaped_fds=0;tscan_initiate=0;tscan_pending=0;tscan_succeeded=0;tscan_aborted=0;batch_initiate=0;batch_queue=0;batch_tree_count=0;batch_timeout=0;batch_errors=0;info_queue=0;delete_queue=0;proxy_in_progress=0;proxy_initiate=13;proxy_action=4;proxy_retry=0;proxy_retry_q_full=0;proxy_unproxy=0;proxy_retry_same_dest=0;proxy_retry_new_dest=0;write_master=1383509;write_prole=0;read_dup_prole=2190;rw_err_dup_internal=0;rw_err_dup_cluster_key=0;rw_err_dup_send=0;rw_err_write_internal=0;rw_err_write_cluster_key=0;rw_err_write_send=0;rw_err_ack_internal=0;rw_err_ack_nomatch=0;rw_err_ack_badnode=0;client_connections=15;waiting_transactions=0;tree_count=0;record_refs=7926866;record_locks=0;migrate_tx_objs=0;migrate_rx_objs=0;ongoing_write_reqs=0;err_storage_queue_full=0;partition_actual=6042;partition_replica=0;partition_desync=0;partition_absent=6246;partition_object_count=7926866;partition_ref_count=12288;system_free_mem_pct=90;sindex_ucgarbage_found=0;sindex_gc_locktimedout=0;sindex_gc_inactivity_dur=0;sindex_gc_activity_dur=0;sindex_gc_list_creation_time=0;sindex_gc_list_deletion_time=0;sindex_gc_objects_validated=0;sindex_gc_garbage_found=0;sindex_gc_garbage_cleaned=0;system_swapping=false;err_replica_null_node=0;err_replica_non_null_node=0;err_sync_copy_null_node=0;err_sync_copy_null_master=0;storage_defrag_corrupt_record=0;err_write_fail_prole_unknown=0;err_write_fail_prole_generation=0;err_write_fail_unknown=0;err_write_fail_key_exists=0;err_write_fail_generation=0;err_write_fail_generation_xdr=0;err_write_fail_bin_exists=0;err_write_fail_parameter=0;err_write_fail_incompatible_type=0;err_write_fail_noxdr=0;err_write_fail_prole_delete=0;err_write_fail_not_found=0;err_write_fail_key_mismatch=0;err_write_fail_record_too_big=0;err_write_fail_bin_name=0;err_write_fail_bin_not_found=0;err_write_fail_forbidden=0;stat_duplicate_operation=611;uptime=168358;stat_write_errs_notfound=0;stat_write_errs_other=0;heartbeat_received_self=0;heartbeat_received_foreign=1216041;query_reqs=0;query_success=0;query_fail=0;query_abort=0;query_avg_rec_count=0;query_short_queue_full=0;query_long_queue_full=0;query_short_running=0;query_long_running=0;query_tracked=0;query_agg=0;query_agg_success=0;query_agg_err=0;query_agg_abort=0;query_agg_avg_rec_count=0;query_lookups=0;query_lookup_success=0;query_lookup_err=0;query_lookup_abort=0;query_lookup_avg_rec_count=0
3 :  features
     as_msg;replicas-read;replicas-prole;replicas-write;replicas-master;cluster-generation;partition-info;partition-generation;udf
4 :  cluster-generation
     3
5 :  partition-generation
     37092
6 :  edition
     Aerospike Community Edition
7 :  version
     Aerospike Community Edition build 3.5.3
8 :  build
     3.5.3
9 :  services
     172.31.33.237:3000
10 :  services-alumni
     172.31.33.237:3000

#2

y_srk,

the UDF warning are related to the operation where you are trying to create a record with more than 128K size, the write_block_size is 128K in namespace rtb.

If you ensure that your write operation record size is below 128KB size, this problem should go away.

We have recently released Aerospike v3.6.0. This build has a lot of features and improvements, you may want to consider upgrading your cluster to this latest release.

-samir


#3

Thanks Samir, I increased the write-block-size parameter and planing to upgrade. Thanks for advices.