Timeouts on mostly idle cluster


#1

I have a development cluster that’s almost totally idle. There’s only a small amount of data at the moment.

CPU, Memory, IO, everything is basically idle, latency between machines is low (same placement group, enhanced networking enabled on AWS), yet we occasionally see timeouts and if I leave an aql terminal open I see this kind of thing after a while:

2015-10-28 16:17:39 WARN Invalid services address: ;172.25.21.102:3000
2015-10-28 16:17:40 WARN Invalid services address: ;172.25.21.101:3000
2015-10-28 16:18:58 WARN Invalid services address: ;172.25.21.101:3000
2015-10-28 16:22:18 WARN AEROSPIKE_ERR_TIMEOUT 
2015-10-28 16:22:40 WARN Invalid services address: ;172.25.21.102:3000
2015-10-28 16:22:48 WARN Invalid services address: ;172.25.21.102:3000

Running on the latest Aerospike build.

Clients are also occasionally getting the timeouts and the client requests are all super basic and should be nearly instant (in almost every case they are as well, just this edge case with it occasionally slowing dramatically).


#2

Also here’s the stats from asadm:

~~~~~~~~~~~~~~~~~store_memory Bin Statistics~~~~~~~~~~~~~~~~~
NODE           :   172.25.21.100   172.25.21.101   172.25.21.102   
bin-names-quota:   32768          32768          32768          
num-bin-names  :   0              4              4              

~~~~~~~~~~~~~~~~~~store_disk Bin Statistics~~~~~~~~~~~~~~~~~~
NODE           :   172.25.21.100   172.25.21.101   172.25.21.102   
bin-names-quota:   32768          32768          32768          
num-bin-names  :   145            159            159            
 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                                              :   172.25.21.100       172.25.21.101       172.25.21.102       
aggr_scans_failed                                 :   0                  0                  0                  
aggr_scans_succeeded                              :   0                  0                  0                  
basic_scans_failed                                :   0                  0                  0                  
basic_scans_succeeded                             :   30                 30                 30                 
batch_errors                                      :   0                  0                  0                  
batch_index_complete                              :   37812              39000              36729              
batch_index_errors                                :   0                  0                  0                  
batch_index_initiate                              :   37812              39000              36729              
batch_index_queue                                 :   0:0,0:0,0:0,0:0    0:0,0:0,0:0,0:0    0:0,0:0,0:0,0:0    
batch_index_timeout                               :   0                  0                  0                  
batch_index_unused_buffers                        :   3                  3                  3                  
batch_initiate                                    :   0                  0                  0                  
batch_queue                                       :   0                  0                  0                  
batch_timeout                                     :   0                  0                  0                  
batch_tree_count                                  :   0                  0                  0                  
client_connections                                :   46                 45                 45                 
cluster_integrity                                 :   true               true               true               
cluster_key                                       :   2F7AB1DE6F059951   2F7AB1DE6F059951   2F7AB1DE6F059951   
cluster_size                                      :   3                  3                  3                  
data-used-bytes-memory                            :   433498018          426735070          428693695          
delete_queue                                      :   0                  0                  0                  
err_duplicate_proxy_request                       :   0                  0                  0                  
err_out_of_space                                  :   0                  0                  0                  
err_replica_non_null_node                         :   0                  0                  0                  
err_replica_null_node                             :   0                  0                  0                  
err_rw_cant_put_unique                            :   0                  0                  0                  
err_rw_pending_limit                              :   0                  0                  0                  
err_rw_request_not_found                          :   3                  3                  3                  
err_storage_queue_full                            :   0                  0                  0                  
err_sync_copy_null_master                         :   0                  0                  0                  
err_tsvc_requests                                 :   5401237            5556998            5543907            
err_tsvc_requests_timeout                         :   6                  2                  3                  
err_write_fail_bin_exists                         :   0                  0                  0                  
err_write_fail_bin_name                           :   0                  0                  0                  
err_write_fail_bin_not_found                      :   0                  0                  0                  
err_write_fail_forbidden                          :   0                  0                  0                  
err_write_fail_generation                         :   0                  0                  0                  
err_write_fail_generation_xdr                     :   0                  0                  0                  
err_write_fail_incompatible_type                  :   0                  0                  0                  
err_write_fail_key_exists                         :   0                  0                  0                  
err_write_fail_key_mismatch                       :   0                  0                  0                  
err_write_fail_not_found                          :   0                  0                  0                  
err_write_fail_noxdr                              :   0                  0                  0                  
err_write_fail_parameter                          :   0                  0                  0                  
err_write_fail_prole_delete                       :   0                  0                  0                  
err_write_fail_prole_generation                   :   0                  0                  0                  
err_write_fail_prole_unknown                      :   0                  0                  0                  
err_write_fail_record_too_big                     :   0                  0                  0                  
err_write_fail_unknown                            :   0                  0                  0                  
fabric_msgs_rcvd                                  :   3020716            4076282            5201987            
fabric_msgs_sent                                  :   3020729            4076310            5202008            
free-pct-disk                                     :   97                 97                 97                 
free-pct-memory                                   :   96                 96                 96                 
heartbeat_received_foreign                        :   13371222           13397354           13408547           
heartbeat_received_self                           :   0                  0                  0                  
index-used-bytes-memory                           :   61225920           60300480           60573440           
info_queue                                        :   0                  0                  0                  
migrate_msgs_recv                                 :   777321             1925373            3070153            
migrate_msgs_sent                                 :   773217             1916496            3056501            
migrate_num_incoming_accepted                     :   0                  0                  0                  
migrate_num_incoming_refused                      :   0                  0                  0                  
migrate_progress_recv                             :   0                  0                  0                  
migrate_progress_send                             :   0                  0                  0                  
migrate_rx_objs                                   :   0                  0                  0                  
migrate_tx_objs                                   :   0                  0                  0                  
objects                                           :   956637             942182             946449             
ongoing_write_reqs                                :   0                  0                  0                  
partition_absent                                  :   4090               4110               4088               
partition_actual                                  :   2696               2740               2756               
partition_desync                                  :   0                  0                  0                  
partition_object_count                            :   956637             942182             946449             
partition_ref_count                               :   8192               8192               8192               
partition_replica                                 :   1406               1342               1348               
paxos_principal                                   :   BB9A7DAE2D1540A    BB9A7DAE2D1540A    BB9A7DAE2D1540A    
proxy_action                                      :   5                  1                  2                  
proxy_in_progress                                 :   0                  0                  0                  
proxy_initiate                                    :   0                  4                  3                  
proxy_retry                                       :   0                  0                  0                  
proxy_retry_new_dest                              :   0                  0                  0                  
proxy_retry_q_full                                :   0                  0                  0                  
proxy_retry_same_dest                             :   0                  0                  0                  
proxy_unproxy                                     :   0                  0                  0                  
query_abort                                       :   0                  0                  0                  
query_agg                                         :   565985             566126             566179             
query_agg_abort                                   :   0                  0                  0                  
query_agg_avg_rec_count                           :   0                  0                  0                  
query_agg_err                                     :   0                  0                  0                  
query_agg_success                                 :   565985             566126             566179             
query_avg_rec_count                               :   0                  0                  0                  
query_fail                                        :   0                  0                  0                  
query_long_queue_full                             :   0                  0                  0                  
query_long_reqs                                   :   1235               1264               1249               
query_long_running                                :   0                  0                  0                  
query_lookup_abort                                :   0                  0                  0                  
query_lookup_avg_rec_count                        :   0                  0                  0                  
query_lookup_err                                  :   0                  0                  0                  
query_lookup_success                              :   49215              49333              41978              
query_lookups                                     :   49215              49333              41978              
query_reqs                                        :   615203             615523             615646             
query_short_queue_full                            :   0                  0                  0                  
query_short_reqs                                  :   613965             614195             606908             
query_short_running                               :   0                  0                  0                  
query_success                                     :   615200             615459             608157             
queue                                             :   0                  0                  0                  
read_dup_prole                                    :   0                  0                  0                  
reaped_fds                                        :   20458              14291              16436              
record_locks                                      :   0                  0                  0                  
record_refs                                       :   956655             942195             946460             
rw_err_ack_badnode                                :   0                  0                  0                  
rw_err_ack_internal                               :   0                  0                  0                  
rw_err_ack_nomatch                                :   168                209                166                
rw_err_dup_cluster_key                            :   0                  0                  0                  
rw_err_dup_internal                               :   0                  0                  0                  
rw_err_dup_send                                   :   0                  0                  0                  
rw_err_write_cluster_key                          :   0                  0                  0                  
rw_err_write_internal                             :   0                  0                  0                  
rw_err_write_send                                 :   0                  0                  0                  
scans_active                                      :   0                  0                  0                  
sindex-used-bytes-memory                          :   29201498           30202393           30350817           
sindex_gc_activity_dur                            :   1340556            1284721            1314246            
sindex_gc_garbage_cleaned                         :   59                 339347             339011             
sindex_gc_garbage_found                           :   60                 339849             339245             
sindex_gc_inactivity_dur                          :   1003694445         1007437280         1008720754         
sindex_gc_list_creation_time                      :   1334468            1278568            1308048            
sindex_gc_list_deletion_time                      :   1099               1647               1675               
sindex_gc_locktimedout                            :   0                  0                  1                  
sindex_gc_objects_validated                       :   677873856          677037414          679100843          
sindex_ucgarbage_found                            :   0                  0                  0                  
stat_cluster_key_err_ack_dup_trans_reenqueue      :   0                  0                  0                  
stat_cluster_key_err_ack_rw_trans_reenqueue       :   0                  0                  0                  
stat_cluster_key_partition_transaction_queue_count:   0                  0                  0                  
stat_cluster_key_prole_retry                      :   0                  0                  0                  
stat_cluster_key_regular_processed                :   0                  0                  0                  
stat_cluster_key_trans_to_proxy_retry             :   0                  0                  0                  
stat_cluster_key_transaction_reenqueue            :   0                  0                  0                  
stat_delete_success                               :   83229              82451              82797              
stat_deleted_set_objects                          :   0                  0                  0                  
stat_duplicate_operation                          :   0                  0                  0                  
stat_evicted_objects                              :   0                  0                  0                  
stat_evicted_objects_time                         :   0                  0                  0                  
stat_expired_objects                              :   9300               10083              9864               
stat_ldt_proxy                                    :   0                  0                  0                  
stat_nsup_deletes_not_shipped                     :   19714              19598              19175              
stat_proxy_errs                                   :   0                  0                  0                  
stat_proxy_reqs                                   :   0                  4                  3                  
stat_proxy_reqs_xdr                               :   0                  0                  0                  
stat_proxy_success                                :   0                  4                  3                  
stat_read_errs_notfound                           :   373481             408426             389818             
stat_read_errs_other                              :   0                  0                  0                  
stat_read_reqs                                    :   1023440            950927             956654             
stat_read_reqs_xdr                                :   0                  0                  0                  
stat_read_success                                 :   649959             542501             566836             
stat_rw_timeout                                   :   168                197                166                
stat_slow_trans_queue_batch_pop                   :   0                  0                  0                  
stat_slow_trans_queue_pop                         :   0                  0                  0                  
stat_slow_trans_queue_push                        :   0                  0                  0                  
stat_write_errs                                   :   0                  0                  0                  
stat_write_errs_notfound                          :   0                  0                  0                  
stat_write_errs_other                             :   0                  0                  0                  
stat_write_reqs                                   :   1120817            1017989            999812             
stat_write_reqs_xdr                               :   0                  0                  0                  
stat_write_success                                :   1120649            1017793            999648             
stat_xdr_pipe_miss                                :   0                  0                  0                  
stat_xdr_pipe_writes                              :   0                  0                  0                  
stat_zero_bin_records                             :   0                  0                  0                  
storage_defrag_corrupt_record                     :   0                  0                  0                  
sub-records                                       :   18                 13                 11                 
system_free_mem_pct                               :   93                 92                 92                 
system_swapping                                   :   false              false              false              
total-bytes-disk                                  :   32212254720        32211206144        32212254720        
total-bytes-memory                                :   17179869184        17179869184        17179869184        
transactions                                      :   37547020           37623715           37691928           
tree_count                                        :   0                  0                  0                  
udf_bg_scans_failed                               :   0                  0                  0                  
udf_bg_scans_succeeded                            :   0                  0                  0                  
udf_delete_err_others                             :   0                  0                  0                  
udf_delete_reqs                                   :   59                 63                 58                 
udf_delete_success                                :   59                 63                 58                 
udf_lua_errs                                      :   0                  0                  0                  
udf_query_rec_reqs                                :   0                  0                  0                  
udf_read_errs_other                               :   1                  2                  1                  
udf_read_reqs                                     :   151                232                141                
udf_read_success                                  :   150                230                140                
udf_replica_writes                                :   609                343                62                 
udf_scan_rec_reqs                                 :   0                  0                  0                  
udf_write_err_others                              :   0                  0                  0                  
udf_write_reqs                                    :   1087814            983046             962413             
udf_write_success                                 :   1087814            983046             962413             
uptime                                            :   1006944            1010659            1011969            
used-bytes-disk                                   :   719153024          708126464          711290752          
used-bytes-memory                                 :   523925436          517237943          519617952          
waiting_transactions                              :   0                  0                  0                  
write_master                                      :   1120817            1017989            999812             
write_prole                                       :   1082486            1090510            1089060            

~~~~~~~~~~~~~~~~~~~~~~~store_disk Namespace Statistics~~~~~~~~~~~~~~~~~~~~~~~
NODE                           :   172.25.21.100   172.25.21.101   172.25.21.102   
allow-nonxdr-writes            :   true           true           true           
allow-xdr-writes               :   true           true           true           
available-bin-names            :   32623          32609          32609          
available_pct                  :   97             97             97             
cold-start-evict-ttl           :   4294967295     4294967295     4294967295     
conflict-resolution-policy     :   generation     generation     generation     
current-time                   :   183753415      183754437      183753785      
data-in-memory                 :   true           true           true           
data-used-bytes-memory         :   433498018      426734349      428693695      
default-ttl                    :   0              0              0              
defrag-lwm-pct                 :   50             50             50             
defrag-queue-min               :   0              0              0              
defrag-sleep                   :   1000           1000           1000           
defrag-startup-minimum         :   10             10             10             
dev                            :   /dev/xvdb      /dev/xvdb      /dev/xvdb      
disallow-null-setname          :   false          false          false          
enable-xdr                     :   false          false          false          
evict-tenths-pct               :   5              5              5              
evicted-objects                :   0              0              0              
expired-objects                :   9300           10083          9864           
filesize                       :   17179869184    17179869184    17179869184    
flush-max-ms                   :   1000           1000           1000           
free-pct-disk                  :   97             97             97             
free-pct-memory                :   96             96             96             
fsync-max-sec                  :   0              0              0              
high-water-disk-pct            :   50             50             50             
high-water-memory-pct          :   60             60             60             
hwm-breached                   :   false          false          false          
index-used-bytes-memory        :   61225920       60300352       60573440       
ldt-delete-success             :   0              0              0              
ldt-deletes                    :   0              0              0              
ldt-enabled                    :   true           true           true           
ldt-err-bin-damaged            :   0              0              0              
ldt-err-bin-does-not-exist     :   0              0              0              
ldt-err-bin-exits              :   0              0              0              
ldt-err-capacity-exceeded      :   0              0              0              
ldt-err-createspec             :   0              0              0              
ldt-err-delete-fail            :   0              0              0              
ldt-err-filer                  :   0              0              0              
ldt-err-input-too-large        :   0              0              0              
ldt-err-insert-fail            :   0              0              0              
ldt-err-internal               :   0              0              0              
ldt-err-item-notfound          :   0              0              0              
ldt-err-key                    :   0              0              0              
ldt-err-ldt-not-enabled        :   0              0              0              
ldt-err-op-bintype-mismatch    :   0              0              0              
ldt-err-param                  :   0              0              0              
ldt-err-search-fail            :   0              0              0              
ldt-err-subrec-internal        :   0              0              0              
ldt-err-subrec-not-found       :   0              0              0              
ldt-err-too-many-open-subrec   :   0              0              0              
ldt-err-toprec-internal        :   0              0              0              
ldt-err-toprec-notfound        :   0              0              0              
ldt-err-unique-key-violation   :   0              0              0              
ldt-err-unknown                :   1              2              1              
ldt-err-usermodule             :   0              0              0              
ldt-err-version-mismatch       :   0              0              0              
ldt-errors                     :   0              0              0              
ldt-gc-cnt                     :   0              0              0              
ldt-gc-io                      :   2145354        2803964        2309230        
ldt-page-size                  :   8192           8192           8192           
ldt-randomizer-retry           :   0              0              0              
ldt-read-success               :   142            216            137            
ldt-reads                      :   142            216            137            
ldt-updates                    :   0              0              0              
ldt-write-success              :   427            812            444            
ldt-writes                     :   427            812            444            
master-objects                 :   468368         475980         478285         
master-sub-objects             :   5              5              11             
max-ttl                        :   0              0              0              
max-void-time                  :   183839675      183840547      183840135      
max-write-cache                :   67108864       67108864       67108864       
memory-size                    :   16106127360    16106127360    16106127360    
min-avail-pct                  :   5              5              5              
non-expirable-objects          :   467665         475224         477558         
ns-forward-xdr-writes          :   false          false          false          
nsup-cycle-duration            :   0              0              0              
nsup-cycle-sleep-pct           :   0              0              0              
obj-size-hist-max              :   100            100            100            
objects                        :   956637         942180         946449         
post-write-queue               :   0              0              0              
prole-objects                  :   488269         466200         468164         
prole-sub-objects              :   13             8              0              
read-consistency-level-override:   off            off            off            
repl-factor                    :   2              2              2              
set-deleted-objects            :   0              0              0              
sets-enable-xdr                :   true           true           true           
sindex-used-bytes-memory       :   29201498       30202393       30350817       
single-bin                     :   false          false          false          
stop-writes                    :   false          false          false          
stop-writes-pct                :   90             90             90             
sub-objects                    :   18             13             11             
total-bytes-disk               :   32212254720    32211206144    32212254720    
total-bytes-memory             :   16106127360    16106127360    16106127360    
type                           :   device         device         device         
used-bytes-disk                :   719153024      708126464      711290752      
used-bytes-memory              :   523925436      517237094      519617952      
write-commit-level-override    :   off            off            off            
writecache                     :   67108864       67108864       67108864       
writethreads                   :   1              1              1

#3

Also here’s the config on the dev cluster (same thing on all 3 nodes):

service {
        user root
        group root
        paxos-single-replica-limit 1
        pidfile /var/run/aerospike/asd.pid
        service-threads 4
        transaction-queues 4
        transaction-threads-per-queue 3
        proto-fd-max 15000
}

logging {
        file /var/log/aerospike/aerospike.log {
                context any warn
        }
}

network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode mesh
                port 3002
                mesh-address 172.25.21.101
                mesh-port 3002
                interval 150
                timeout 10
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}
namespace store_disk {
        replication-factor 2
        memory-size 15G
        default-ttl 0 # 30 days, use 0 to never expire/evict.
        ldt-enabled true

        storage-engine device {
                device /dev/xvdb
                scheduler-mode noop
                write-block-size 128K
                data-in-memory true
        }
}

The setup is all relatively simple as you can see. With such a simple setup, not very much load, I’m not entirely sure why we’re seeing that behavior on all 3 nodes.

I feel like we’re missing something obvious. :slight_smile:


#4

Also we did have the service running under an aerospike user, we’ve been simplifying the config moving toward the default config bit by bit to see if the timeouts go away, and as you can see we ended up with a configuration that’s almost exactly the default.


#5

HI Courtneyc,

can you update your aerospike config file according and let me know.

service {
    address any
    port 3000
    access-address <private ip of each server>
            network-interface-name eth0
}

heartbeat {
    mode mesh
    port 3002 # Heartbeat port for this node.
    address <private IP> 
    # List one or more other nodes, one ip-address & port per line:
    mesh-seed-address-port IP <3002> # IP and Port no.
    mesh-seed-address-port IP <3002>  # IP and Port no.
    mesh-seed-address-port IP < 3002>  # IP and Port no.

#6

@courtneyc

I had a similar with timeouts when accessing aws aerospike from a non-aws dev box - check out this - Reply with service IPs on interface where client asks