I have a development cluster that’s almost totally idle. There’s only a small amount of data at the moment.
CPU, memory, and IO are all basically idle, and latency between machines is low (same placement group, enhanced networking enabled on AWS). Yet we occasionally see timeouts, and if I leave an aql terminal open I see this kind of thing after a while:
2015-10-28 16:17:39 WARN Invalid services address: ;172.25.21.102:3000
2015-10-28 16:17:40 WARN Invalid services address: ;172.25.21.101:3000
2015-10-28 16:18:58 WARN Invalid services address: ;172.25.21.101:3000
2015-10-28 16:22:18 WARN AEROSPIKE_ERR_TIMEOUT
2015-10-28 16:22:40 WARN Invalid services address: ;172.25.21.102:3000
2015-10-28 16:22:48 WARN Invalid services address: ;172.25.21.102:3000
We're running the latest Aerospike build.
Clients are also occasionally hitting these timeouts, even though the requests are all very basic and should be nearly instant (in almost every case they are; it's just this edge case where things occasionally slow down dramatically).
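For reference, the leading ';' in those warnings looks like the services list the node hands back starts with an empty entry. A quick way to see exactly what each node is advertising (just a sketch using our node IPs; asinfo ships with the Aerospike tools package):

    asinfo -h 172.25.21.100 -p 3000 -v "service"     # the address this node advertises for itself
    asinfo -h 172.25.21.100 -p 3000 -v "services"    # the peer addresses it hands to clients
    asinfo -h 172.25.21.101 -p 3000 -v "services"
    asinfo -h 172.25.21.102 -p 3000 -v "services"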
Also here’s the stats from asadm:
~~~~~~~~~~~~~~~~~store_memory Bin Statistics~~~~~~~~~~~~~~~~~
NODE : 172.25.21.100 172.25.21.101 172.25.21.102
bin-names-quota: 32768 32768 32768
num-bin-names : 0 4 4
~~~~~~~~~~~~~~~~~~store_disk Bin Statistics~~~~~~~~~~~~~~~~~~
NODE : 172.25.21.100 172.25.21.101 172.25.21.102
bin-names-quota: 32768 32768 32768
num-bin-names : 145 159 159
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE : 172.25.21.100 172.25.21.101 172.25.21.102
aggr_scans_failed : 0 0 0
aggr_scans_succeeded : 0 0 0
basic_scans_failed : 0 0 0
basic_scans_succeeded : 30 30 30
batch_errors : 0 0 0
batch_index_complete : 37812 39000 36729
batch_index_errors : 0 0 0
batch_index_initiate : 37812 39000 36729
batch_index_queue : 0:0,0:0,0:0,0:0 0:0,0:0,0:0,0:0 0:0,0:0,0:0,0:0
batch_index_timeout : 0 0 0
batch_index_unused_buffers : 3 3 3
batch_initiate : 0 0 0
batch_queue : 0 0 0
batch_timeout : 0 0 0
batch_tree_count : 0 0 0
client_connections : 46 45 45
cluster_integrity : true true true
cluster_key : 2F7AB1DE6F059951 2F7AB1DE6F059951 2F7AB1DE6F059951
cluster_size : 3 3 3
data-used-bytes-memory : 433498018 426735070 428693695
delete_queue : 0 0 0
err_duplicate_proxy_request : 0 0 0
err_out_of_space : 0 0 0
err_replica_non_null_node : 0 0 0
err_replica_null_node : 0 0 0
err_rw_cant_put_unique : 0 0 0
err_rw_pending_limit : 0 0 0
err_rw_request_not_found : 3 3 3
err_storage_queue_full : 0 0 0
err_sync_copy_null_master : 0 0 0
err_tsvc_requests : 5401237 5556998 5543907
err_tsvc_requests_timeout : 6 2 3
err_write_fail_bin_exists : 0 0 0
err_write_fail_bin_name : 0 0 0
err_write_fail_bin_not_found : 0 0 0
err_write_fail_forbidden : 0 0 0
err_write_fail_generation : 0 0 0
err_write_fail_generation_xdr : 0 0 0
err_write_fail_incompatible_type : 0 0 0
err_write_fail_key_exists : 0 0 0
err_write_fail_key_mismatch : 0 0 0
err_write_fail_not_found : 0 0 0
err_write_fail_noxdr : 0 0 0
err_write_fail_parameter : 0 0 0
err_write_fail_prole_delete : 0 0 0
err_write_fail_prole_generation : 0 0 0
err_write_fail_prole_unknown : 0 0 0
err_write_fail_record_too_big : 0 0 0
err_write_fail_unknown : 0 0 0
fabric_msgs_rcvd : 3020716 4076282 5201987
fabric_msgs_sent : 3020729 4076310 5202008
free-pct-disk : 97 97 97
free-pct-memory : 96 96 96
heartbeat_received_foreign : 13371222 13397354 13408547
heartbeat_received_self : 0 0 0
index-used-bytes-memory : 61225920 60300480 60573440
info_queue : 0 0 0
migrate_msgs_recv : 777321 1925373 3070153
migrate_msgs_sent : 773217 1916496 3056501
migrate_num_incoming_accepted : 0 0 0
migrate_num_incoming_refused : 0 0 0
migrate_progress_recv : 0 0 0
migrate_progress_send : 0 0 0
migrate_rx_objs : 0 0 0
migrate_tx_objs : 0 0 0
objects : 956637 942182 946449
ongoing_write_reqs : 0 0 0
partition_absent : 4090 4110 4088
partition_actual : 2696 2740 2756
partition_desync : 0 0 0
partition_object_count : 956637 942182 946449
partition_ref_count : 8192 8192 8192
partition_replica : 1406 1342 1348
paxos_principal : BB9A7DAE2D1540A BB9A7DAE2D1540A BB9A7DAE2D1540A
proxy_action : 5 1 2
proxy_in_progress : 0 0 0
proxy_initiate : 0 4 3
proxy_retry : 0 0 0
proxy_retry_new_dest : 0 0 0
proxy_retry_q_full : 0 0 0
proxy_retry_same_dest : 0 0 0
proxy_unproxy : 0 0 0
query_abort : 0 0 0
query_agg : 565985 566126 566179
query_agg_abort : 0 0 0
query_agg_avg_rec_count : 0 0 0
query_agg_err : 0 0 0
query_agg_success : 565985 566126 566179
query_avg_rec_count : 0 0 0
query_fail : 0 0 0
query_long_queue_full : 0 0 0
query_long_reqs : 1235 1264 1249
query_long_running : 0 0 0
query_lookup_abort : 0 0 0
query_lookup_avg_rec_count : 0 0 0
query_lookup_err : 0 0 0
query_lookup_success : 49215 49333 41978
query_lookups : 49215 49333 41978
query_reqs : 615203 615523 615646
query_short_queue_full : 0 0 0
query_short_reqs : 613965 614195 606908
query_short_running : 0 0 0
query_success : 615200 615459 608157
queue : 0 0 0
read_dup_prole : 0 0 0
reaped_fds : 20458 14291 16436
record_locks : 0 0 0
record_refs : 956655 942195 946460
rw_err_ack_badnode : 0 0 0
rw_err_ack_internal : 0 0 0
rw_err_ack_nomatch : 168 209 166
rw_err_dup_cluster_key : 0 0 0
rw_err_dup_internal : 0 0 0
rw_err_dup_send : 0 0 0
rw_err_write_cluster_key : 0 0 0
rw_err_write_internal : 0 0 0
rw_err_write_send : 0 0 0
scans_active : 0 0 0
sindex-used-bytes-memory : 29201498 30202393 30350817
sindex_gc_activity_dur : 1340556 1284721 1314246
sindex_gc_garbage_cleaned : 59 339347 339011
sindex_gc_garbage_found : 60 339849 339245
sindex_gc_inactivity_dur : 1003694445 1007437280 1008720754
sindex_gc_list_creation_time : 1334468 1278568 1308048
sindex_gc_list_deletion_time : 1099 1647 1675
sindex_gc_locktimedout : 0 0 1
sindex_gc_objects_validated : 677873856 677037414 679100843
sindex_ucgarbage_found : 0 0 0
stat_cluster_key_err_ack_dup_trans_reenqueue : 0 0 0
stat_cluster_key_err_ack_rw_trans_reenqueue : 0 0 0
stat_cluster_key_partition_transaction_queue_count: 0 0 0
stat_cluster_key_prole_retry : 0 0 0
stat_cluster_key_regular_processed : 0 0 0
stat_cluster_key_trans_to_proxy_retry : 0 0 0
stat_cluster_key_transaction_reenqueue : 0 0 0
stat_delete_success : 83229 82451 82797
stat_deleted_set_objects : 0 0 0
stat_duplicate_operation : 0 0 0
stat_evicted_objects : 0 0 0
stat_evicted_objects_time : 0 0 0
stat_expired_objects : 9300 10083 9864
stat_ldt_proxy : 0 0 0
stat_nsup_deletes_not_shipped : 19714 19598 19175
stat_proxy_errs : 0 0 0
stat_proxy_reqs : 0 4 3
stat_proxy_reqs_xdr : 0 0 0
stat_proxy_success : 0 4 3
stat_read_errs_notfound : 373481 408426 389818
stat_read_errs_other : 0 0 0
stat_read_reqs : 1023440 950927 956654
stat_read_reqs_xdr : 0 0 0
stat_read_success : 649959 542501 566836
stat_rw_timeout : 168 197 166
stat_slow_trans_queue_batch_pop : 0 0 0
stat_slow_trans_queue_pop : 0 0 0
stat_slow_trans_queue_push : 0 0 0
stat_write_errs : 0 0 0
stat_write_errs_notfound : 0 0 0
stat_write_errs_other : 0 0 0
stat_write_reqs : 1120817 1017989 999812
stat_write_reqs_xdr : 0 0 0
stat_write_success : 1120649 1017793 999648
stat_xdr_pipe_miss : 0 0 0
stat_xdr_pipe_writes : 0 0 0
stat_zero_bin_records : 0 0 0
storage_defrag_corrupt_record : 0 0 0
sub-records : 18 13 11
system_free_mem_pct : 93 92 92
system_swapping : false false false
total-bytes-disk : 32212254720 32211206144 32212254720
total-bytes-memory : 17179869184 17179869184 17179869184
transactions : 37547020 37623715 37691928
tree_count : 0 0 0
udf_bg_scans_failed : 0 0 0
udf_bg_scans_succeeded : 0 0 0
udf_delete_err_others : 0 0 0
udf_delete_reqs : 59 63 58
udf_delete_success : 59 63 58
udf_lua_errs : 0 0 0
udf_query_rec_reqs : 0 0 0
udf_read_errs_other : 1 2 1
udf_read_reqs : 151 232 141
udf_read_success : 150 230 140
udf_replica_writes : 609 343 62
udf_scan_rec_reqs : 0 0 0
udf_write_err_others : 0 0 0
udf_write_reqs : 1087814 983046 962413
udf_write_success : 1087814 983046 962413
uptime : 1006944 1010659 1011969
used-bytes-disk : 719153024 708126464 711290752
used-bytes-memory : 523925436 517237943 519617952
waiting_transactions : 0 0 0
write_master : 1120817 1017989 999812
write_prole : 1082486 1090510 1089060
~~~~~~~~~~~~~~~~~~~~~~~store_disk Namespace Statistics~~~~~~~~~~~~~~~~~~~~~~~
NODE : 172.25.21.100 172.25.21.101 172.25.21.102
allow-nonxdr-writes : true true true
allow-xdr-writes : true true true
available-bin-names : 32623 32609 32609
available_pct : 97 97 97
cold-start-evict-ttl : 4294967295 4294967295 4294967295
conflict-resolution-policy : generation generation generation
current-time : 183753415 183754437 183753785
data-in-memory : true true true
data-used-bytes-memory : 433498018 426734349 428693695
default-ttl : 0 0 0
defrag-lwm-pct : 50 50 50
defrag-queue-min : 0 0 0
defrag-sleep : 1000 1000 1000
defrag-startup-minimum : 10 10 10
dev : /dev/xvdb /dev/xvdb /dev/xvdb
disallow-null-setname : false false false
enable-xdr : false false false
evict-tenths-pct : 5 5 5
evicted-objects : 0 0 0
expired-objects : 9300 10083 9864
filesize : 17179869184 17179869184 17179869184
flush-max-ms : 1000 1000 1000
free-pct-disk : 97 97 97
free-pct-memory : 96 96 96
fsync-max-sec : 0 0 0
high-water-disk-pct : 50 50 50
high-water-memory-pct : 60 60 60
hwm-breached : false false false
index-used-bytes-memory : 61225920 60300352 60573440
ldt-delete-success : 0 0 0
ldt-deletes : 0 0 0
ldt-enabled : true true true
ldt-err-bin-damaged : 0 0 0
ldt-err-bin-does-not-exist : 0 0 0
ldt-err-bin-exits : 0 0 0
ldt-err-capacity-exceeded : 0 0 0
ldt-err-createspec : 0 0 0
ldt-err-delete-fail : 0 0 0
ldt-err-filer : 0 0 0
ldt-err-input-too-large : 0 0 0
ldt-err-insert-fail : 0 0 0
ldt-err-internal : 0 0 0
ldt-err-item-notfound : 0 0 0
ldt-err-key : 0 0 0
ldt-err-ldt-not-enabled : 0 0 0
ldt-err-op-bintype-mismatch : 0 0 0
ldt-err-param : 0 0 0
ldt-err-search-fail : 0 0 0
ldt-err-subrec-internal : 0 0 0
ldt-err-subrec-not-found : 0 0 0
ldt-err-too-many-open-subrec : 0 0 0
ldt-err-toprec-internal : 0 0 0
ldt-err-toprec-notfound : 0 0 0
ldt-err-unique-key-violation : 0 0 0
ldt-err-unknown : 1 2 1
ldt-err-usermodule : 0 0 0
ldt-err-version-mismatch : 0 0 0
ldt-errors : 0 0 0
ldt-gc-cnt : 0 0 0
ldt-gc-io : 2145354 2803964 2309230
ldt-page-size : 8192 8192 8192
ldt-randomizer-retry : 0 0 0
ldt-read-success : 142 216 137
ldt-reads : 142 216 137
ldt-updates : 0 0 0
ldt-write-success : 427 812 444
ldt-writes : 427 812 444
master-objects : 468368 475980 478285
master-sub-objects : 5 5 11
max-ttl : 0 0 0
max-void-time : 183839675 183840547 183840135
max-write-cache : 67108864 67108864 67108864
memory-size : 16106127360 16106127360 16106127360
min-avail-pct : 5 5 5
non-expirable-objects : 467665 475224 477558
ns-forward-xdr-writes : false false false
nsup-cycle-duration : 0 0 0
nsup-cycle-sleep-pct : 0 0 0
obj-size-hist-max : 100 100 100
objects : 956637 942180 946449
post-write-queue : 0 0 0
prole-objects : 488269 466200 468164
prole-sub-objects : 13 8 0
read-consistency-level-override: off off off
repl-factor : 2 2 2
set-deleted-objects : 0 0 0
sets-enable-xdr : true true true
sindex-used-bytes-memory : 29201498 30202393 30350817
single-bin : false false false
stop-writes : false false false
stop-writes-pct : 90 90 90
sub-objects : 18 13 11
total-bytes-disk : 32212254720 32211206144 32212254720
total-bytes-memory : 16106127360 16106127360 16106127360
type : device device device
used-bytes-disk : 719153024 708126464 711290752
used-bytes-memory : 523925436 517237094 519617952
write-commit-level-override : off off off
writecache : 67108864 67108864 67108864
writethreads : 1 1 1
Also here’s the config on the dev cluster (same thing on all 3 nodes):
service {
    user root
    group root
    paxos-single-replica-limit 1
    pidfile /var/run/aerospike/asd.pid
    service-threads 4
    transaction-queues 4
    transaction-threads-per-queue 3
    proto-fd-max 15000
}

logging {
    file /var/log/aerospike/aerospike.log {
        context any warn
    }
}

network {
    service {
        address any
        port 3000
    }
    heartbeat {
        mode mesh
        port 3002
        mesh-address 172.25.21.101
        mesh-port 3002
        interval 150
        timeout 10
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}

namespace store_disk {
    replication-factor 2
    memory-size 15G
    default-ttl 0 # use 0 to never expire/evict
    ldt-enabled true
    storage-engine device {
        device /dev/xvdb
        scheduler-mode noop
        write-block-size 128K
        data-in-memory true
    }
}
As you can see, the setup is relatively simple. With such a simple setup and so little load, I'm not sure why we're seeing this behavior on all 3 nodes.
I feel like we're missing something obvious.
Also, we originally had the service running under an aerospike user; we've been simplifying the config toward the defaults bit by bit to see if the timeouts go away, and as you can see we've ended up with a configuration that's almost exactly the default.
Hi courtneyc,
Can you update the network section of your Aerospike config file as below and let me know?
service {
    address any
    port 3000
    access-address <private IP of this server>
    network-interface-name eth0
}
heartbeat {
    mode mesh
    port 3002                # heartbeat port for this node
    address <private IP>
    # List one or more other nodes, one ip-address & port per line:
    mesh-seed-address-port <IP> 3002
    mesh-seed-address-port <IP> 3002
    mesh-seed-address-port <IP> 3002
}
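Just to illustrate (filling in the private IPs from your asadm output and assuming eth0 is the private interface), the network section on the 172.25.21.100 node would look something like the below; change access-address and the heartbeat address to each node's own IP:

network {
    service {
        address any
        port 3000
        access-address 172.25.21.100      # this node's private IP
        network-interface-name eth0
    }
    heartbeat {
        mode mesh
        port 3002
        address 172.25.21.100
        mesh-seed-address-port 172.25.21.100 3002
        mesh-seed-address-port 172.25.21.101 3002
        mesh-seed-address-port 172.25.21.102 3002
        interval 150
        timeout 10
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}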
naoum (November 7, 2015, 5:02pm):
@courtneyc
I had a similar issue with timeouts when accessing AWS Aerospike from a non-AWS dev box. Check out this thread: "Reply with service IPs on interface where client asks".
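One quick way to confirm whether that's what's happening (just a sketch; substitute your own seed node address) is to run, from the dev box, the same info request the client makes at connect time, and then check that every address in the reply is actually reachable from there:

    asinfo -h <seed node IP> -p 3000 -v "services"
    # then, for each address returned:
    nc -vz 172.25.21.101 3000
    nc -vz 172.25.21.102 3000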