Query fails during migration

Hello everyone,

I’m new to Aerospike and currently using the Community Edition. My setup consists of three nodes with a replication factor of 2. I use the Java client with the default QueryPolicy, plus JDBC, to retrieve data from Aerospike.

Everything works smoothly until one of my nodes undergoes migration. When a node goes down, I can still query data without any issues. However, after restarting that node, queries start to fail: the client consistently returns a “Max retries exceeded” error, and the JDBC client produces errors as well.

Upon investigating the query function in the Aerospike client library (AerospikeClient.java), I came across this line of code:

Node[] nodes = cluster.validateNodes();

I noticed that when a node goes down, the ‘nodes’ array shrinks to 2 as the stopped node is removed, and the client successfully returns query results. However, when I start the node again, the ‘nodes’ array grows back to 3 even though the node’s migrations haven’t completed yet. This leads to query failures.

Is this behavior expected, and is there something I can do to prevent my client from querying nodes that haven’t completed migration?

Thank you for reading.

First, enable Java client logging to receive cluster tend errors (see the Logging page in the Aerospike developer docs).
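
A minimal sketch of turning that on (Log.setLevel and Log.setCallback are the client’s standard logging hooks; register them once, before constructing AerospikeClient):

import com.aerospike.client.Log;

// Route Aerospike Java client log messages (including cluster tend
// warnings) to stdout at DEBUG verbosity.
Log.setLevel(Log.Level.DEBUG);
Log.setCallback((level, message) -> System.out.println(level + " " + message));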

What is the Java client version? What is the full query exception message?

The client is designed to handle queries while migrations are in progress. Sometimes “partition unavailable” errors do occur; adding a longer QueryPolicy.sleepBetweenRetries helps.

Thank you, Brian, for the advice. I’ve added ‘sleepBetweenRetries’, but it didn’t help. I’m using aerospike-client version 7.0.0, and the returned exception is:

com.aerospike.client.AerospikeException: Error -11,11,0,30000,0,10: Max retries exceeded: Error -11,11,0,30000,0,10: Max retries exceeded

My query code:

        // Statement and recordList were not declared in the original snippet;
        // the declarations below make it self-contained (the namespace and set
        // names are placeholders).
        Statement statement = new Statement();
        statement.setNamespace("test1");   // placeholder namespace
        statement.setSetName("product");   // placeholder set
        statement.setFilter(Filter.equal("productId", "pdt87"));

        QueryPolicy queryPolicy = new QueryPolicy();
        queryPolicy.filterExp = Exp.build(Exp.eq(Exp.intBin("storeId"), Exp.val(47)));
        queryPolicy.replica = Replica.SEQUENCE;

        List<Map<String, Object>> recordList = new ArrayList<>();
        try (RecordSet recordSet = aerospikeClient.query(queryPolicy, statement)) {
            while (recordSet.next()) {
                recordList.add(recordSet.getRecord().bins);
            }
        }
        return recordList;

Client Logging:

2023-09-24 00:01:38 ICT DEBUG Update peers for node 97 10.1.24.97 3000
2023-09-24 00:01:38 ICT INFO Add node 97 10.1.24.97 3000
2023-09-24 00:01:38 ICT INFO Add node 99 10.1.24.99 3000
2023-09-24 00:01:38 ICT INFO Add node 98 10.1.24.98 3000
2023-09-24 00:01:38 ICT DEBUG Update peers for node 99 10.1.24.99 3000
2023-09-24 00:01:38 ICT DEBUG Update peers for node 98 10.1.24.98 3000
2023-09-24 00:01:38 ICT DEBUG Update partition map for node 97 10.1.24.97 3000
2023-09-24 00:01:38 ICT DEBUG Update partition map for node 99 10.1.24.99 3000
2023-09-24 00:01:38 ICT DEBUG Update partition map for node 98 10.1.24.98 3000
2023-09-24 00:01:38 ICT DEBUG Add seed 10.1.24.97 3000
2023-09-24 00:01:38 ICT DEBUG Add seed 10.1.24.99 3000
2023-09-24 00:01:38 ICT DEBUG Add seed 10.1.24.98 3000
2023-09-24 00:02:01 ICT WARN Node 98 10.1.24.98 3000 refresh failed: Error -8: java.io.EOFException
2023-09-24 00:02:01 ICT DEBUG Update peers for node 97 10.1.24.97 3000
2023-09-24 00:02:01 ICT DEBUG Update peers for node 99 10.1.24.99 3000
2023-09-24 00:02:04 ICT WARN Node 98 10.1.24.98 3000 refresh failed: Error -8: java.net.ConnectException: Connection refused: connect
2023-09-24 00:02:04 ICT DEBUG Update peers for node 97 10.1.24.97 3000
2023-09-24 00:02:04 ICT DEBUG Update peers for node 99 10.1.24.99 3000
2023-09-24 00:02:04 ICT INFO Remove node 98 10.1.24.98 3000
2023-09-24 00:02:05 ICT DEBUG Update partition map for node 97 10.1.24.97 3000
2023-09-24 00:02:05 ICT INFO Namespace dev replication factor changed from 3 to 2
2023-09-24 00:02:05 ICT DEBUG Update partition map for node 99 10.1.24.99 3000
2023-09-24 00:02:06 ICT DEBUG Update partition map for node 97 10.1.24.97 3000
2023-09-24 00:02:06 ICT DEBUG Update partition map for node 99 10.1.24.99 3000
2023-09-24 00:02:07 ICT DEBUG Update partition map for node 97 10.1.24.97 3000
2023-09-24 00:02:07 ICT DEBUG Update partition map for node 99 10.1.24.99 3000
2023-09-24 00:02:08 ICT DEBUG Update partition map for node 97 10.1.24.97 3000
2023-09-24 00:02:08 ICT DEBUG Update partition map for node 99 10.1.24.99 3000
2023-09-24 00:02:09 ICT DEBUG Update partition map for node 97 10.1.24.97 3000
2023-09-24 00:02:09 ICT DEBUG Update partition map for node 99 10.1.24.99 3000
2023-09-24 00:02:10 ICT DEBUG Update partition map for node 97 10.1.24.97 3000
2023-09-24 00:02:10 ICT DEBUG Update partition map for node 99 10.1.24.99 3000
2023-09-24 00:02:11 ICT DEBUG Update partition map for node 97 10.1.24.97 3000
2023-09-24 00:02:11 ICT DEBUG Update partition map for node 99 10.1.24.99 3000
2023-09-24 00:02:12 ICT DEBUG Update partition map for node 97 10.1.24.97 3000
2023-09-24 00:02:12 ICT DEBUG Update partition map for node 99 10.1.24.99 3000
2023-09-24 00:02:14 ICT DEBUG Update partition map for node 97 10.1.24.97 3000
2023-09-24 00:02:14 ICT DEBUG Update partition map for node 99 10.1.24.99 3000
2023-09-24 00:02:15 ICT DEBUG Update partition map for node 97 10.1.24.97 3000
2023-09-24 00:02:15 ICT DEBUG Update partition map for node 99 10.1.24.99 3000
2023-09-24 00:02:41 ICT DEBUG Update peers for node 97 10.1.24.97 3000
2023-09-24 00:02:41 ICT DEBUG Update peers for node 99 10.1.24.99 3000
2023-09-24 00:02:41 ICT INFO Add node 98 10.1.24.98 3000
2023-09-24 00:02:41 ICT DEBUG Update peers for node 98 10.1.24.98 3000
2023-09-24 00:02:41 ICT DEBUG Update partition map for node 97 10.1.24.97 3000
2023-09-24 00:02:41 ICT INFO Namespace dev replication factor changed from 2 to 3
2023-09-24 00:02:41 ICT DEBUG Update partition map for node 99 10.1.24.99 3000
2023-09-24 00:02:41 ICT DEBUG Update partition map for node 98 10.1.24.98 3000
2023-09-24 00:02:42 ICT DEBUG Update partition map for node 97 10.1.24.97 3000
2023-09-24 00:02:42 ICT DEBUG Update partition map for node 99 10.1.24.99 3000

My server log:

Sep 24 2023 00:02:37 GMT+0700: WARNING (nsup): (nsup.c:1266) {test1} cold start found no records below eviction void-time 433267867 - threshold bucket 83310, width 1 sec, count 3971 > target 3237 (0.5 pct)
Sep 24 2023 00:02:37 GMT+0700: WARNING (nsup): (nsup.c:1198) {test1} hwm breached but nothing to evict
Sep 24 2023 00:02:37 GMT+0700: WARNING (nsup): (nsup.c:699) {test1} breached eviction hwm (memory), memory sz:225985792 (41436416 + 0 + 184549376 + 0) hwm:128849018, disk sz:164717264 hwm:483183820
Sep 24 2023 00:02:37 GMT+0700: INFO (nsup): (nsup.c:1152) {test1} cold start building evict histogram ...
Sep 24 2023 00:02:37 GMT+0700: WARNING (nsup): (nsup.c:1266) {test1} cold start found no records below eviction void-time 433267867 - threshold bucket 83310, width 1 sec, count 3971 > target 3237 (0.5 pct)
Sep 24 2023 00:02:37 GMT+0700: WARNING (nsup): (nsup.c:1198) {test1} hwm breached but nothing to evict
Sep 24 2023 00:02:37 GMT+0700: INFO (drv_ssd): (drv_ssd.c:2914) device /opt/aerospike-store/test1.dat: read complete: UNIQUE 647444 (REPLACED 0) (OLDER 345908) (EXPIRED 0) (EVICTED 7708) records
Sep 24 2023 00:02:37 GMT+0700: INFO (sindex): (populate.c:194) {test1} populating sindex by index scan
Sep 24 2023 00:02:38 GMT+0700: INFO (sindex): (populate.c:152) {test1} sindex population done
Sep 24 2023 00:02:38 GMT+0700: INFO (drv_ssd): (drv_ssd.c:1046) {dev} loading free & defrag queues
Sep 24 2023 00:02:38 GMT+0700: INFO (drv_ssd): (drv_ssd.c:983) /data/dev.dat init defrag profile: 0
Sep 24 2023 00:02:38 GMT+0700: INFO (drv_ssd): (drv_ssd.c:1065) /data/dev.dat init wblocks: pristine-id 64 pristine 16320 free-q 0, defrag-q 0
Sep 24 2023 00:02:38 GMT+0700: INFO (drv_ssd): (drv_ssd.c:2333) {dev} starting device maintenance threads
Sep 24 2023 00:02:38 GMT+0700: INFO (drv_ssd): (drv_ssd.c:1678) {dev} starting write threads
Sep 24 2023 00:02:38 GMT+0700: INFO (drv_ssd): (drv_ssd.c:903) {dev} starting defrag threads
Sep 24 2023 00:02:38 GMT+0700: INFO (drv_ssd): (drv_ssd.c:1046) {test1} loading free & defrag queues
Sep 24 2023 00:02:38 GMT+0700: INFO (drv_ssd): (drv_ssd.c:983) /opt/aerospike-store/test1.dat init defrag profile: 0
Sep 24 2023 00:02:38 GMT+0700: INFO (drv_ssd): (drv_ssd.c:1065) /opt/aerospike-store/test1.dat init wblocks: pristine-id 2006 pristine 366634 free-q 675, defrag-q 0
Sep 24 2023 00:02:38 GMT+0700: INFO (drv_ssd): (drv_ssd.c:2333) {test1} starting device maintenance threads
Sep 24 2023 00:02:38 GMT+0700: INFO (drv_ssd): (drv_ssd.c:1678) {test1} starting write threads
Sep 24 2023 00:02:38 GMT+0700: INFO (drv_ssd): (drv_ssd.c:903) {test1} starting defrag threads
Sep 24 2023 00:02:38 GMT+0700: INFO (as): (as.c:382) initializing services...
Sep 24 2023 00:02:38 GMT+0700: INFO (service): (service.c:167) starting 20 service threads
Sep 24 2023 00:02:38 GMT+0700: INFO (hb): (hb.c:6793) added new mesh seed 10.1.24.97:3002
Sep 24 2023 00:02:38 GMT+0700: INFO (hb): (hb.c:6793) added new mesh seed 10.1.24.98:3002
Sep 24 2023 00:02:38 GMT+0700: INFO (fabric): (fabric.c:791) updated fabric published address list to {10.1.24.98:3001}
Sep 24 2023 00:02:38 GMT+0700: INFO (partition): (partition_balance.c:203) {dev} 4096 partitions: found 0 absent, 4096 stored
Sep 24 2023 00:02:38 GMT+0700: INFO (partition): (partition_balance.c:203) {test1} 4096 partitions: found 1423 absent, 2673 stored
Sep 24 2023 00:02:38 GMT+0700: INFO (hb): (hb.c:5523) updated heartbeat published address list to {10.1.24.98:3002}
Sep 24 2023 00:02:39 GMT+0700: INFO (batch): (batch.c:814) starting 8 batch-index-threads
Sep 24 2023 00:02:39 GMT+0700: INFO (health): (health.c:318) starting health monitor thread
Sep 24 2023 00:02:39 GMT+0700: INFO (fabric): (fabric.c:416) starting 8 fabric send threads
Sep 24 2023 00:02:39 GMT+0700: INFO (fabric): (fabric.c:430) starting 16 fabric rw channel recv threads
Sep 24 2023 00:02:39 GMT+0700: INFO (fabric): (fabric.c:430) starting 4 fabric ctrl channel recv threads
Sep 24 2023 00:02:39 GMT+0700: INFO (fabric): (fabric.c:430) starting 8 fabric bulk channel recv threads
Sep 24 2023 00:02:39 GMT+0700: INFO (fabric): (fabric.c:430) starting 4 fabric meta channel recv threads
Sep 24 2023 00:02:39 GMT+0700: INFO (fabric): (fabric.c:442) starting fabric accept thread
Sep 24 2023 00:02:39 GMT+0700: INFO (hb): (hb.c:6978) initializing mesh heartbeat socket: 10.1.24.98:3002
Sep 24 2023 00:02:39 GMT+0700: INFO (fabric): (socket.c:818) Started fabric endpoint 0.0.0.0:3001
Sep 24 2023 00:02:39 GMT+0700: INFO (hb): (hb.c:7008) mtu of the network is 1500
Sep 24 2023 00:02:39 GMT+0700: INFO (hb): (socket.c:818) Started mesh heartbeat endpoint 10.1.24.98:3002
Sep 24 2023 00:02:39 GMT+0700: INFO (nsup): (nsup.c:197) starting namespace supervisor threads
Sep 24 2023 00:02:39 GMT+0700: INFO (service): (service.c:942) starting reaper thread
Sep 24 2023 00:02:39 GMT+0700: INFO (service): (socket.c:818) Started client endpoint 0.0.0.0:3000
Sep 24 2023 00:02:39 GMT+0700: INFO (service): (service.c:199) starting accept thread
Sep 24 2023 00:02:39 GMT+0700: INFO (info-port): (thr_info_port.c:297) starting info port thread
Sep 24 2023 00:02:39 GMT+0700: INFO (info-port): (socket.c:818) Started info endpoint 0.0.0.0:3003
Sep 24 2023 00:02:39 GMT+0700: INFO (as): (as.c:421) service ready: soon there will be cake!
Sep 24 2023 00:02:39 GMT+0700: INFO (hb): (hb.c:6344) removing self seed entry host:10.1.24.98 port:3002
Sep 24 2023 00:02:39 GMT+0700: INFO (hb): (hb.c:6832) removed mesh seed host:10.1.24.98 port 3002
Sep 24 2023 00:02:39 GMT+0700: INFO (hb): (hb.c:8581) node arrived 97
Sep 24 2023 00:02:39 GMT+0700: INFO (fabric): (fabric.c:2580) fabric: node 97 arrived
Sep 24 2023 00:02:39 GMT+0700: INFO (hb): (hb.c:8581) node arrived 99
Sep 24 2023 00:02:39 GMT+0700: INFO (clustering): (clustering.c:5992) sent cluster join request to 99
Sep 24 2023 00:02:39 GMT+0700: INFO (fabric): (fabric.c:2580) fabric: node 99 arrived
Sep 24 2023 00:02:39 GMT+0700: INFO (hb): (hb.c:4376) found redundant connections to same node (98) - choosing at random
Sep 24 2023 00:02:40 GMT+0700: INFO (nsup): (nsup.c:1006) {test1} collecting ttl & object size info ...
Sep 24 2023 00:02:40 GMT+0700: INFO (nsup): (nsup.c:1081) {test1} ... done collecting ttl & object size info
Sep 24 2023 00:02:40 GMT+0700: INFO (clustering): (clustering.c:5797) applied new cluster key 62038cac02c5
Sep 24 2023 00:02:40 GMT+0700: INFO (clustering): (clustering.c:5799) applied new succession list 99 98 97
Sep 24 2023 00:02:40 GMT+0700: INFO (clustering): (clustering.c:5801) applied cluster size 3
Sep 24 2023 00:02:40 GMT+0700: INFO (exchange): (exchange.c:2345) data exchange started with cluster key 62038cac02c5
Sep 24 2023 00:02:40 GMT+0700: INFO (exchange): (exchange.c:2728) exchange-compatibility-id: self 11 cluster-min 0 -> 11 cluster-max 0 -> 11
Sep 24 2023 00:02:40 GMT+0700: INFO (exchange): (exchange.c:3296) received commit command from principal node 99
Sep 24 2023 00:02:40 GMT+0700: INFO (exchange): (exchange.c:3259) data exchange completed with cluster key 62038cac02c5
Sep 24 2023 00:02:40 GMT+0700: INFO (partition): (partition_balance.c:1008) {dev} replication factor is 3
Sep 24 2023 00:02:40 GMT+0700: INFO (partition): (partition_balance.c:981) {dev} rebalanced: expected-migrations (5436,4096,2680) fresh-partitions 0
Sep 24 2023 00:02:40 GMT+0700: INFO (partition): (partition_balance.c:1008) {test1} replication factor is 2
Sep 24 2023 00:02:40 GMT+0700: INFO (partition): (partition_balance.c:981) {test1} rebalanced: expected-migrations (2673,2673,2680) fresh-partitions 0
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:168) NODE-ID 98 CLUSTER-SIZE 3 CLUSTER-NAME inmem-cluster
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:242)    cluster-clock: skew-ms 0
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:263)    system: total-cpu-pct 356 user-cpu-pct 133 kernel-cpu-pct 223 free-mem-kbytes 21969772 free-mem-pct 87 thp-mem-kbytes 40960
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:285)    process: cpu-pct 352 threads (17,71,52,52) heap-kbytes (3350380,3352860,3513344) heap-efficiency-pct 99.9
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:295)    in-progress: info-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0 long-queries 0
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:319)    fds: proto (75,76,1) heartbeat (2,4,2) fabric (56,56,0)
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:328)    heartbeat-received: self 3 foreign 131
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:354)    fabric-bytes-per-second: bulk (3372840,9201772) ctrl (44104,44618) meta (12,4) rw (38,38)
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:413) {dev} objects: all 0 master 0 prole 0 non-replica 0
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:470) {dev} migrations: remaining (2680,2521,2680) active (0,3,0) complete-pct 45.44
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:504) {dev} memory-usage: total-bytes 0 index-bytes 0 set-index-bytes 0 sindex-bytes 0 used-pct 0.00
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:586) {dev} device-usage: used-bytes 0 avail-pct 99 cache-read-pct 0.00
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:413) {test1} objects: all 647444 master 324189 prole 323255 non-replica 0
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:470) {test1} migrations: remaining (2293,1333,2680) active (8,0,0) complete-pct 32.17
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:504) {test1} memory-usage: total-bytes 226018560 index-bytes 41436416 set-index-bytes 0 sindex-bytes 184582144 used-pct 1.75
Sep 24 2023 00:02:49 GMT+0700: INFO (info): (ticker.c:586) {test1} device-usage: used-bytes 164717264 avail-pct 99 cache-read-pct 0.00
Sep 24 2023 00:02:49 GMT+0700: INFO (hb): (hb.c:4376) (repeated:2) found redundant connections to same node (98) - choosing at random
Sep 24 2023 00:02:58 GMT+0700: INFO (drv_ssd): (drv_ssd.c:2060) {test1} /opt/aerospike-store/test1.dat: used-bytes 164717264 free-wblocks 367309 write-q 0 write (0,0.0) defrag-q 0 defrag-read (0,0.0) defrag-write (0,0.0)
Sep 24 2023 00:02:58 GMT+0700: INFO (drv_ssd): (drv_ssd.c:2060) {dev} /data/dev.dat: used-bytes 0 free-wblocks 16320 write-q 0 write (0,0.0) defrag-q 0 defrag-read (0,0.0) defrag-write (0,0.0)
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:168) NODE-ID 98 CLUSTER-SIZE 3 CLUSTER-NAME inmem-cluster
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:242)    cluster-clock: skew-ms 0
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:263)    system: total-cpu-pct 538 user-cpu-pct 185 kernel-cpu-pct 353 free-mem-kbytes 21971848 free-mem-pct 87 thp-mem-kbytes 40960
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:285)    process: cpu-pct 505 threads (14,71,52,52) heap-kbytes (3351988,3354800,3515392) heap-efficiency-pct 99.9
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:295)    in-progress: info-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0 long-queries 0
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:319)    fds: proto (76,77,1) heartbeat (2,4,2) fabric (56,56,0)
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:328)    heartbeat-received: self 3 foreign 264
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:354)    fabric-bytes-per-second: bulk (7139956,7825239) ctrl (32529,53189) meta (0,0) rw (0,0)
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:413) {dev} objects: all 0 master 0 prole 0 non-replica 0
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:470) {dev} migrations: remaining (2647,0,2680) active (2,0,0) complete-pct 72.23
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:504) {dev} memory-usage: total-bytes 0 index-bytes 0 set-index-bytes 0 sindex-bytes 0 used-pct 0.00
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:586) {dev} device-usage: used-bytes 0 avail-pct 99 cache-read-pct 0.00
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:413) {test1} objects: all 647444 master 324189 prole 323255 non-replica 0
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:470) {test1} migrations: remaining (1369,260,2680) active (6,16,0) complete-pct 69.53
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:504) {test1} memory-usage: total-bytes 226018560 index-bytes 41436416 set-index-bytes 0 sindex-bytes 184582144 used-pct 1.75
Sep 24 2023 00:02:59 GMT+0700: INFO (info): (ticker.c:586) {test1} device-usage: used-bytes 164717264 avail-pct 99 cache-read-pct 0.00
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:168) NODE-ID 98 CLUSTER-SIZE 3 CLUSTER-NAME inmem-cluster
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:242)    cluster-clock: skew-ms 0
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:263)    system: total-cpu-pct 328 user-cpu-pct 139 kernel-cpu-pct 189 free-mem-kbytes 21973944 free-mem-pct 87 thp-mem-kbytes 40960
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:285)    process: cpu-pct 353 threads (14,71,52,52) heap-kbytes (3350785,3354416,3519488) heap-efficiency-pct 99.9
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:295)    in-progress: info-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0 long-queries 0
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:319)    fds: proto (76,77,1) heartbeat (2,4,2) fabric (56,56,0)
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:328)    heartbeat-received: self 3 foreign 397
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:354)    fabric-bytes-per-second: bulk (3188002,2133695) ctrl (46000,25319) meta (0,0) rw (0,0)
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:413) {dev} objects: all 0 master 0 prole 0 non-replica 0
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:477) {dev} migrations: complete
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:504) {dev} memory-usage: total-bytes 0 index-bytes 0 set-index-bytes 0 sindex-bytes 0 used-pct 0.00
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:586) {dev} device-usage: used-bytes 0 avail-pct 99 cache-read-pct 0.00
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:413) {test1} objects: all 647444 master 324189 prole 323255 non-replica 0
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:470) {test1} migrations: remaining (893,0,1786) active (8,0,0) complete-pct 83.30
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:504) {test1} memory-usage: total-bytes 226018560 index-bytes 41436416 set-index-bytes 0 sindex-bytes 184582144 used-pct 1.75
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (ticker.c:586) {test1} device-usage: used-bytes 164717264 avail-pct 99 cache-read-pct 0.00
Sep 24 2023 00:03:09 GMT+0700: INFO (info): (thr_info.c:2246) Aerospike Telemetry Agent: Aerospike anonymous data collection is ACTIVE. For further information, see http://aerospike.com/aerospike-telemetry
Sep 24 2023 00:03:10 GMT+0700: INFO (info): (thr_info.c:2179) latencies command: unrecognized histogram: {dev}-query
Sep 24 2023 00:03:10 GMT+0700: INFO (info): (thr_info.c:2179) latencies command: unrecognized histogram: {test1}-query
Sep 24 2023 00:03:18 GMT+0700: INFO (drv_ssd): (drv_ssd.c:2060) {test1} /opt/aerospike-store/test1.dat: used-bytes 164717264 free-wblocks 367309 write-q 0 write (0,0.0) defrag-q 0 defrag-read (0,0.0) defrag-write (0,0.0)
Sep 24 2023 00:03:18 GMT+0700: INFO (drv_ssd): (drv_ssd.c:2060) {dev} /data/dev.dat: used-bytes 0 free-wblocks 16320 write-q 0 write (0,0.0) defrag-q 0 defrag-read (0,0.0) defrag-write (0,0.0)
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:168) NODE-ID 98 CLUSTER-SIZE 3 CLUSTER-NAME inmem-cluster
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:242)    cluster-clock: skew-ms 0
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:263)    system: total-cpu-pct 428 user-cpu-pct 158 kernel-cpu-pct 270 free-mem-kbytes 21974848 free-mem-pct 87 thp-mem-kbytes 40960
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:285)    process: cpu-pct 428 threads (16,71,52,52) heap-kbytes (3350720,3354300,3519488) heap-efficiency-pct 99.9
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:295)    in-progress: info-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0 long-queries 0
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:319)    fds: proto (76,79,3) heartbeat (2,4,2) fabric (56,56,0)
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:328)    heartbeat-received: self 3 foreign 531
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:354)    fabric-bytes-per-second: bulk (5572363,626159) ctrl (15223,7045) meta (0,0) rw (0,0)
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:413) {dev} objects: all 0 master 0 prole 0 non-replica 0
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:477) {dev} migrations: complete
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:504) {dev} memory-usage: total-bytes 0 index-bytes 0 set-index-bytes 0 sindex-bytes 0 used-pct 0.00
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:586) {dev} device-usage: used-bytes 0 avail-pct 99 cache-read-pct 0.00
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:413) {test1} objects: all 647444 master 324189 prole 323255 non-replica 0
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:470) {test1} migrations: remaining (82,0,165) active (7,0,1) complete-pct 98.47
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:504) {test1} memory-usage: total-bytes 226018560 index-bytes 41436416 set-index-bytes 0 sindex-bytes 184582144 used-pct 1.75
Sep 24 2023 00:03:19 GMT+0700: INFO (info): (ticker.c:586) {test1} device-usage: used-bytes 164717264 avail-pct 99 cache-read-pct 0.00
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:168) NODE-ID 98 CLUSTER-SIZE 3 CLUSTER-NAME inmem-cluster
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:242)    cluster-clock: skew-ms 0
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:263)    system: total-cpu-pct 47 user-cpu-pct 20 kernel-cpu-pct 27 free-mem-kbytes 21974892 free-mem-pct 87 thp-mem-kbytes 40960
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:285)    process: cpu-pct 44 threads (10,71,52,52) heap-kbytes (3339699,3343404,3519488) heap-efficiency-pct 99.9
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:295)    in-progress: info-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0 long-queries 0
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:319)    fds: proto (76,79,3) heartbeat (2,4,2) fabric (56,56,0)
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:328)    heartbeat-received: self 3 foreign 664
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:354)    fabric-bytes-per-second: bulk (579634,65930) ctrl (1510,705) meta (0,0) rw (0,0)
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:413) {dev} objects: all 0 master 0 prole 0 non-replica 0
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:477) {dev} migrations: complete
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:504) {dev} memory-usage: total-bytes 0 index-bytes 0 set-index-bytes 0 sindex-bytes 0 used-pct 0.00
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:586) {dev} device-usage: used-bytes 0 avail-pct 99 cache-read-pct 0.00
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:413) {test1} objects: all 647444 master 324189 prole 323255 non-replica 0
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:477) {test1} migrations: complete
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:504) {test1} memory-usage: total-bytes 226018560 index-bytes 41436416 set-index-bytes 0 sindex-bytes 184582144 used-pct 1.75
Sep 24 2023 00:03:29 GMT+0700: INFO (info): (ticker.c:586) {test1} device-usage: used-bytes 164717264 avail-pct 99 cache-read-pct 0.00
Sep 24 2023 00:03:38 GMT+0700: INFO (drv_ssd): (drv_ssd.c:2060) {test1} /opt/aerospike-store/test1.dat: used-bytes 164717264 free-wblocks 367309 write-q 0 write (0,0.0) defrag-q 0 defrag-read (0,0.0) defrag-write (0,0.0)
Sep 24 2023 00:03:39 GMT+0700: INFO (drv_ssd): (drv_ssd.c:2060) {dev} /data/dev.dat: used-bytes 0 free-wblocks 16320 write-q 0 write (0,0.0) defrag-q 0 defrag-read (0,0.0) defrag-write (0,0.0)
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:168) NODE-ID 98 CLUSTER-SIZE 3 CLUSTER-NAME inmem-cluster
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:242)    cluster-clock: skew-ms 0
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:263)    system: total-cpu-pct 14 user-cpu-pct 5 kernel-cpu-pct 9 free-mem-kbytes 21972596 free-mem-pct 87 thp-mem-kbytes 40960
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:285)    process: cpu-pct 4 threads (10,71,52,52) heap-kbytes (3339711,3343404,3519488) heap-efficiency-pct 99.9
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:295)    in-progress: info-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0 long-queries 0
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:319)    fds: proto (76,79,3) heartbeat (2,4,2) fabric (56,56,0)
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:328)    heartbeat-received: self 3 foreign 797
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:354)    fabric-bytes-per-second: bulk (0,0) ctrl (0,0) meta (0,0) rw (0,0)
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:413) {dev} objects: all 0 master 0 prole 0 non-replica 0
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:477) {dev} migrations: complete
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:504) {dev} memory-usage: total-bytes 0 index-bytes 0 set-index-bytes 0 sindex-bytes 0 used-pct 0.00
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:586) {dev} device-usage: used-bytes 0 avail-pct 99 cache-read-pct 0.00
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:413) {test1} objects: all 647444 master 324189 prole 323255 non-replica 0
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:477) {test1} migrations: complete
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:504) {test1} memory-usage: total-bytes 226018560 index-bytes 41436416 set-index-bytes 0 sindex-bytes 184582144 used-pct 1.75
Sep 24 2023 00:03:39 GMT+0700: INFO (info): (ticker.c:586) {test1} device-usage: used-bytes 164717264 avail-pct 99 cache-read-pct 0.00
Sep 24 2023 00:03:49 GMT+0700: INFO (info): (ticker.c:168) NODE-ID 98 CLUSTER-SIZE 3 CLUSTER-NAME inmem-cluster

I noticed that my queries fail when these log lines appear:

Sep 24 2023 00:02:40 GMT+0700: INFO (partition): (partition_balance.c:1008) {dev} replication factor is 3
Sep 24 2023 00:02:40 GMT+0700: INFO (partition): (partition_balance.c:981) {dev} rebalanced: expected-migrations (5436,4096,2680) fresh-partitions 0
Sep 24 2023 00:02:40 GMT+0700: INFO (partition): (partition_balance.c:1008) {test1} replication factor is 2
Sep 24 2023 00:02:40 GMT+0700: INFO (partition): (partition_balance.c:981) {test1} rebalanced: expected-migrations (2673,2673,2680) fresh-partitions 0

My node configuration:

service {
        cluster-name inmem-cluster
        node-id 98
        service-threads 20 # Should be 5 times number of vCPUs for 4.7+
        proto-fd-max 15000
        enable-health-check true
        log-local-time true
        migrate-max-num-incoming 16
        migrate-threads 8
}

logging {
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}
network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode mesh
                address 10.1.24.98
                port 3002
                mesh-seed-address-port 10.1.24.97  3002
                mesh-seed-address-port 10.1.24.98  3002
                interval 150
                timeout 10
        }

        fabric {
                port 3001
                channel-bulk-fds 4
                channel-bulk-recv-threads 8
        }

        info {
                port 3003
        }
}

namespace dev {
        default-ttl 1D
        conflict-resolution-policy last-update-time
        replication-factor 3
        memory-size 1G
        nsup-period 300 # interval (in seconds) at which the main expiration/eviction thread runs
        nsup-threads 8
        high-water-disk-pct 1
        high-water-memory-pct 1
        stop-writes-sys-memory-pct 90
        storage-engine device {
                file /data/dev.dat
                filesize 2G
                write-block-size 128K
                max-write-cache 2048M
                data-in-memory false
                defrag-lwm-pct 1
                defrag-queue-min 10
        }
}

namespace test1 {
        default-ttl 1D
        conflict-resolution-policy last-update-time
        replication-factor 2
        memory-size 12G
        nsup-period 300 # interval (in seconds) at which the main expiration/eviction thread runs
        nsup-threads 8
        high-water-disk-pct 1
        high-water-memory-pct 1
        stop-writes-sys-memory-pct 90
        storage-engine device {
                file /opt/aerospike-store/test1.dat
                filesize 45G
                write-block-size 128K
                max-write-cache 2048M
                data-in-memory false
                defrag-lwm-pct 1
        }
}

I’m using a secondary index query to retrieve data, and I’ve read the documentation about SI queries. It seems quite complex to obtain results during migrations or rebalances, and I’m unsure whether the client can handle this.

When “Max retries exceeded” occurs on a query without sub-exceptions, it means “partition unavailable” errors occurred on every retry. The next release of the Java client will throw an exception with error code PARTITION_UNAVAILABLE when this occurs.
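
In the meantime, application code can detect this case from the exception’s result code; a hedged sketch, reusing the query objects from the earlier snippet (ResultCode.MAX_RETRIES_EXCEEDED is the existing constant behind the -11 in the message above):

// requires: com.aerospike.client.AerospikeException, com.aerospike.client.ResultCode
try (RecordSet recordSet = aerospikeClient.query(queryPolicy, statement)) {
    while (recordSet.next()) {
        recordList.add(recordSet.getRecord().bins);
    }
} catch (AerospikeException ae) {
    // -11 (MAX_RETRIES_EXCEEDED) with no sub-exception implies each retry hit
    // "partition unavailable" while the partition map was still settling.
    if (ae.getResultCode() == ResultCode.MAX_RETRIES_EXCEEDED) {
        // back off here and re-run the query once client tending catches up
    }
    throw ae;
}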

As a test, set “queryPolicy.sleepBetweenRetries = 2000;” (2 seconds) to see if the cluster recovers by this time period.
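
For reference, a minimal sketch of that test; the maxRetries and totalTimeout values are illustrative additions (both are standard Policy fields), not prescribed values:

QueryPolicy queryPolicy = new QueryPolicy();
queryPolicy.sleepBetweenRetries = 2000; // ms; leaves at least one tend cycle (default interval 1000 ms) between attempts
queryPolicy.maxRetries = 5;             // illustrative: more attempts while migrations settle
queryPolicy.totalTimeout = 30000;       // illustrative: keep the total budget above maxRetries * sleepBetweenRetries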

My team is facing this issue right now in the same scenario: secondary index queries from Spark streaming services. Every time any node in the cluster restarts (the service or the whole node) and migrations begin, some services randomly error and fail.

I will post the solution here once Aerospike support provides a fix.

When using Spark, make sure to leverage the Aerospike Connect for Spark product. When a node is restarted and partitions move around during the subsequent migrations, the client may hit a node that has just given up ownership of a partition. Set queryPolicy.sleepBetweenRetries high enough to allow the client to tend all the nodes, learn which node now owns the moved partition, and retry against it.

Having said that, if the Spark executors directly instantiate Aerospike Java clients, too many of them may be created simultaneously. This can get in the way of proper tending of the cluster nodes, which causes errors during migrations because the clients then have a much-delayed view of changes in the cluster.
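
One common mitigation, sketched below as a general JVM pattern rather than a documented connector API, is to share a single AerospikeClient per executor JVM so only one tend thread tracks the cluster:

import com.aerospike.client.AerospikeClient;

public final class SharedAerospikeClient {
    private static volatile AerospikeClient client;

    private SharedAerospikeClient() {}

    // Lazily create one client per JVM (double-checked locking).
    // The host/port arguments are placeholders for your seed node.
    public static AerospikeClient get(String host, int port) {
        if (client == null) {
            synchronized (SharedAerospikeClient.class) {
                if (client == null) {
                    client = new AerospikeClient(host, port);
                }
            }
        }
        return client;
    }
}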