Hi,
We are running Aerospike server version 4.5.0.9 (some servers are on 4.5.3.2 and 4.5.0.5) on an 8-node bare-metal cluster. Our workload is primarily batch reads, plus a few other operations that run periodically (writes, scans, secondary index queries, operate calls, etc.). The current batch read throughput is approximately 1M TPS.
We often run into timeouts: reads and writes start to fail from the Java and Go clients, and the connection count on the servers shoots up (30-70k). We have a strict latency requirement of < 30 ms for batch queries (usually 5-6 keys per query, across different sets).
What options do I have to improve performance? Can I expect a major performance improvement from upgrading to a newer server version (say 6.x or the latest 7.x) if nothing more can be done on the current version? If so, what are the steps and risks involved?
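For a rough sense of the per-node load these numbers imply, here is a back-of-the-envelope sketch. It assumes that the 1M TPS figure counts client-side batch calls rather than individual records, that keys are spread uniformly across the 8 nodes, and that each batch is split into one sub-request per node that owns at least one of its keys. `BatchFanout` and its methods are hypothetical helpers written for this estimate and are not part of the Aerospike client.

```java
// Back-of-the-envelope fan-out estimate. All inputs are assumptions based on
// the figures quoted in this post, not measurements from the cluster.
final class BatchFanout {

    // Expected number of distinct nodes owning at least one of `keys` keys,
    // assuming keys land uniformly and independently on `nodes` nodes.
    static double expectedNodesTouched(int keys, int nodes) {
        double missProbability = Math.pow(1.0 - 1.0 / nodes, keys);
        return nodes * (1.0 - missProbability);
    }

    // Sub-requests per second arriving at a single node, assuming the batch
    // rate counts client-side batch calls and load is evenly balanced.
    static double subRequestsPerNode(double batchesPerSecond, int keys, int nodes) {
        return batchesPerSecond * expectedNodesTouched(keys, nodes) / nodes;
    }

    public static void main(String[] args) {
        int nodes = 8;                       // cluster size from the post
        double batchesPerSecond = 1_000_000; // assumed to be batch calls/s
        for (int keys = 5; keys <= 6; keys++) {
            System.out.printf(
                "keys=%d nodesTouched=%.2f subRequestsPerNodePerSec=%.0f%n",
                keys,
                expectedNodesTouched(keys, nodes),
                subRequestsPerNode(batchesPerSecond, keys, nodes));
        }
    }
}
```

If the 1M figure actually counts records rather than batch calls, the per-node sub-request rate would be lower by roughly the number of keys per batch, so the result is only an order-of-magnitude guide.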
Java client (v4.4.18):
ClientPolicy clientPolicy = new ClientPolicy();
clientPolicy.eventLoops = new NioEventLoops(eventPolicy, eventLoopGroup);
clientPolicy.maxConnsPerNode = 300;
BatchPolicy batchPolicy = new BatchPolicy();
batchPolicy.socketTimeout = 30; // 30 ms
batchPolicy.totalTimeout = 60; // 60 ms
batchPolicy.maxRetries = 0; // no retries
batchPolicy.timeoutDelay = 60; // 60 ms: after a socket read timeout, try to drain and recover the socket in the background instead of closing it
batchPolicy.replica = Replica.MASTER_PROLES; // spread load between master and replicas
// long set with 1-2M records and a secondary index
QueryPolicy queryPolicy = new QueryPolicy();
queryPolicy.maxConcurrentNodes = 1;
queryPolicy.recordQueueSize = 10000;
queryPolicy.socketTimeout = 370000; // 370 s
clientPolicy.queryPolicyDefault = queryPolicy;
clientPolicy.batchPolicyDefault = batchPolicy;
Server config:
service {
    user root
    group root
    paxos-single-replica-limit 1
    pidfile /var/run/aerospike/asd.pid
    transaction-threads-per-queue 4
    proto-fd-max 100000
    auto-pin cpu
}
network {
    service {
        address bond0
        # address enp94s0f1
        port 3000
        access-address bond0
        # alternate-access-address enp94s0f1
    }
    heartbeat {
        mode mesh
        port 3002
        mesh-seed-address-port 10.x.x.x9 3002
        mesh-seed-address-port 10.x.x.x0 3002
        mesh-seed-address-port 10.x.x.x1 3002
        mesh-seed-address-port 10.x.x.x2 3002
        ....
        interval 150
        timeout 10
    }
}
namespace store {
    replication-factor 2
    memory-size 576G
    default-ttl 0
    high-water-disk-pct 70
    high-water-memory-pct 95
    stop-writes-pct 98
    partition-tree-sprigs 8192
    storage-engine device {
        file /opt/aerospike/data/store.data
        filesize 1000G
        data-in-memory true
    }
}