Hi all,
After updating the server from 3.7.3 to 3.8.1, we began to receive “Error Code 9: Timeout” during read and write operations. The errors are not frequent, but they are regular: a few every ~5 minutes, and they are not uniformly distributed (they come in bursts).
The only related entries we found in the AS log are infrequent lines like:
trans_in_progress: wr 1 prox 0 wait 0 ::: q 0 ::: iq 0 ::: dq 4547 : fds - proto (3232, 91977, 88745) : hb (0, 0, 0) : fab (72, 1033, 961)
But we do not know whether these entries appeared before the upgrade.
AS is installed on CentOS 6.7. We use the C# client 3.2.2. Load: ~45k reads and ~35k writes across 5 servers. The error rate increases dramatically if we raise the load by a couple of thousand operations.
What do we need to do to solve this problem?
Experimentally, we found that the problem begins with version 3.7.5. Maybe it is related to AER-4754? Unfortunately, we could not find any details about that improvement.
“Error Code 9: Timeout” is a timeout issued by the server rather than the client. In the past, the client would usually raise a timeout of its own before receiving the message from the server; however, recent changes may have increased the likelihood that the client receives the server’s timeout first.
Taking client-side timeouts into account, has the total number of timeouts increased?
Could you also share your aerospike.conf?
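One way to separate the two cases is to set an explicit client-side timeout and catch them independently. A minimal sketch against the 3.2.x C# client API (where Policy.timeout is the total transaction timeout in milliseconds); the host, namespace, set, and key names here are placeholders:

using System;
using Aerospike.Client;

class TimeoutCheck
{
    static void Main()
    {
        // Hypothetical seed node; replace with one of your cluster addresses.
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Explicit client-side deadline so that client timeouts can be
        // told apart from a server-issued result code 9.
        WritePolicy policy = new WritePolicy();
        policy.timeout = 50;   // total transaction timeout, in ms
        policy.maxRetries = 0; // fail fast instead of masking timeouts

        Key key = new Key("NS", "demo", "key1");
        try
        {
            client.Put(policy, key, new Bin("bin1", "value"));
            Record record = client.Get(policy, key);
        }
        catch (AerospikeException.Timeout)
        {
            // The client's own deadline fired before a server response.
            Console.WriteLine("client-side timeout");
        }
        catch (AerospikeException ae)
        {
            // Result 9 (ResultCode.TIMEOUT) here is the server's timeout.
            Console.WriteLine("server result code: " + ae.Result);
        }
        finally
        {
            client.Close();
        }
    }
}

Counting the two catch branches separately over a few minutes of load should show whether the overall timeout rate actually changed, or only shifted from client-raised to server-raised.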
In 3.7.3 we had no timeouts at all.
Our config (sorry, but I have hidden the ports and addresses):
service {
    user aerospike
    group aerospike
    paxos-single-replica-limit 1
    pidfile /var/run/aerospike/asd.pid
    service-threads 12
    transaction-queues 12
    transaction-threads-per-queue 4
    proto-fd-max 65535
}

logging {
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}

network {
    service {
        address any
        port XXXX
        access-address XXX.XXX.XXX.XXX
        reuse-address
    }

    heartbeat {
        mode multicast
        address XXX.XXX.XXX.XXX
        port XXXX
        interface-address XXX.XXX.XXX.XXX
        interval 150
        timeout 10
    }

    fabric {
        port XXXX
    }

    info {
        port XXXX
    }
}
And we have 14 namespaces, each like this:
namespace NS {
    replication-factor 1
    high-water-memory-pct 99
    high-water-disk-pct 99
    stop-writes-pct 99
    memory-size 58G
    default-ttl 0
    storage-engine device {
        file /opt/aerospike/data/ns.data
        file /opt/aerospike/data2/ns.data
        filesize 90G
        data-in-memory true
        defrag-lwm-pct 50
        defrag-startup-minimum 1
    }
}
These have been raised well above safe values. Could you provide the output of:
asadm -e "info"
In the first log line you provided, dq 4547 says that NSUP is scheduling deletes; this is the result of either expiration or eviction. Eviction would be interesting, since it would indicate that the disk, memory, or both are nearly full relative to your high-water-{memory,disk} settings.
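For comparison, the shipped defaults are far more conservative; at 99% a node has almost no headroom left before eviction or stop-writes kicks in. A sketch of the documented 3.x defaults:

namespace NS {
    ...
    high-water-memory-pct 60   # default: begin evicting at 60% memory used
    high-water-disk-pct 50     # default: begin evicting at 50% disk used
    stop-writes-pct 90         # default: refuse writes at 90% memory used
    ...
}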
Further questions:
- What is the transaction timeout policy?
- Were there ongoing migrations happening while these timeouts were present? If so, and if migrations have since completed, are the timeouts still happening? (See the command after this list.)
- Are all 14 namespaces replication factor 1?
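If the problem reoccurs, a quick way to check for in-flight migrations is to look at the migration statistics on a node; assuming the 3.x statistic names, something like:

asinfo -v "statistics" | grep -E "migrate_progress_(send|recv)"

Non-zero values for either counter mean partition migrations are still in progress.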
These have been raised well above safe values.
Yes, and that is how they have been left…
What is the transaction timeout policy?
Default.
Were there ongoing migrations happening while these timeouts were present? If so, and if migrations have since completed, are the timeouts still happening?
Unfortunately, I can’t say. I have already downgraded the server, because we need it to run smoothly.
Are all 14 namespaces replication factor 1?
One namespace has replication factor 2; the rest are factor 1.
asadm -e "info"
The log is quite big. Please see the file shared via DropMeFiles.
Is there any news about the problem?
We haven’t yet observed this problem ourselves, and yours is the only report of it we have. You mentioned that downgrading resolved your issue; at this time I cannot offer a better solution. If you have a support contract, I recommend opening a case at https://support.aerospike.com/; they will have the resources to dive much deeper into the problem you are experiencing.