Hi all,
After updating the server from 3.7.3 to 3.8.1, we began to receive “Error Code 9: Timeout” during read and write operations. The errors are not frequent, but they are regular: a few every ~5 minutes, and they are not uniformly distributed (they come in bursts).
The only related entries we found in the AS log are infrequent lines like:
trans_in_progress: wr 1 prox 0 wait 0 ::: q 0 ::: iq 0 ::: dq 4547 : fds - proto (3232, 91977, 88745) : hb (0, 0, 0) : fab (72, 1033, 961)
But we do not know whether these entries appeared before the upgrade.
AS is installed on CentOS 6.7. We use the C# client 3.2.2. Load: ~45k reads and ~35k writes across 5 servers. The error rate increases dramatically if we raise the load by a couple of thousand operations.
What do we need to do to solve this problem?
Experimentally, we found that the problem begins with version 3.7.5. Maybe it is related to AER-4754? Unfortunately, we could not find any details about that improvement.
“Error Code 9: Timeout” is a timeout issued by the server rather than the client. In the past, the client would usually raise a timeout of its own before receiving the message from the server; however, recent changes may have increased the likelihood that the client receives the server’s timeout first.
Taking client-side timeouts into account, has the total number of timeouts increased?
Could you also share your aerospike.conf?
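One way to separate the two cases is to set an explicit client-side timeout and catch them independently. A minimal sketch against the 3.2.x C# client API (where Policy.timeout is the total transaction timeout in milliseconds); the host, namespace, set, and key names here are placeholders:

using System;
using Aerospike.Client;

class TimeoutCheck
{
    static void Main()
    {
        // Hypothetical seed node; replace with one of your cluster addresses.
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Explicit client-side deadline so that client timeouts can be
        // told apart from a server-issued result code 9.
        WritePolicy policy = new WritePolicy();
        policy.timeout = 50;   // total transaction timeout, in ms
        policy.maxRetries = 0; // fail fast instead of masking timeouts

        Key key = new Key("NS", "demo", "key1");
        try
        {
            client.Put(policy, key, new Bin("bin1", "value"));
            Record record = client.Get(policy, key);
        }
        catch (AerospikeException.Timeout)
        {
            // The client's own deadline fired before a server response.
            Console.WriteLine("client-side timeout");
        }
        catch (AerospikeException ae)
        {
            // Result 9 (ResultCode.TIMEOUT) here is the server's timeout.
            Console.WriteLine("server result code: " + ae.Result);
        }
        finally
        {
            client.Close();
        }
    }
}

Counting the two catch branches separately over a few minutes of load should show whether the overall timeout rate actually changed, or only shifted from client-raised to server-raised.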
In 3.7.3 we had no timeouts at all.
Our config (sorry, but I have hidden the ports and addresses):
service {
    user aerospike
    group aerospike
    paxos-single-replica-limit 1
    pidfile /var/run/aerospike/asd.pid
    service-threads 12
    transaction-queues 12
    transaction-threads-per-queue 4
    proto-fd-max 65535
}

logging {
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}

network {
    service {
        address any
        port XXXX
        access-address XXX.XXX.XXX.XXX
        reuse-address
    }

    heartbeat {
        mode multicast
        address XXX.XXX.XXX.XXX
        port XXXX
        interface-address XXX.XXX.XXX.XXX
        interval 150
        timeout 10
    }

    fabric {
        port XXXX
    }

    info {
        port XXXX
    }
}
And we have 14 namespaces, each like this:
namespace NS {
    replication-factor 1
    high-water-memory-pct 99
    high-water-disk-pct 99
    stop-writes-pct 99
    memory-size 58G
    default-ttl 0
    storage-engine device {
        file /opt/aerospike/data/ns.data
        file /opt/aerospike/data2/ns.data
        filesize 90G
        data-in-memory true
        defrag-lwm-pct 50
        defrag-startup-minimum 1
    }
}
These have been raised well above safe values. Could you provide the output of:
asadm -e "info"
In the first log line you provided, dq 4547 says that NSUP is scheduling deletes; this is the result of either expiration or eviction. Eviction would be interesting, since it would indicate that the disk, memory, or both are nearly full relative to your high-water-{memory,disk} settings.
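For comparison, the shipped defaults are far more conservative; at 99% a node has almost no headroom left before eviction or stop-writes kicks in. A sketch of the documented 3.x defaults:

namespace NS {
    ...
    high-water-memory-pct 60   # default: begin evicting at 60% memory used
    high-water-disk-pct 50     # default: begin evicting at 50% disk used
    stop-writes-pct 90         # default: refuse writes at 90% memory used
    ...
}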
Further questions:
- What is the transaction timeout policy?
- Were there ongoing migrations happening while these timeouts were present? If so, and if migrations have since completed, are the timeouts still happening? (See the command after this list.)
- Are all 14 namespaces replication factor 1?
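If the problem reoccurs, a quick way to check for in-flight migrations is to look at the migration statistics on a node; assuming the 3.x statistic names, something like:

asinfo -v "statistics" | grep -E "migrate_progress_(send|recv)"

Non-zero values for either counter mean partition migrations are still in progress.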
These have been raised well above safe values.
Yes, and that is how they have been left…
What is the transaction timeout policy?
Default.
Were there ongoing migrations happening while these timeouts were present? If so, and if migrations have since completed, are the timeouts still happening?
Unfortunately, I can’t say. I have already downgraded the server, because we need it to run smoothly.
Are all 14 namespaces replication factor 1?
One namespace has replication factor 2; the rest are factor 1.
asadm -e "info"
The log is quite big. Please see the file shared via DropMeFiles.
Is there any news about the problem?
We haven’t yet observed this problem ourselves, and yours is the only report of it we have. You mentioned that downgrading resolved your issue; at this time I cannot offer a better solution. If you have a support contract, I recommend opening a case at https://support.aerospike.com/; they will have the resources to dive much deeper into the problem you are experiencing.