Error Code 9: Timeout after updating from 3.7.3 to 3.8.1


#1

Hi all,

After updating the server from 3.7.3 to 3.8.1, we began to receive “Error Code 9: Timeout” during read and write operations. The errors are infrequent but regular: a few every ~5 minutes, and they arrive in bursts rather than uniformly.

The only thing we found in the AS log is an occasional line like:

trans_in_progress: wr 1 prox 0 wait 0 ::: q 0 ::: iq 0 ::: dq 4547 : fds - proto (3232, 91977, 88745) : hb (0, 0, 0) : fab (72, 1033, 961)

But we do not know whether such lines also appeared before the update.
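One way to check whether such lines predate the update would be to pull the counters out of older and newer logs and compare them. The following is a minimal Python sketch, assuming the line layout shown above; the function name and regular expressions are illustrative and not part of any Aerospike tool.

```python
import re

# Sample line copied from the post above.
SAMPLE = (
    "trans_in_progress: wr 1 prox 0 wait 0 ::: q 0 ::: iq 0 ::: dq 4547 "
    ": fds - proto (3232, 91977, 88745) : hb (0, 0, 0) : fab (72, 1033, 961)"
)

# Single-valued counters such as "dq 4547".
COUNTER_RE = re.compile(r"\b(wr|prox|wait|q|iq|dq) (\d+)")
# Three-valued groups such as "proto (3232, 91977, 88745)".
TRIPLE_RE = re.compile(r"\b(proto|hb|fab) \((\d+), *(\d+), *(\d+)\)")


def parse_trans_in_progress(line):
    """Return (counters, triples) parsed from one trans_in_progress line."""
    counters = {name: int(value) for name, value in COUNTER_RE.findall(line)}
    triples = {
        name: (int(a), int(b), int(c))
        for name, a, b, c in TRIPLE_RE.findall(line)
    }
    return counters, triples


counters, triples = parse_trans_in_progress(SAMPLE)
print(counters["dq"])      # value of the dq counter in the sample line
print(triples["proto"])    # the three numbers reported for proto
```

In the sample line, the first number of each triple equals the second minus the third (91977 - 88745 = 3232, 1033 - 961 = 72). That would be consistent with open, opened, and closed connection counts, but this reading is an assumption.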

AS is installed on CentOS 6.7. We use C# client 3.2.2. Load: ~45k reads and 35k writes across 5 servers. The errors increase dramatically if we raise the load by a couple of thousand.

What do we need to do to solve the problem?


#2

Experimentally, we found that the problem begins with version 3.7.5. Could it be related to AER-4754? Unfortunately, we could not find any details about this improvement.


#3

“Error Code 9: Timeout” is a timeout issued by the server rather than by the client. In the past, the client would raise its own timeout before receiving this message from the server; however, recent changes may have increased the likelihood that the client receives the server’s timeout first.

Taking client-side timeouts into account, has the total number of timeouts increased?

Could you also share your aerospike.conf?


#4

In 3.7.3 we had no timeouts at all.

Our config (sorry, but I have hidden the ports and addresses):

service {
        user aerospike
        group aerospike
        paxos-single-replica-limit 1
        pidfile /var/run/aerospike/asd.pid
        service-threads 12
        transaction-queues 12
        transaction-threads-per-queue 4
        proto-fd-max 65535
}

logging {
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address any
                port XXXX
                access-address XXX.XXX.XXX.XXX
                reuse-address
        }

        heartbeat {
                mode multicast
                address XXX.XXX.XXX.XXX
                port XXXX
                interface-address XXX.XXX.XXX.XXX

                interval 150
                timeout 10
        }

        fabric {
                port XXXX
        }

        info {
                port XXXX
        }
}

And we have 14 namespaces, like:

namespace NS {
       replication-factor 1
       high-water-memory-pct 99
       high-water-disk-pct 99
       stop-writes-pct 99
       memory-size 58G
       default-ttl 0

       storage-engine device {
               file /opt/aerospike/data/ns.data
               file /opt/aerospike/data2/ns.data
               filesize 90G
               data-in-memory true

               defrag-lwm-pct 50
               defrag-startup-minimum 1
       }
}

#5

These have been adjusted above safe values. Could you provide the output of:

asadm -e "info"

In the first log line you provided, dq 4547 indicates that NSUP is scheduling deletes, which is the result of either expiration or eviction. Eviction would be interesting, since it would indicate that the disk, the memory, or both are nearly full relative to your high-water-{memory,disk} settings.
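To make the high-water point concrete, the percentages from the posted namespace config can be turned into absolute thresholds. This is only a sketch: it assumes the memory percentages apply to memory-size and the disk percentage applies to the combined size of the two 90G files, and the helper name is hypothetical.

```python
def threshold_g(capacity_g, pct):
    """Return the absolute threshold (in G) for a percentage setting."""
    return capacity_g * pct / 100


# Values taken from the namespace block posted in #4.
memory_size_g = 58        # memory-size 58G
disk_size_g = 2 * 90      # two files, filesize 90G each (assumed to add up)

memory_hwm_g = threshold_g(memory_size_g, 99)    # high-water-memory-pct 99
disk_hwm_g = threshold_g(disk_size_g, 99)        # high-water-disk-pct 99
stop_writes_g = threshold_g(memory_size_g, 99)   # stop-writes-pct 99

print(f"memory high-water: {memory_hwm_g:.2f}G of {memory_size_g}G")
print(f"disk high-water:   {disk_hwm_g:.2f}G of {disk_size_g}G")
print(f"stop-writes:       {stop_writes_g:.2f}G of {memory_size_g}G")
```

Under these assumptions, eviction would only start with about 0.58G of memory left in the namespace, and the stop-writes threshold sits at the same point as the memory high-water mark.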

Further questions:

  1. What is the transaction timeout policy?
  2. Were there ongoing migrations while these timeouts were present? If so, and if the migrations have completed, are the timeouts still happening?
  3. Are all 14 namespaces replication factor 1?

#6

These have been adjusted above safe values.

Yes, they have been left that way…

What is the transaction timeout policy?

Default.

Were there ongoing migrations while these timeouts were present? If so, and if the migrations have completed, are the timeouts still happening?

Unfortunately, I can’t say. I have already downgraded the server because we need it to run smoothly.

Are all 14 namespaces replication factor 1?

One namespace has replication factor 2.

asadm -e "info"

The output is too large to post here. Please see https://dropmefiles.com/SYxzv


#8

Is there any news about the problem?


#9

We haven’t yet observed this problem ourselves and have only one reference to it. You mentioned that downgrading has resolved your issue; at this time I cannot offer you a better solution. If you have a support contract, I recommend starting a case at https://support.aerospike.com/; they will have the resources to dive much deeper into the problem you are experiencing.