Aerospike_err_timeout - c-3.11.0.2


#1

Team,

We are seeing AEROSPIKE_ERR_TIMEOUT connection timeouts in our applications connecting to AS. The errors lasted for 10 minutes. Is there anything we can look at to find out what caused this? Where should I start?

service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        service-threads 16
        transaction-queues 16
        transaction-threads-per-queue 4
        migrate-threads 32
        migrate-xmit-priority 0
        migrate-read-priority 0
        migrate-xmit-hwm 150
        migrate-xmit-lwm 100
        proto-fd-max 15000
        proto-fd-idle-ms 7200000
}
network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode mesh
                port 3002
                mesh-seed-address-port merc1.regn.com. 3002
                mesh-seed-address-port merc2.regn.com. 3002
                mesh-seed-address-port merc3.regn.com. 3002
                mesh-seed-address-port merc4.regn.com. 3002
                mesh-seed-address-port merc5.regn.com. 3002
                mesh-seed-address-port merc6.regn.com. 3002

                interval 2500
                timeout 3
        }
        fabric {
                port 3001
        }
        info {
                port 3003
        }
}
logging {
        file /var/log/aerospike/aerospike.log {
                context any info
                context migrate debug
        }
}

namespace mercsedan {
        memory-size 60G
        replication-factor 2
        default-ttl 30d
        high-water-memory-pct 70
        stop-writes-pct 90
        storage-engine device {
                file /var/lib/aerospike/mercsedan.dat
                filesize 100G
                data-in-memory true
                write-block-size 128K
                defrag-lwm-pct 50
                defrag-startup-minimum 10
        }
}

#2

A good first step is to open a case with support if you have Enterprise support; they usually respond within the hour. If you don’t have support, what have you checked so far?

Does anything show up in grep -v INFO /var/log/aerospike/aerospike.log? Does asloglatency show any latency? Was the cluster stable during that time (grep -i size /var/log/aerospike/aerospike.log)? How about resources on the node during the issue: did sar -r, sar -q, sar -n DEV, and sar -d -p look normal? Which versions of the server and client are you using? Why do you have migrate-threads set to 32 in your config? Do you have Nagios statistics collection set up, or anywhere else you could see whether hot key or error stats were growing? Are you sure the timeouts were only to Aerospike and the clients were not suffering timeouts to anything else? Do you have the logging interface set up on the Aerospike client side?
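
If the raw grep output is too noisy to read, the two log checks above can be approximated with a small script that also buckets the matching lines per minute, which makes it easier to see whether they cluster around the 10-minute window. This is only a rough sketch: the function names are made up for illustration, and the width of 17 assumes each log line starts with a timestamp whose first 17 characters identify the minute. Adjust it to whatever your log lines actually look like.

```python
from collections import Counter
from typing import Iterable, List


def non_info_lines(lines: Iterable[str]) -> List[str]:
    """Rough equivalent of `grep -v INFO`: drop every line containing 'INFO'."""
    return [ln.rstrip("\n") for ln in lines if "INFO" not in ln]


def size_lines(lines: Iterable[str]) -> List[str]:
    """Rough equivalent of `grep -i size`: keep lines containing 'size', any case."""
    return [ln.rstrip("\n") for ln in lines if "size" in ln.lower()]


def count_by_prefix(lines: Iterable[str], width: int = 17) -> Counter:
    """Count lines per leading `width` characters.

    ASSUMPTION: the first `width` characters of a line are a timestamp
    truncated to the minute. Change `width` if your format differs.
    """
    return Counter(ln[:width] for ln in lines)


def summarize(path: str) -> None:
    """Print per-minute counts of non-INFO lines, in the order they first appear."""
    with open(path, errors="replace") as fh:
        all_lines = fh.readlines()
    for prefix, n in count_by_prefix(non_info_lines(all_lines)).items():
        print(n, prefix)
```

Calling summarize("/var/log/aerospike/aerospike.log") prints one count per minute bucket; swapping non_info_lines for size_lines in summarize does the same for the cluster size lines.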