Error in delay connects after system ready message

benchmark

#1

Could you supply your configuration?


Node crash in AWS using the Aerospike AMI
#2

Hi kporter,

My config file is listed below. We are using mesh heartbeat for a multi-node Aerospike cluster.

network {
    service {
        address any
        port 3000
        access-address x.x.x.x
        network-interface-name eth0
    }

    heartbeat {
        #mode multicast
        #address 239.1.99.222
        #port 9918
        mode mesh
        port 3002
        address x.x.x.x
        mesh-seed-address-port x.x.x.x 3002
        mesh-seed-address-port x.x.x.x 3002

        interval 200
        timeout 20
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}

To fix the issue, I opened all the ports used for inter-cluster communication and increased the timeout value. Now everything is working fine.

I also ran into one issue while benchmarking without a VPC.

/opt/aerospike/bin$  sudo ./afterburner.sh 
Configuring: core: 0 ETH: eth0  IRQ:  AFFINITY: 1
./afterburner.sh: line 57: /proc/irq//smp_affinity: No such file or directory
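The failure above is the script's IRQ lookup coming back empty on this instance, which produces the `/proc/irq//smp_affinity` path with no IRQ number in it. A hypothetical guard for that case, assuming a `/proc/interrupts`-style lookup (the `find_irq` helper is mine, not part of afterburner.sh):

```shell
# find_irq: read /proc/interrupts-style text on stdin and print the IRQ
# number whose last column matches the given interface name; prints
# nothing when no row matches (the empty-IRQ case that breaks the script).
find_irq() {
    awk -v ifc="$1" '$NF == ifc { sub(/:$/, "", $1); print $1; exit }'
}

# Demonstration against a sample row rather than the live /proc/interrupts:
sample=' 27:     123456     xen-dyn-event     eth0'
irq=$(printf '%s\n' "$sample" | find_irq eth0)
echo "IRQ for eth0: ${irq:-none}"

# The guarded write the script is missing (requires root, shown commented):
# [ -n "$irq" ] && echo 1 > "/proc/irq/$irq/smp_affinity"
```

On Xen-based EC2 guests the NIC interrupts often don't appear under the plain interface name, so skipping the affinity pin with a warning is safer than writing to a malformed path.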

lsb_release -a
Distributor ID:    Ubuntu
Description:    Ubuntu 14.04.2 LTS
Release:    14.04
Codename:    trusty

/opt/aerospike/bin$ cat /proc/cpuinfo 
processor    : 0-3 (four core)
vendor_id    : GenuineIntel
cpu family    : 6
model        : 62
model name    : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping    : 4
microcode    : 0x428
cpu MHz        : 2494.080
cache size    : 25600 KB
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms

#3

Ah, good to hear you were able to get everything under control!


#4

@kporter

We are also getting lots of timeout errors in our production setup. We are using the Java client and storage-engine memory, with 2 Aerospike nodes in AWS. It happens at peak traffic. We tried pulling down one of the nodes as well, but it did not help. As soon as we restart, things stay fine for 10 minutes and then the timeouts start again. Please help us out; we are facing this problem in production. Our client policy timeout is 5 s with 1000 maxThreads. Many objects are around 500 KB to 1 MB+ in size. We are seeing the following error in the Java client:

timeout=5000,iterations=2,failedConn

Please find below the configuration file which we are using:

# Aerospike database configuration file.

service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    service-threads 2
    transaction-queues 2
    transaction-threads-per-queue 4
    proto-fd-max 15000
}

logging {
    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
        context any debug
    }
}

network {
    service {
        address any
        port 3000
    }

    heartbeat {
        mode mesh
        port 3002 # Heartbeat port for this node.

        # List one or more other nodes, one ip-address & port per line:
        mesh-seed-address-port 10.0.1.183 3002
        mesh-seed-address-port 10.0.2.183 3002
        #mesh-seed-address-port 10.10.10.12 3002
        #mesh-seed-address-port 10.10.10.13 3002
        #mesh-seed-address-port 10.10.10.14 3002

        interval 250
        timeout 30
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}

namespace aerospike {
    replication-factor 1
    memory-size 13G
    #ldt-enabled true
    #transaction-pending-limit 50
    #default-ttl 30d # 30 days, use 0 to never expire/evict.

    storage-engine memory

    # To use file storage backing, comment out the line above and use the
    # following lines instead.
    #storage-engine device {
    #    file /aerospike/data/bar.dat
    #    filesize 10G
    #    data-in-memory true # Store data in memory in addition to file.
    #}
}

#5

You should increase the number of service-threads. For in-memory-only namespaces, most transactions are handled by the service threads. Typically we recommend service-threads = number of cores.

Your transaction-queues and transaction-threads-per-queue are also very low, but these have less impact on your configuration. Normally we recommend transaction-queues = number of cores and transaction-threads-per-queue = 3.
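On a 4-core machine, for example, that advice would look like this in the service stanza (illustrative values, not tuned for your workload):

```
service {
    service-threads 4               # = number of cores (was 2)
    transaction-queues 4            # = number of cores (was 2)
    transaction-threads-per-queue 3
}
```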

If these do not help, seeing the output of the following command, run shortly after such an event, may help me further understand the problem.

sar -d -n DEV

#6

@kporter

Thanks for your response. The actual error we are getting is as follows:

com.aerospike.client.AerospikeException$Timeout: Client timeout: timeout=0 iterations=2 failedNodes=0 failedConns=0 lastNode=BB99387095DCF06 10.0.2.183:3000

at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:129)

at com.aerospike.client.AerospikeClient.get(AerospikeClient.java:496)

Also, we have 2 cores and CPU usage was almost nil at the time of the error. The timeouts were coming very frequently at peak traffic.

Please advise.


#7

Please see my previous response. I believe all of your threads (2) are busy handling other transactions and are unable to keep up with the incoming requests.

The sar command I provided was to look at your network and disk utilization, which normally saturate well before the CPU does.


#8

Missed the 2-core-servers bit :laughing:. Well, in that case I would probably need the output of sar. Are these Amazon instances? You could still try increasing service-threads and see if that helps at all.


#9

@kporter

Please find below the sar output on that machine:

Linux 3.14.35-28.38.amzn1.x86_64 (aero-10-0-1-23) 10/07/2015 x86_64 (2 CPU)

Also, we are using the EC2 r3.large instance type in a VPC.

Please let me know if you need more details.


#10

Seems to be missing; I would have expected something like:

sar -d -n DEV
Linux 2.6.32-279.19.1.el6.x86_64 (u23) 	10/06/2015 	_x86_64_	(4 CPU)

12:00:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:10:01 AM   dev8-16      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:10:01 AM    dev8-0      0.56      0.00      6.59     11.75      0.00      8.08      7.55      0.42
12:10:01 AM   dev8-32      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:20:01 AM   dev8-16      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:20:01 AM    dev8-0      0.55      0.00      6.40     11.70      0.00      7.99      7.67      0.42
...

12:00:01 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
12:10:01 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:10:01 AM      eth1      3.72      1.11      0.22      0.08      0.00      0.00      0.01
12:10:01 AM      eth2      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:10:01 AM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:20:01 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:20:01 AM      eth1      3.78      1.13      0.22      0.08      0.00      0.00      0.01
...

#11

@kporter,

Actually, we have removed the instance from our production environment and rolled back to memcached for the time being. We are trying to reproduce this error with production load in our staging environment. If we get the error again, then I will be able to provide you the sar output.

Can you give us some possible solutions? Should we increase the number of cores and service threads and try a bigger instance like r3.2xlarge (4 cores)? We want to achieve 5000 TPS with some large values of around 1 MB in our production.

Please advise.


#12

For 5k TPS of 1 MB objects you would need 10Gb networking. The instances you are discussing benchmark at up to 1 Gb: http://blog.flux7.com/blogs/benchmarks/benchmarking-network-performance-of-m1-and-m3-instances-using-iperf-tool
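As a payload-only back-of-envelope (ignoring replication, protocol overhead, and the 500 KB side of your object-size mix):

```shell
# Payload bandwidth estimate: TPS x record size (MB) x 8 bits per byte,
# giving the answer in Mbit/s. Assumes every record is a full 1 MB.
tps=5000
record_mb=1
echo "$(( tps * record_mb * 8 )) Mbit/s of payload"
```

Even if the real mix averages well under 1 MB per record, the required bandwidth is far beyond what a 1 Gb link can carry, so the 1 Gb instance classes are not an option for this workload.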

There are instances that support 10Gb: http://www.ec2instances.info/?filter=10%20Gigabit