Intermittent network failure

fdnieves · May 15, 2017, 10:17pm

Hi,

We just installed Aerospike and configured a single node cluster.

We use Aerospike Community edition 3.12.1 on Ubuntu 16.04.

First of all, we found some problems with the new systemd service manager. It is not very reliable: sometimes when we try to stop the Aerospike service, it takes a long time and does not actually shut down. But the real problem is that this single-node cluster fails most of the time when we query it. For example, look at this ordinary aql command from the command line:

$ aql -h 172.18.50.50
2017-05-15 17:08:10 WARN Failed to connect to seed 172.18.50.50 3000. AEROSPIKE_ERR_TIMEOUT , 172.18.50.50:3000
Error -1: Failed to connect

The error log does not show us any details of the error. We don’t know where to look; we tried several configuration changes but still have the same problem. Any help will be much appreciated!!

Thanks,

pgupta · May 15, 2017, 10:30pm

What does your /etc/aerospike/aerospike.conf file look like? Are you running aql from a server separate from the single-node cluster (at 172.18.50.50)? Can you ping 172.18.50.50 from the server on which you are running aql? I.e., is this an Aerospike config issue or a network issue?

fdnieves · May 15, 2017, 10:51pm

Everything is being done from the same server where Aerospike is running (172.18.50.50). I’m pretty sure this is an Aerospike config issue because the network works without problems (we are not even going outside; that IP belongs to the same server that aql is run from). And yes, ping works, as it is the same server:

$ ping 172.18.50.50
PING 172.18.50.50 (172.18.50.50) 56(84) bytes of data.
64 bytes from 172.18.50.50: icmp_seq=1 ttl=64 time=0.021 ms
64 bytes from 172.18.50.50: icmp_seq=2 ttl=64 time=0.028 ms
64 bytes from 172.18.50.50: icmp_seq=3 ttl=64 time=0.014 ms

This is the aerospike config file:

# Aerospike database configuration file.

service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        proto-fd-max 15000
}

logging {
        # Log file must be an absolute path.
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address 172.18.50.50
                port 3000
                access-address 172.18.50.50
        }

        heartbeat {
                # mode multicast
                # multicast-group 239.1.99.222
                # port 9918
                mode mesh
                address 172.18.50.50
                port 3002 # Heartbeat port for this node.

                # List one or more other nodes, one ip-address & port per line:
                #asd1
                mesh-seed-address-port 172.18.50.9 3002

                # To use unicast-mesh heartbeats, remove the 3 lines above, and see
                # aerospike_mesh.conf for alternative.

                interval 150
                timeout 10
        }

        fabric {
                port 3001
                address 172.18.50.50
        }

        info {
                port 3003
                address 172.18.50.50
        }
}

namespace recommender {
       replication-factor 1
       memory-size 8G
       default-ttl 1m # 30 days, use 0 to never expire/evict.
       conflict-resolution-policy last-update-time

       storage-engine device {
              file /var/data/data.dat
              filesize 10G
              data-in-memory true # Store data in memory in addition to file.
      }
}

Albot · May 15, 2017, 11:03pm

Can you share the ifconfig output of your server? And a traceroute to that IP?

pgupta · May 15, 2017, 11:15pm

Also, can you try:

$ aql

(without specifying -h option, use default localhost, 3000)

Is this an AWS EC2 instance?

fdnieves · May 16, 2017, 1:24pm

Hi, here are the command outputs:

ens1      Link encap:Ethernet  HWaddr 68:05:ca:1a:c4:eb  
          inet addr:172.18.50.50  Bcast:172.18.50.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:7425708 errors:0 dropped:0 overruns:0 frame:0
          TX packets:869551 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:11055733597 (11.0 GB)  TX bytes:126693220 (126.6 MB)
          Interrupt:16 Memory:92e80000-92ea0000

$ traceroute 172.18.50.50
traceroute to 172.18.50.50 (172.18.50.50), 30 hops max, 60 byte packets
 1  172.18.50.50 (172.18.50.50)  0.021 ms  0.006 ms  0.004 ms

If I don’t specify the -h option, we get the same error. Is there anything else I can provide?

Thanks!

fdnieves · May 17, 2017, 7:38pm

Hi!! Any help?

Albot · May 18, 2017, 1:49am

@fdnieves, the community forums can provide help on a best-effort basis. If you require urgent production-level support, you can set up a contract with Aerospike to help you. They are there 24x7 for any issues…

That being said, even though I’m not a member of the staff, I’m happy to help but please keep in mind that these troubleshooting steps and updates may not be extremely timely…

I’m curious now to know whether Aerospike is bound to that port at all. Can you run ‘netstat -tunap’ and post the output? Maybe something else has the port? Also, can you post the log file so that we can see if anything stands out? Maybe we can catch something you missed.
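
As a complement to the netstat check, here is a minimal sketch of a TCP reachability probe, assuming Python 3 is available on the server. The helper name `can_connect` and the one-second default timeout are illustrative choices, not part of any Aerospike tooling.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("172.18.50.50", 3000)` several times in a row would show whether the service port accepts connections only some of the time, which helps separate a listener problem from an aql problem.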

fdnieves · May 18, 2017, 7:42pm

Thanks for your answer Albot. This is just a side-project development so no need to hire production level support right now.

The intermittent aql failures were due to a wrong configuration. As this is a single node (and Aerospike only runs on this machine), the “mesh-seed-address-port 172.18.50.9 3002” line we had specified pointed to a nonexistent server. It seems that was interfering with the networking. After commenting out that line, aql connects every time without problems.

Thanks for the help!

Albot · May 20, 2017, 12:28am

Wonderful news!! That is an interesting symptom for that misconfiguration though.
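
For anyone hitting the same symptom, here is a rough sketch of how the active mesh seed entries could be pulled out of a config file for review, assuming the plain-text layout of the aerospike.conf posted in this thread. The function name `mesh_seeds` is hypothetical and the parsing is deliberately simple: it is not an official Aerospike config parser.

```python
from typing import List, Tuple


def mesh_seeds(conf_text: str) -> List[Tuple[str, int]]:
    """Collect active mesh-seed-address-port entries from aerospike.conf text."""
    seeds = []
    for raw in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        parts = line.split()
        # Expect exactly: mesh-seed-address-port <address> <port>
        if len(parts) == 3 and parts[0] == "mesh-seed-address-port":
            seeds.append((parts[1], int(parts[2])))
    return seeds
```

Each returned (address, port) pair can then be checked by hand, or with a plain TCP connect such as `socket.create_connection`, to confirm that the seed actually exists. On a single-node setup the list would normally be expected to be empty.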
