Intermittent network failure


#1

Hi,

We just installed Aerospike and configured a single-node cluster.

We use Aerospike Community edition 3.12.1 on Ubuntu 16.04.

First of all, we found some problems with the new systemd service manager. It is not very reliable: sometimes when we want to stop the Aerospike service, it takes a long time and the service does not shut down. But the real problem is that this single-node cluster fails most of the time when we query it. For example, look at this ordinary aql command from the command line:

$ aql -h 172.18.50.50
2017-05-15 17:08:10 WARN Failed to connect to seed 172.18.50.50 3000. AEROSPIKE_ERR_TIMEOUT , 172.18.50.50:3000
Error -1: Failed to connect

The error log does not show us any details of the error. We don’t know where to look; we tried several configuration changes but still have the same problem. Any help will be much appreciated!

Thanks,


#2

What does your /etc/aerospike/aerospike.conf file look like? Are you running aql from a server separate from the single-node cluster (at 172.18.50.50)? Can you ping 172.18.50.50 from the server on which you are running aql? In other words, is this an Aerospike config issue or a network issue?


#3

Everything is being done from the same server where Aerospike is running (172.18.50.50). I’m pretty sure this is an Aerospike config issue because the network works without problems (we are not even going outside; that IP belongs to the same server that aql is run on). And yes, ping works, since it is the same server:

$ ping 172.18.50.50
PING 172.18.50.50 (172.18.50.50) 56(84) bytes of data.
64 bytes from 172.18.50.50: icmp_seq=1 ttl=64 time=0.021 ms
64 bytes from 172.18.50.50: icmp_seq=2 ttl=64 time=0.028 ms
64 bytes from 172.18.50.50: icmp_seq=3 ttl=64 time=0.014 ms

This is the aerospike config file:

# Aerospike database configuration file.

service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        proto-fd-max 15000
}

logging {
        # Log file must be an absolute path.
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address 172.18.50.50
                port 3000
                access-address 172.18.50.50
        }

        heartbeat {
                # mode multicast
                # multicast-group 239.1.99.222
                # port 9918
                mode mesh
                address 172.18.50.50
                port 3002 # Heartbeat port for this node.

                # List one or more other nodes, one ip-address & port per line:
                #asd1
                mesh-seed-address-port 172.18.50.9 3002

                # To use unicast-mesh heartbeats, remove the 3 lines above, and see
                # aerospike_mesh.conf for alternative.

                interval 150
                timeout 10
        }

        fabric {
                port 3001
                address 172.18.50.50
        }

        info {
                port 3003
                address 172.18.50.50
        }
}

namespace recommender {
       replication-factor 1
       memory-size 8G
       default-ttl 1m # 30 days, use 0 to never expire/evict.
       conflict-resolution-policy last-update-time

       storage-engine device {
              file /var/data/data.dat
              filesize 10G
              data-in-memory true # Store data in memory in addition to file.
       }
}

#4

Can you share the ifconfig output of your server? And a traceroute to that IP?


#5

Also, can you try:

$ aql

(without specifying the -h option, so that it uses the default: localhost, port 3000)

Is this an AWS EC2 instance?


#6

Hi, here are the command outputs:

ens1      Link encap:Ethernet  HWaddr 68:05:ca:1a:c4:eb  
          inet addr:172.18.50.50  Bcast:172.18.50.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:7425708 errors:0 dropped:0 overruns:0 frame:0
          TX packets:869551 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:11055733597 (11.0 GB)  TX bytes:126693220 (126.6 MB)
          Interrupt:16 Memory:92e80000-92ea0000

$ traceroute 172.18.50.50
traceroute to 172.18.50.50 (172.18.50.50), 30 hops max, 60 byte packets
 1  172.18.50.50 (172.18.50.50)  0.021 ms  0.006 ms  0.004 ms

If I don’t specify the -h option, we get the same error. Is there anything else I can provide?

Thanks!


#7

Hi!! Any help?


#8

@fdnieves, the community forums can provide help on a best-effort basis. If you require urgent production-level support, you can set up a contract with Aerospike. They are available 24x7 for any issues…

That being said, even though I’m not a member of the staff, I’m happy to help but please keep in mind that these troubleshooting steps and updates may not be extremely timely…

I’m curious now to know whether Aerospike is bound to that port at all. Can you run ‘netstat -tunap’ and post the output? Maybe something else has the port? Also, can you post the log file so that we can see if anything stands out? Maybe we can catch something you missed.
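If netstat is not handy, a rough alternative is to try opening a TCP connection to each port yourself. This is only a minimal sketch using the Python standard library; the function names `can_connect` and `check_ports` are made up for illustration and are not part of any Aerospike tool:

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def check_ports(host, ports, timeout=2.0):
    """Map each port to True (accepting connections) or False."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

Calling check_ports("172.18.50.50", [3000, 3001, 3002, 3003]) would cover the service, fabric, heartbeat and info ports from the config posted above. A False for port 3000 would suggest that nothing is listening there, or that the connection is timing out.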


#9

Thanks for your answer, Albot. This is just a side project in development, so there is no need for production-level support right now.

The intermittent aql failures were due to a wrong configuration. As this is a single node (and Aerospike only runs on this machine), the line “mesh-seed-address-port 172.18.50.9 3002” pointed to a nonexistent server. It seems that this was interfering with the network a little. After commenting out that line, aql connects every time without problems.
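For anyone who wants to check their own config for the same mistake, here is a small sketch in Python (standard library only). It lists the active mesh-seed-address-port entries in a config text and reports the ones that do not accept a TCP connection. The helper names `find_mesh_seeds` and `unreachable_seeds` are hypothetical, and the parsing simply assumes one entry per line with `#` starting a comment, as in the config posted above:

```python
import re
import socket

# Matches an active "mesh-seed-address-port <host> <port>" line.
SEED_RE = re.compile(r"^\s*mesh-seed-address-port\s+(\S+)\s+(\d+)")


def find_mesh_seeds(conf_text):
    """Return (host, port) pairs for every uncommented mesh seed line."""
    seeds = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # drop everything after a '#'
        match = SEED_RE.match(line)
        if match:
            seeds.append((match.group(1), int(match.group(2))))
    return seeds


def unreachable_seeds(seeds, timeout=2.0):
    """Return the seeds that do not accept a TCP connection."""
    bad = []
    for host, port in seeds:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            bad.append((host, port))
    return bad
```

Reading /etc/aerospike/aerospike.conf and passing its text to find_mesh_seeds, then the result to unreachable_seeds, would have flagged 172.18.50.9:3002 in my case, assuming nothing was listening at that address.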

Thanks for the help!


#10

Wonderful news!! That is an interesting symptom for that misconfiguration though.