Aerospike not Starting (SIGABRT) (AER-3946) [Released] [Resolved]

Hello,

We're successfully running 2 Aerospike nodes (within Docker) on the same host (different ports, --net=host). If we start a 3rd Aerospike node, it does not start. Here's the trace:

aerorestore_1 | Jul 01 2015 18:12:44 GMT: INFO (tsvc): (thr_tsvc.c::916) shared queues: 4 queues with 4 threads each
aerorestore_1 | Jul 01 2015 18:12:44 GMT: INFO (hb): (hb.c::2459) heartbeat socket initialization
aerorestore_1 | Jul 01 2015 18:12:44 GMT: INFO (hb): (hb.c::2473) initializing mesh heartbeat socket : 0.0.0.0:5002
aerorestore_1 | Jul 01 2015 18:12:44 GMT: INFO (info): (thr_info.c::5276)  static external network definition
aerorestore_1 | Jul 01 2015 18:12:44 GMT: CRITICAL (info): (thr_info.c:info_interfaces_static_fn:5290) external address: is not matching with any of service addresses:(null)
aerorestore_1 | Jul 01 2015 18:12:44 GMT: WARNING (as): (signal.c::93) SIGABRT received, aborting Aerospike Enterprise Edition build 3.5.14
aerorestore_1 | Jul 01 2015 18:12:44 GMT: WARNING (as): (signal.c::95) stacktrace: found 8 frames
aerorestore_1 | Jul 01 2015 18:12:44 GMT: WARNING (as): (signal.c::95) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_abort+0x54) [0x4894d3]
aerorestore_1 | Jul 01 2015 18:12:44 GMT: WARNING (as): (signal.c::95) stacktrace: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x321e0) [0x7f2434a911e0]
aerorestore_1 | Jul 01 2015 18:12:44 GMT: WARNING (as): (signal.c::95) stacktrace: frame 2: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f2434a91165]
aerorestore_1 | Jul 01 2015 18:12:44 GMT: WARNING (as): (signal.c::95) stacktrace: frame 3: /lib/x86_64-linux-gnu/libc.so.6(abort+0x180) [0x7f2434a943e0]
aerorestore_1 | Jul 01 2015 18:12:44 GMT: WARNING (as): (signal.c::95) stacktrace: frame 4: /usr/bin/asd(cf_fault_event+0x229) [0x51d3db]
aerorestore_1 | Jul 01 2015 18:12:44 GMT: WARNING (as): (signal.c::95) stacktrace: frame 5: /usr/bin/asd(info_interfaces_static_fn+0xd5) [0x4a9bc2]
aerorestore_1 | Jul 01 2015 18:12:44 GMT: WARNING (as): (signal.c::95) stacktrace: frame 6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f24358abb50]
aerorestore_1 | Jul 01 2015 18:12:44 GMT: WARNING (as): (signal.c::95) stacktrace: frame 7: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f2434b3a95d]
aerorestore_1 | /startup.sh: line 7:     9 Aborted                 (core dumped) /usr/bin/asd --foreground

If you need further info, let me know.

This is telling you that the access-address is not a real address on this machine. For Docker you often need to specify this parameter with the virtual flag. Oddly, I would expect the output to be "external address: ADDRESS is not…", but somehow the address is null. Could you share your config?
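
For reference, the stanza would look something along these lines (NODE_IP standing in for the address you want published):

network {
    service {
        address any
        port 3000
        access-address NODE_IP virtual
    }
}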

Hello,

Sorry, you're right, that was on our side. This is the correct stack trace:

aerorestore_1 | Jul 02 2015 02:37:13 GMT: INFO (drv_ssd): (drv_ssd.c::1840) ns user starting write worker threads
aerorestore_1 | Jul 02 2015 02:37:13 GMT: INFO (drv_ssd): (drv_ssd.c::902) ns user starting defrag threads
aerorestore_1 | Jul 02 2015 02:37:13 GMT: INFO (tsvc): (thr_tsvc.c::916) shared queues: 4 queues with 4 threads each
aerorestore_1 | Jul 02 2015 02:37:13 GMT: INFO (hb): (hb.c::2459) heartbeat socket initialization
aerorestore_1 | Jul 02 2015 02:37:13 GMT: INFO (hb): (hb.c::2473) initializing mesh heartbeat socket : 0.0.0.0:5002
aerorestore_1 | Jul 02 2015 02:37:13 GMT: INFO (info): (thr_info.c::5276)  static external network definition
aerorestore_1 | Jul 02 2015 02:37:13 GMT: WARNING (as): (signal.c::160) SIGSEGV received, aborting Aerospike Enterprise Edition build 3.5.14
aerorestore_1 | Jul 02 2015 02:37:13 GMT: WARNING (as): (signal.c::162) stacktrace: found 6 frames
aerorestore_1 | Jul 02 2015 02:37:13 GMT: WARNING (as): (signal.c::162) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x54) [0x4895fd]
aerorestore_1 | Jul 02 2015 02:37:13 GMT: WARNING (as): (signal.c::162) stacktrace: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x321e0) [0x7f55ad2921e0]
aerorestore_1 | Jul 02 2015 02:37:13 GMT: WARNING (as): (signal.c::162) stacktrace: frame 2: /lib/x86_64-linux-gnu/libc.so.6(+0x115348) [0x7f55ad375348]
aerorestore_1 | Jul 02 2015 02:37:13 GMT: WARNING (as): (signal.c::162) stacktrace: frame 3: /usr/bin/asd(info_interfaces_static_fn+0xa5) [0x4a9b92]
aerorestore_1 | Jul 02 2015 02:37:13 GMT: WARNING (as): (signal.c::162) stacktrace: frame 4: /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f55ae0acb50]
aerorestore_1 | Jul 02 2015 02:37:13 GMT: WARNING (as): (signal.c::162) stacktrace: frame 5: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f55ad33b95d]
aerorestore_1 | /startup.sh: line 7:     8 Segmentation fault      (core dumped) /usr/bin/asd --foregroun

PS: Now two of our Docker containers crashed (SIGABRT?), and it seems they are unstartable again (same stack trace). Sadly, we have no logs for them because they were not mapped to the host…
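
In hindsight, mapping the log directory to the host would have preserved them, e.g. (a sketch; AEROSPIKE_IMAGE is a placeholder for the image we use):

docker run --net=host \
    -v /var/log/aerospike:/var/log/aerospike \
    AEROSPIKE_IMAGE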

Update:

We're running on OVH, and we read/thought they use a special network setup to allow assigning all IPs of a block (including the network address and broadcast address), as long as you use the subnet mask 255.255.255.255, as we did. Maybe we misunderstood, maybe it's buggy. Anyway, this seems to be the issue that raised the stack trace. We'll investigate further and keep you updated.

Ah, thanks, keep us posted.

Hello,

The problem was solved by using virtual with access-address. Sadly, we somehow skipped this hint in the second post. Thank you for this! We still don't get why this was causing issues even on the main host and why this problem came out of nowhere. Suggestion: maybe it would be helpful to add a null check and raise a message instead of just crashing the daemon?
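
In hindsight, a quick check would have shown the problem up front (a sketch; NODE_IP is the placeholder used in our config):

# If the configured access-address does not appear among the local addresses,
# asd cannot match it to an interface and (without 'virtual') aborts at startup.
ip -o -4 addr show | grep -F "NODE_IP" \
    || echo "NODE_IP is not assigned locally; add 'virtual' to access-address"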

Another update: somehow AMC still shows some strange stuff:

Anyway, it seems to work now.

Were all three configurations the same? I'm not sure how you were able to get a null for the access-address.

Yes, I have filed a ticket, but the configuration that caused the message would be useful.

Hm, are you seeing any proxies in the latency tab or statistics?

Again I think your configuration would be helpful.

Hey,

It's now working stably again; all we did was add "virtual" to "access-address NODE_IP". I thought further about what we did to our servers. The only thing that came to mind was that we migrated IPs from another server to our Aerospike servers. During this step the network was reinitialized, maybe even the routers at our hoster. After the network had stabilized, and even after the physical servers were rebooted (just to be sure), the issue appeared out of nowhere. I double-checked that our Aerospike config had not changed; we're using version control for our server configs, and it confirmed this. Maybe something went wrong during the IP migration? Maybe it was wrong the whole time, and because the IP migration reinitialized our hoster's routers, 'virtual' became necessary? The strange thing about it is that it affected both of the servers we were running… Sorry, I don't have any clue; maybe you're able to figure something out. Here's our config:

# Aerospike database configuration file for deployments using XDR.

service {
    user root
    group root
    pidfile /var/run/aerospike/asd.pid
    service-threads 4
    transaction-queues 4
    transaction-threads-per-queue 4
    proto-fd-max 1024
    migrate-xmit-hwm 200
    migrate-threads 8
    scan-priority 2000
}

logging {
    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
        context any info
    }
    
    file /var/log/aerospike/aerospike-crit.log {
        context any critical
    }

    file /var/log/aerospike/aerospike-warn.log {
        context any warning
    }

    console {
        context any info
    }
}

network {
    service {
        address any
        port 3000

        # Set the `access-address` parameter to the IP address of the Docker
        # host. This allows the server to correctly publish the address that
        # applications and other nodes in the cluster should use when
        # addressing this node.
        access-address NODE_IP virtual
    }
    heartbeat {
        mode mesh                   # Send heartbeats using Mesh (Unicast) protocol
        address any                # IP of the NIC on which this node is listening
                                    # to heartbeat
        port 3002                   # port on which this node is listening to
                                    # heartbeat
        mesh-seed-address-port NODE1_IP 3002 # IP address for seed node in the cluster
        mesh-seed-address-port NODE2_IP 3002

        interval 150                # Number of milliseconds between heartbeats
        timeout 20                  # Number of heartbeat intervals to wait
                                    # before timing out a node
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}


namespace ns1 {
#       enable-xdr true # Enable replication for this namespace.
#       xdr-remote-datacenter REMOTE_DC_2
        replication-factor 2
        memory-size 20G
        default-ttl 0
        single-bin true

        storage-engine device {
                file /opt/aerospike/data/ns1.dat
                data-in-memory false # Store data in memory in addition to file.
                filesize 390G
        }
}

namespace ns2 {
        replication-factor 2
        memory-size 1G
        default-ttl 0
        single-bin true

        storage-engine device {
                file /opt/aerospike/data/ns2.dat
                filesize 10G
                cold-start-empty true
        }
}

namespace ns3 {
#       enable-xdr true # Enable replication for this namespace.
#       xdr-remote-datacenter REMOTE_DC_2
        replication-factor 2
        memory-size 2G
        default-ttl 0
        single-bin true

        storage-engine device {
                file /opt/aerospike/data/ns3.dat
                data-in-memory true # Store data in memory in addition to file.
                filesize 10G
                #cold-start-empty true
        }
}

Absolutely! We used the same configuration in Docker (we're using --net=host), on all participants of the cluster. The only thing that was replaced was the IPs of the nodes.

I don't see any proxy stuff in AMC.

Thank you for your great support!


Would it be possible to get the network portion of the aerospike.conf for each of the Docker containers? Feel free to mask the IPs by replacing the first 3 octets with an X (i.e. X.X.X.Y). You should also be able to add all 3 of the IPs as seed nodes to each of the config files:

mesh-seed-address-port NODE1_IP 3002 # IP address for seed node in the cluster
mesh-seed-address-port NODE2_IP 3002
mesh-seed-address-port NODE3_IP 3002

I'm also assuming that NODE1_IP, NODE2_IP, and NODE3_IP are the IPs used in the 3 different access-address virtual entries. Are all the mesh seeds using port 3002, or do you have some using other ports?
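
You could also quickly sanity-check the heartbeat ports from any host (a sketch, assuming nc is available and all seeds use 3002):

for ip in NODE1_IP NODE2_IP NODE3_IP; do
    nc -z -w5 "$ip" 3002 && echo "$ip:3002 reachable" || echo "$ip:3002 unreachable"
done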

To check the IPs published by each node of the cluster, you could run the aerospike-tools Docker container

https://registry.hub.docker.com/u/aerospike/aerospike-tools/

and run:

docker run -ti aerospike/aerospike-tools asadm -e "asinfo -v service" -h NODE1_IP

and

docker run -ti aerospike/aerospike-tools asadm -e "asinfo -v services" -h NODE1_IP
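
(For reference: the 'service' value returns the address:port this node publishes for clients, while 'services' lists the peer addresses it has learned from the rest of the cluster, so a mismatch between the two usually points at an access-address problem.)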

Hello lucien,

We're only running two nodes.

I'm also assuming that NODE1_IP, NODE2_IP, and NODE3_IP are the IPs used in the 3 different access-address virtual entries.

Yeah, NODE_IP is replaced by the specific node IP (NODE1_IP or NODE2_IP).

The IPs we are using are:

x.x.90.112
x.x.90.117

/etc/network/interfaces looks like this:

iface eth0 inet static
        address x.x.90.112
        netmask 255.255.255.0
        network x.x.90.0
        broadcast x.x.90.255
        gateway x.x.90.254

Both masked octets (x.x) are the same for node1 and node2.
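
Since the interfaces file only shows the primary address, what is actually assigned at runtime is worth checking; the migrated IPs from the block should show up as extra /32 entries (matching the 255.255.255.255 netmask mentioned above):

# List every IPv4 address currently assigned to eth0.
ip -o -4 addr show dev eth0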

I produced the following script, which hopefully describes our environment a bit better:

#!/bin/bash
IP="NODE_IP"
CONT=containerName

# Print the command with the first two IP octets masked, then run it and apply
# the same masking to its output.
function run
{
        echo $USER"#" $@ | sed -r 's/([0-9]{1,3}\.){2}/x.x./g'
        eval $@ | sed -r 's/([0-9]{1,3}\.){2}\b/x.x./g'
}

#aero-tools
run docker run -ti aerospike/aerospike-tools asadm -e \"asinfo -v service\" -h $IP
run docker run -ti aerospike/aerospike-tools nc -z -w5 $IP 3000 \&\& echo $?

#aero-tools net=host
run docker run -ti --net=host aerospike/aerospike-tools asadm -e \"asinfo -v service\" -h $IP
run docker run -ti --net=host aerospike/aerospike-tools nc -z -w5 $IP 3000 \&\& echo $?

#running docker-cont
run docker exec -ti $(docker ps | grep $CONT | head -1 | awk '{print $1}') asadm -e \"asinfo -v service\" -h $IP
run docker exec -ti $(docker ps | grep $CONT | head -1 | awk '{print $1}') nc -z -w5 $IP 3000 \&\& echo $?

#direct-host
run asadm -e \"asinfo -v service\" -h $IP
run nc -z -w5 $IP 3000 \&\& echo $?

Here's the output:

root# docker run -ti aerospike/aerospike-tools asadm -e "asinfo -v service" -h x.x.90.112
x.x.90.112 (x.x.90.112) returned:
x.x.90.112:3000

x.x.90.117 (x.x.90.117) returned:
Invalid command or Could not connect to node x.x.90.117


root# docker run -ti aerospike/aerospike-tools nc -z -w5 x.x.90.112 3000 && echo 0
error: Command not found: nc
root# docker run -ti --net=host aerospike/aerospike-tools asadm -e "asinfo -v service" -h x.x.90.112
node2 (x.x.90.112) returned:
x.x.90.112:3000

node1 (x.x.90.117) returned:
x.x.90.117:3000

root# docker run -ti --net=host aerospike/aerospike-tools nc -z -w5 x.x.90.112 3000 && echo 0
error: Command not found: nc
root# docker exec -ti e4a726b555d8 asadm -e "asinfo -v service" -h x.x.90.112
node2 (x.x.90.112) returned:
x.x.90.112:3000

node1 (x.x.90.117) returned:
x.x.90.117:3000

root# docker exec -ti e4a726b555d8 nc -z -w5 x.x.90.112 3000 && echo 0
0
root# asadm -e "asinfo -v service" -h x.x.90.112
node2 (x.x.90.112) returned:
x.x.90.112:3000

node1 (x.x.90.117) returned:
x.x.90.117:3000

root# nc -z -w5 x.x.90.112 3000 && echo 0
0

I guess the reason your tools container is not able to connect is that it's not getting --net=host passed. Anyway, we're running all containers with --net=host.

Update: We figured out that there was a missing iptables rule for running Docker containers without --net=host. Anyway, this should not be related to the virtual issue, since we always run our containers with --net=host. Do you think further investigation is necessary? I mean, our environment stabilized after adding virtual, and if you add a null check (maybe with a critical log?) then Aerospike won't 'just' crash in the future. Furthermore, it's now even possible to run the 3rd Docker container with Aerospike on it.
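
For anyone hitting the same thing, the kind of rule that can go missing looks like this (a hypothetical sketch; the exact chains and interfaces depend on the environment):

# Allow bridged containers (no --net=host) to reach out via docker0/eth0 and
# receive replies back.
iptables -A FORWARD -i docker0 -o eth0 -j ACCEPT
iptables -A FORWARD -i eth0 -o docker0 -m state --state RELATED,ESTABLISHED -j ACCEPT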

Thanks for the update. We will definitely improve our logging with the input provided. Good catch on the missing iptables rules.

blonkel -

A JIRA has been filed to follow up on this; it's AER-3946 for your reference. Please stay tuned for updates on our progress.

Regards,

Maud

@blonkel:

Good news! We've released Aerospike 3.6.0, which features a number of improvements to batch-read, scan, etc., as well as numerous fixes, including AER-3946.

You can read more about this release in our Aerospike Server CE 3.6.0 release notes and download it here.

Please upgrade and let us know whether you still encounter your issue.