SIGFPE received, aborting Aerospike

I tried to deploy the community edition of aerospike on kubernetes using the helm chart. I am seeing following error. Can someone point me to the possible reason?

This error is only happening on pod-0 of statefulsets. Other 2 pods are in running state but looks like waiting to connect to pod-0 which is dead.

Sep 24 2020 21:52:39 GMT: INFO (fabric): (socket.c:815) Started fabric endpoint 0.0.0.0:3001
Sep 24 2020 21:52:39 GMT: INFO (hb): (hb.c:7077) mtu of the network is 1496
Sep 24 2020 21:52:39 GMT: INFO (hb): (socket.c:815) Started mesh heartbeat endpoint 0.0.0.0:3002
Sep 24 2020 21:52:39 GMT: INFO (nsup): (nsup.c:187) starting namespace supervisor threads
Sep 24 2020 21:52:39 GMT: INFO (service): (service.c:908) starting reaper thread
Sep 24 2020 21:52:39 GMT: INFO (service): (socket.c:815) Started client endpoint 0.0.0.0:3000
Sep 24 2020 21:52:39 GMT: INFO (service): (service.c:193) starting accept thread
Sep 24 2020 21:52:39 GMT: INFO (info-port): (thr_info_port.c:298) starting info port thread
Sep 24 2020 21:52:39 GMT: INFO (info-port): (socket.c:815) Started info endpoint 0.0.0.0:3003
Sep 24 2020 21:52:39 GMT: INFO (as): (as.c:408) service ready: soon there will be cake!
Sep 24 2020 21:52:39 GMT: INFO (hb): (hb.c:6364) removing self seed entry host:aerospike-0.aerospike.default.svc.solutions3.diamanti.com port:3002
Sep 24 2020 21:52:39 GMT: INFO (hb): (hb.c:5740) removed mesh seed host:aerospike-0.aerospike.default.svc.solutions3.diamanti.com port 3002
Sep 24 2020 21:52:39 GMT: INFO (hb): (hb.c:4357) found redundant connections to same node, fds 27 22 - choosing at random
Sep 24 2020 21:52:40 GMT: INFO (clustering): (clustering.c:6355) principal node - forming new cluster with succession list: bb907801c00a28e
Sep 24 2020 21:52:40 GMT: INFO (clustering): (clustering.c:5794) applied new cluster key 5381145975f5
Sep 24 2020 21:52:40 GMT: INFO (clustering): (clustering.c:5796) applied new succession list bb907801c00a28e
Sep 24 2020 21:52:40 GMT: INFO (clustering): (clustering.c:5798) applied cluster size 1
Sep 24 2020 21:52:40 GMT: INFO (exchange): (exchange.c:2319) data exchange started with cluster key 5381145975f5
Sep 24 2020 21:52:40 GMT: INFO (exchange): (exchange.c:2671) exchange-compatibility-id: self 7 cluster-min 0 -> 7 cluster-max 0 -> 7
Sep 24 2020 21:52:40 GMT: INFO (exchange): (exchange.c:3219) received commit command from principal node bb907801c00a28e
Sep 24 2020 21:52:40 GMT: INFO (exchange): (exchange.c:3182) data exchange completed with cluster key 5381145975f5
Sep 24 2020 21:52:40 GMT: INFO (partition): (partition_balance.c:993) {test} replication factor is 1
Sep 24 2020 21:52:40 GMT: INFO (partition): (partition_balance.c:965) {test} rebalanced: expected-migrations (0,0,0) fresh-partitions 0
Sep 24 2020 21:52:49 GMT: INFO (info): (ticker.c:168) NODE-ID bb907801c00a28e CLUSTER-SIZE 1
Sep 24 2020 21:52:49 GMT: INFO (info): (ticker.c:249)    cluster-clock: skew-ms 0
Sep 24 2020 21:52:49 GMT: INFO (info): (ticker.c:272)    system: total-cpu-pct 0 user-cpu-pct 0 kernel-cpu-pct 0 free-mem-kbytes 121374476 free-mem-pct 92
Sep 24 2020 21:52:49 GMT: INFO (info): (ticker.c:290)    process: cpu-pct 1 heap-kbytes (1081944,1082472,1105920) heap-efficiency-pct 97.8
Sep 24 2020 21:52:49 GMT: INFO (info): (ticker.c:301)    in-progress: info-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0
Sep 24 2020 21:52:49 GMT: INFO (info): (ticker.c:322)    fds: proto (0,0,0) heartbeat (0,2,2) fabric (0,0,0)
Sep 24 2020 21:52:49 GMT: INFO (info): (ticker.c:330)    heartbeat-received: self 2 foreign 0
Sep 24 2020 21:52:49 GMT: INFO (info): (ticker.c:360)    fabric-bytes-per-second: bulk (0,0) ctrl (0,0) meta (0,0) rw (0,0)
Sep 24 2020 21:52:49 GMT: INFO (info): (ticker.c:423) {test} objects: all 0 master 0 prole 0 non-replica 0
Sep 24 2020 21:52:49 GMT: INFO (info): (ticker.c:483) {test} migrations: complete
Sep 24 2020 21:52:49 GMT: INFO (info): (ticker.c:501) {test} memory-usage: total-bytes 0 index-bytes 0 sindex-bytes 0 data-bytes 0 used-pct 0.00
Sep 24 2020 21:52:49 GMT: INFO (info): (ticker.c:560) {test} device-usage: used-bytes 0 avail-pct 99
Sep 24 2020 21:52:57 GMT: WARNING (as): (signal.c:129) SIGFPE received, aborting Aerospike Community Edition build 5.0.0.4 os debian9
Sep 24 2020 21:52:57 GMT: WARNING (as): (log.c:604) stacktrace: registers: rax 0000000000000000 rbx 00007fbb6f5f6e14 rcx 0000000000000000 rdx 0000000000000000 rsi 0000000000000000 rdi 00007fbbb380d008 rbp 00005601a972c300 rsp 00007fbb6f5f6c80 r8 0000000000000001 r9 0000000000000000 r10 0000000000000000 r11 000e6e26eb55e7c8 r12 00005601a9734300 r13 00005601a972b240 r14 00007fbbb380d050 r15 00007fbb6f5f6e10 rip 00005601a9294250
Sep 24 2020 21:52:57 GMT: WARNING (as): (log.c:617) stacktrace: found 8 frames: 0x169c33 0xb9554 0x7fbbfc1a00e0 0xc8250 0xc85a3 0x157060 0x7fbbfc1964a4 0x7fbbfb015d0f offset 0x5601a91cc000
Sep 24 2020 21:52:57 GMT: WARNING (as): (log.c:627) stacktrace: frame 0: asd(cf_log_stack_trace+0xe8) [0x5601a9335c33]
Sep 24 2020 21:52:57 GMT: WARNING (as): (log.c:627) stacktrace: frame 1: asd(as_sig_handle_fpe+0x62) [0x5601a9285554]
Sep 24 2020 21:52:57 GMT: WARNING (as): (log.c:627) stacktrace: frame 2: /lib/x86_64-linux-gnu/libpthread.so.0(+0x110e0) [0x7fbbfc1a00e0]
Sep 24 2020 21:52:57 GMT: WARNING (as): (log.c:627) stacktrace: frame 3: asd(+0xc8250) [0x5601a9294250]
Sep 24 2020 21:52:57 GMT: WARNING (as): (log.c:627) stacktrace: frame 4: asd(+0xc85a3) [0x5601a92945a3]
Sep 24 2020 21:52:57 GMT: WARNING (as): (log.c:627) stacktrace: frame 5: asd(+0x157060) [0x5601a9323060]
Sep 24 2020 21:52:57 GMT: WARNING (as): (log.c:627) stacktrace: frame 6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4) [0x7fbbfc1964a4]
Sep 24 2020 21:52:57 GMT: WARNING (as): (log.c:627) stacktrace: frame 7: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fbbfb015d0f]

I ran the following against your build:

addr2line -fie /usr/bin/asd 0x169c33 0xb9554 0x7fbbfc1a00e0 0xc8250 0xc85a3 0x157060 0x7fbbfc1964a4 0x7fbbfb015d0f

Which returned:

cf_log_stack_trace
/work/source/cf/src/log.c:609
set_handler
/work/source/as/src/base/signal.c:79
reraise_signal
/work/source/as/src/base/signal.c:94
as_sig_handle_fpe
/work/source/as/src/base/signal.c:132
??
??:0
select_sid
/work/source/as/src/base/service.c:524
select_sid_adq
/work/source/as/src/base/service.c:543
assign_socket
/work/source/as/src/base/service.c:494
ck_pr_faa_64
/work/source/modules/common/target/Linux-x86_64/include/aerospike/ck/gcc/x86_64/ck_pr.h:308
cf_atomic64_add
/work/source/modules/common/target/Linux-x86_64/include/citrusleaf/cf_atomic.h:84
run_accept
/work/source/as/src/base/service.c:467
detached_shim_fn
/work/source/cf/src/cf_thread.c:218
??
??:0
??
??:0

This would indicates that this code:

static uint32_t
select_sid(void)
{
	static uint32_t rr = 0;

	return rr++ % g_config.n_service_threads;
}

triggered an illegal arithmetic operation (SIGFPE).

I take this to mean that the n_service_threads configuration was somehow zero. But I haven’t been able to find a way for that to be true.

I have few modification to helm chart: 1- cluster domain 2- use my own storage class for PV

Let me know if that might help to narrow down.

Is there a stable version of chart/images which might work?

As far as I can tell, this shouldn’t happen, the code prohibits that configuration from ever being zero.

Is this issue reproducible?

Is the RAM on the node ECC?

Hi Kevin,

Gmail filtered this messages so couldnt see till now. and these messages were not showing up on the discussion thread. I am not falimir with "hardware logging context. Do you want to me to put that json block as it is in the conf file? or add some hardware detail there?

Thanks, -Arvind

Sorry for the confusion, I removed those comments because they were a dead end. If the server had found zero cpus it would have crashed well before the point where this crash occurred.

But I am interested in these questions:

yes issue is reproducible everytime. Regarding RAM I am not sure but I am assuming they are ECC, these are Intel 1U servers.

  • Could you share your /etc/aerospike/aerospike.conf?
  • Are you configuring auto-pin adq?

I am using the mostly default settings from helm chart (except storage configuration and kubernetes cluster domain)

Here is my config file from container log:

Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768) # Aerospike database configuration file.
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768) # This stanza must come first.
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768) service {
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   cluster-name aero-aerospike
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   user root
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   group root
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   pidfile /var/run/aerospike/asd.pid
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   proto-fd-max 15000
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768) }
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768) logging {
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   # Log file must be an absolute path.
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   file /dev/null {
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           context any info
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   }
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   # Send log messages to stdout
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   console {
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           context any info
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   }
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768) }
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768) network {
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   service {
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           address any
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           port 3000
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           # Uncomment the following to set the `access-address` parameter to the
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           # IP address of the Docker host. This will the allow the server to correctly
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           # publish the address which applications and other nodes in the cluster to
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           # use when addressing this node.
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           # access-address <IPADDR>
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   }
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   heartbeat {
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)         address any
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           # mesh is used for environments that do not support multicast
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           mode mesh
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           port 3002
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           # use asinfo -v 'tip:host=<ADDR>;port=3002' to inform cluster of
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           # other mesh nodes
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           # mesh-seed-placeholder
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           mesh-seed-address-port aero-aerospike-0.aero-aerospike.default.svc.solutions3.diamanti.com 3002
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           interval 150
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           timeout 10
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   }
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   fabric {
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)       address any
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           port 3001
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   }
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   info {
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)       address any
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           port 3003
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   }
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768) }
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768) namespace test {
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   replication-factor 2
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   memory-size 1G
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   default-ttl 0
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768) # storage-engine memory
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   # To use file storage backing, comment out the line above and use the
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   # following lines instead.
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   storage-engine device {
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           file /opt/aerospike/data/aero-aerospike-0-test.dat
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           filesize 1G
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)           data-in-memory true # Store data in memory in addition to file.
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768)   }
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3768) }
Sep 28 2020 18:35:50 GMT: INFO (config): (cfg.c:3795) system file descriptor limit: 1048576, proto-fd-max: 15000

Ok, doesn’t look like you are using ADQ, so the stack trace I printed doesn’t make sense. In the same container, could you run:

addr2line -fie /usr/bin/asd 0x169c33 0xb9554 0x7fbbfc1a00e0 0xc8250 0xc85a3 0x157060 0x7fbbfc1964a4 0x7fbbfb015d0f

Hi @kporter, for some reason when I deployed it again last night now aerospike cluster is working. I don’t see any error. Which is good but same time very odd as I have not changed/fixed anything to make it work. Not sure if its helpful now but below is the output of cmd you asked for:

root@aero-aerospike-0:/# addr2line -fie /usr/bin/asd 0x169c33 0xb9554 0x7fbbfc1a00e0 0xc8250 0xc85a3 0x157060 0x7fbbfc1964a4 0x7fbbfb015d0f
cf_log_stack_trace
/work/source/cf/src/log.c:609
set_handler
/work/source/as/src/base/signal.c:79
reraise_signal
/work/source/as/src/base/signal.c:94
as_sig_handle_fpe
/work/source/as/src/base/signal.c:132
??
??:0
select_sid
/work/source/as/src/base/service.c:524
select_sid_adq
/work/source/as/src/base/service.c:543
assign_socket
/work/source/as/src/base/service.c:494
ck_pr_faa_64
/work/source/modules/common/target/Linux-x86_64/include/aerospike/ck/gcc/x86_64/ck_pr.h:308
cf_atomic64_add
/work/source/modules/common/target/Linux-x86_64/include/citrusleaf/cf_atomic.h:84
run_accept
/work/source/as/src/base/service.c:467
detached_shim_fn
/work/source/cf/src/cf_thread.c:218
??
??:0
??
??:0