Segfault on startup after upgrade


#1

After upgrading from 3.7.4.1 to 3.13.0.3, the server aborts on startup with:

Jul 03 2017 14:13:58 GMT: FAILED ASSERTION (proto): (transaction.c:415) unexpected transaction origin 0
Jul 03 2017 14:13:58 GMT: WARNING (as): (signal.c:210) SIGUSR1 received, aborting Aerospike Community Edition build 3.13.0.3 os debian8

Any ideas?


#2

Could you provide the stack trace that followed this exception?


#3

Sure,

Jun 30 2017 11:50:19 GMT: FAILED ASSERTION (proto): (transaction.c:415) unexpected transaction origin 0
Jun 30 2017 11:50:19 GMT: WARNING (as): (signal.c:210) SIGUSR1 received, aborting Aerospike Community Edition build 3.12.1.1 os debian8
Jun 30 2017 11:50:19 GMT: INFO (as): (signal.c:214) call stack: found 8 frames
Jun 30 2017 11:50:19 GMT: INFO (as): (signal.c:214) call stack: frame 0: /usr/bin/asd(as_sig_handle_usr1+0x31) [0x485087]
Jun 30 2017 11:50:19 GMT: INFO (as): (signal.c:214) call stack: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x350e0) [0x7fa317c7b0e0]
Jun 30 2017 11:50:19 GMT: INFO (as): (signal.c:214) call stack: frame 2: /lib/x86_64-linux-gnu/libpthread.so.0(raise+0x2b) [0x7fa318e4979b]
Jun 30 2017 11:50:19 GMT: INFO (as): (signal.c:214) call stack: frame 3: /usr/bin/asd(cf_fault_event+0x233) [0x5247dd]
Jun 30 2017 11:50:19 GMT: INFO (as): (signal.c:214) call stack: frame 4: /usr/bin/asd(as_tsvc_process_transaction+0x1f0) [0x4c2b4b]
Jun 30 2017 11:50:19 GMT: INFO (as): (signal.c:214) call stack: frame 5: /usr/bin/asd(run_tsvc+0x61) [0x4c3467]
Jun 30 2017 11:50:19 GMT: INFO (as): (signal.c:214) call stack: frame 6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4) [0x7fa318e420a4]
Jun 30 2017 11:50:19 GMT: INFO (as): (signal.c:214) call stack: frame 7: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fa317d2e87d]

We have tried several versions above 3.7.4.* (3.9, 3.10, 3.13) and reproduced the crash every time. It always happens just before migrations reach 2% complete after startup.
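For anyone comparing crashes across versions, here is a minimal Python sketch that pulls each failed assertion and the symbols of its call stack out of a log. It assumes the line layout seen in the excerpt above; the names `LOG_LINE`, `FRAME`, `SAMPLE` and `extract_crashes` are made up for this illustration and are not part of any Aerospike tool.

```python
import re

# Assumed line shape, taken from the excerpt in this thread:
#   "<Mon DD YYYY HH:MM:SS> GMT: <LEVEL> (<context>): (<file>:<line>) <message>"
LOG_LINE = re.compile(
    r"^\s*(?P<ts>\w{3} \d{2} \d{4} \d{2}:\d{2}:\d{2}) GMT: "
    r"(?P<level>[A-Z ]+?) \((?P<ctx>[^)]+)\): "
    r"\((?P<src>[^)]+)\) (?P<msg>.*)$"
)
# Matches "call stack: frame N: <binary>(<symbol>+0x<offset>)"; symbol may be empty.
FRAME = re.compile(r"call stack: frame (\d+): (\S+?)\((\w*)(?:\+0x[0-9a-f]+)?\)")


def extract_crashes(lines):
    """Return one dict per FAILED ASSERTION, with the frame symbols that follow it."""
    crashes = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        if m["level"] == "FAILED ASSERTION":
            crashes.append(
                {"time": m["ts"], "where": m["src"], "message": m["msg"], "frames": []}
            )
        elif crashes:
            f = FRAME.search(m["msg"])
            if f:
                # Frames without a symbol name are recorded as "?".
                crashes[-1]["frames"].append(f.group(3) or "?")
    return crashes


# Three lines copied from the trace above.
SAMPLE = """\
Jun 30 2017 11:50:19 GMT: FAILED ASSERTION (proto): (transaction.c:415) unexpected transaction origin 0
Jun 30 2017 11:50:19 GMT: INFO (as): (signal.c:214) call stack: frame 3: /usr/bin/asd(cf_fault_event+0x233) [0x5247dd]
Jun 30 2017 11:50:19 GMT: INFO (as): (signal.c:214) call stack: frame 4: /usr/bin/asd(as_tsvc_process_transaction+0x1f0) [0x4c2b4b]
""".splitlines()

for crash in extract_crashes(SAMPLE):
    print(crash["where"], crash["message"], crash["frames"])
```

Running it over the full aerospike.log from each version should make it easy to see whether the assertion always fires at the same source location.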

Thanks,


#4

Could you share your server config?

Also would you be able to accept a dev build to help pinpoint how this transaction reached the transaction queue?

asd.zero.origin.tar.gz (6.6 MB)


#5

# Aerospike database configuration file.

# service context definition
service {
  user root
  group root
  paxos-single-replica-limit 1
  pidfile /var/run/aerospike/asd.pid
  proto-fd-max 15000
  service-threads 4
  transaction-queues 4
  transaction-threads-per-queue 4
}

# logging context definition
logging {
  file /var/log/aerospike/aerospike.log {
    context any info
  }
}

# network context definition
network {
  service {
    address any
    port 3000
  }

  fabric {
    address any
    port 3001
  }

  info {
    address any
    port 3003
  }

  heartbeat {
    address 239.1.99.222
    interval 75
    mode multicast
    port 9918
    timeout 10
  }
}


# namespace context: rtb
namespace rtb {
  default-ttl 30d
  ldt-enabled true
  memory-size 100G
  replication-factor 2
  storage-engine device {
    device /dev/sda5
    device /dev/sdb5
    device /dev/sdc5
    write-block-size 128k
  }
}

The config file is above. I will get back to you on Monday, after negotiating with our production engineers to test your dev build.

Thanks,


#6

I suspect this is LDT related. Other paths are easy to rule out since they set the transaction origin quickly, but LDTs leave it set to 0 for a while. I wasn’t able to find how a transaction could reach a transaction queue while in this state, though. The dev build should show where the issue arises. You could try to reproduce it outside of production if you are able to run your LDT load against it. You could even start with a single-node cluster, but it may take more than one node to reproduce.

Also be aware that LDTs have been marked as deprecated for some time now. If your app still relies on them, know that 3.14.1.1 will be the final release to support them.
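To find which namespaces still have LDTs switched on, something like the following Python sketch could scan an aerospike.conf. It is a rough illustration that assumes the layout of the config posted above (opening brace on the same line as the block name, `#` for comments); `namespaces_with_ldt` and `SAMPLE_CONF` are hypothetical names, not an Aerospike utility.

```python
def namespaces_with_ldt(conf_text):
    """Return the names of namespace blocks that contain 'ldt-enabled true'."""
    found = []
    current = None  # namespace currently being read, if any
    depth = 0       # brace nesting depth
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        tokens = line.split()
        # A top-level "namespace <name> {" line opens a namespace block.
        if depth == 0 and tokens[0] == "namespace" and len(tokens) >= 2:
            current = tokens[1]
        # Only look at settings directly inside the namespace block.
        if current and depth == 1 and tokens[:2] == ["ldt-enabled", "true"]:
            found.append(current)
        depth += line.count("{") - line.count("}")
        if depth == 0:
            current = None
    return found


# Abbreviated from the config posted in this thread.
SAMPLE_CONF = """
# namespace context: rtb
namespace rtb {
  default-ttl 30d
  ldt-enabled true
  replication-factor 2
  storage-engine device {
    device /dev/sda5
    write-block-size 128k
  }
}
"""

print(namespaces_with_ldt(SAMPLE_CONF))
```

This only reports the setting; whether any records actually hold LDT bins has to be checked against the data itself.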


#7

Jul 18 2017 10:51:53 GMT: INFO (partition): (partition_balance.c:135) ALLOW MIGRATIONS
Jul 18 2017 10:51:57 GMT: INFO (ldt): (ldt_aerospike.c:737) E4 68 67 8D 36 00 52 53 8D 53 B2 51 EB 2A 00 00 00 00 00 00 Failed to open Sub Record rv=-3 241623454094380<Digest>:0xe468678d360052538d53b251eb2adbc15826942c
Jul 18 2017 10:51:57 GMT: WARNING (udf): (/opt/aerospike/sys/udf/lua/ldt/ldt_common.lua:749) [ERROR]<ldt_common_2014_12_20.A:openSubRec()> SubRec Open Failure: Digest(E4 68 67 8D 36 00 52 53 8D 53 B2 51 EB 2A 00 00 00 00 00 00) Parent Digest(NULL)
Jul 18 2017 10:51:59 GMT: FAILED ASSERTION (tsvc): (thr_tsvc.c:140) attempting to enqueue a tr with origin 0
Jul 18 2017 10:51:59 GMT: WARNING (as): (signal.c:210) SIGUSR1 received, aborting Aerospike Community Edition build 3.12.1.1-1-g6cc1100 os debian8
Jul 18 2017 10:51:59 GMT: INFO (as): (signal.c:214) call stack: found 11 frames
Jul 18 2017 10:51:59 GMT: INFO (as): (signal.c:214) call stack: frame 0: /usr/bin/asd(as_sig_handle_usr1+0x31) [0x485087]
Jul 18 2017 10:51:59 GMT: INFO (as): (signal.c:214) call stack: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x350e0) [0x7f2853f4a0e0]
Jul 18 2017 10:51:59 GMT: INFO (as): (signal.c:214) call stack: frame 2: /lib/x86_64-linux-gnu/libpthread.so.0(raise+0x2b) [0x7f285511879b]
Jul 18 2017 10:51:59 GMT: INFO (as): (signal.c:214) call stack: frame 3: /usr/bin/asd(cf_fault_event+0x233) [0x52480d]
Jul 18 2017 10:51:59 GMT: INFO (as): (signal.c:214) call stack: frame 4: /usr/bin/asd(as_tsvc_enqueue+0xa6) [0x4c28e3]
Jul 18 2017 10:51:59 GMT: INFO (as): (signal.c:214) call stack: frame 5: /usr/bin/asd(proxyer_handle_return_to_sender+0x103) [0x50ecd9]
Jul 18 2017 10:51:59 GMT: INFO (as): (signal.c:214) call stack: frame 6: /usr/bin/asd(proxy_msg_cb+0xf8) [0x50f70d]
Jul 18 2017 10:51:59 GMT: INFO (as): (signal.c:214) call stack: frame 7: /usr/bin/asd() [0x4d3a2a]
Jul 18 2017 10:51:59 GMT: INFO (as): (signal.c:214) call stack: frame 8: /usr/bin/asd() [0x4d4061]
Jul 18 2017 10:51:59 GMT: INFO (as): (signal.c:214) call stack: frame 9: /lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4) [0x7f28551110a4]
Jul 18 2017 10:51:59 GMT: INFO (as): (signal.c:214) call stack: frame 10: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f2853ffd87d]

We were able to test your dev build; the stack trace is above.


#8

Well, seeing as it happened immediately after an LDT record failure, my bet is definitely on LDT. You should get off LDT ASAP: because of the many issues with it, Aerospike will not offer that capability in future versions.
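If you want to check whether that pattern holds for the earlier crashes too, a small Python sketch along these lines could list the LDT-related log lines seen shortly before each failed assertion. It assumes the timestamp layout from the logs in this thread and English month names; the 5-second window and the names `parse_ts`, `ldt_events_before_assertions` and `SAMPLE_LOG` are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%b %d %Y %H:%M:%S"  # e.g. "Jul 18 2017 10:51:57"


def parse_ts(line):
    """Parse the leading 20-character timestamp; return None if the line has none."""
    try:
        return datetime.strptime(line.strip()[:20], TS_FORMAT)
    except ValueError:
        return None


def ldt_events_before_assertions(lines, window=timedelta(seconds=5)):
    """Pair each FAILED ASSERTION with LDT-related lines logged within `window` before it."""
    recent = []  # (timestamp, line) for LDT-related lines seen so far
    hits = []
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if " (ldt): " in line or "/ldt/" in line:
            recent.append((ts, line.strip()))
        elif "FAILED ASSERTION" in line:
            related = [l for t, l in recent if timedelta(0) <= ts - t <= window]
            hits.append((line.strip(), related))
    return hits


# Three lines copied from the dev-build log above.
SAMPLE_LOG = """\
Jul 18 2017 10:51:53 GMT: INFO (partition): (partition_balance.c:135) ALLOW MIGRATIONS
Jul 18 2017 10:51:57 GMT: INFO (ldt): (ldt_aerospike.c:737) E4 68 67 8D 36 00 52 53 8D 53 B2 51 EB 2A 00 00 00 00 00 00 Failed to open Sub Record rv=-3 241623454094380<Digest>:0xe468678d360052538d53b251eb2adbc15826942c
Jul 18 2017 10:51:59 GMT: FAILED ASSERTION (tsvc): (thr_tsvc.c:140) attempting to enqueue a tr with origin 0
""".splitlines()

for assertion, related in ldt_events_before_assertions(SAMPLE_LOG):
    print(assertion)
    for entry in related:
        print("  preceded by:", entry)
```

A nearby LDT error is only a hint, not proof of cause, but an assertion that never has one close by would argue against the LDT theory.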