SubRec Open Failure in v3.6.2

udf

#1

We are running Aerospike v3.6.2 (2 nodes) and ran into this error. It looks like it is related to clustering and most likely happened when one of our nodes went down. Thanks in advance for any pointers on what went wrong and how we can fix it.

Snippet from the logs:

Dec 18 02:17:02 stg-ldn2-2 docker/aerospike-527aad1-1: Dec 18 2015 02:17:02 GMT: INFO (ldt): (ldt_aerospike.c::752) 2D DD E2 21 1E 07 00 24 18 A5 B3 41 0C 84 00 00 00 00 00 00 Failed to open Sub Record rv=-3 181063941961951:0x2ddde2211e07002418a5b3410c84a4ad3bf138df
Dec 18 02:17:02 stg-ldn2-2 docker/aerospike-527aad1-1: Dec 18 2015 02:17:02 GMT: WARNING (udf): (/opt/aerospike/sys/udf/lua/ldt/lib_llist.lua::2567) [ERROR]<ldt_common_2014_12_20.A:openSubRec()> SubRec Open Failure: Digest(2D DD E2 21 1E 07 00 24 18 A5 B3 41 0C 84 00 00 00 00 00 00) Parent Digest(2D DD E2 21 B2 74 7B B5 18 C3 E7 3D C0 70 E7 FD 99 C7 A6 24)
Dec 18 02:17:03 stg-ldn2-2 docker/aerospike-527aad1-1: Dec 18 2015 02:17:03 GMT: INFO (udf): (/opt/aerospike/sys/udf/lua/ldt/lib_llist.lua::5540) [User ERROR]<lib_llist_2014_12_20.B:treeInsert()> Unique Key(65649) Violation


#2

Deepak,

Did you load the data on 3.6.2, or was the data loaded earlier and the cluster then upgraded to 3.6.2?

thanks – Raj


#3

Hi Raj,

I had data loaded on v3.6.2, no upgrades done.

Regards. Deepak.


#4

Hi Raj. The full logs are here; they might help narrow down the cause: https://dl.dropboxusercontent.com/u/10795357/aerospike%20error%20logs.txt


#5

Deepak,

Based on the shared log:

$ grep "SubRec Open Failure" "aerospike error logs.txt" | cut -d' ' -f 19-38 | uniq -c
   9 Digest(2D DD E2 21 10 07 00 24 18 4D 34 34 DD 83 00 00 00 00 00 00)
  19 Digest(2D DD E2 21 1E 07 00 24 18 A5 B3 41 0C 84 00 00 00 00 00 00)
  19 Digest(2D DD E2 21 21 07 00 24 18 41 5D 98 14 84 00 00 00 00 00 00)
  19 Digest(2D DD E2 21 22 07 00 24 18 71 81 28 18 84 00 00 00 00 00 00)


$ grep "SubRec Open Failure" "aerospike error logs.txt" | cut -d' ' -f 40-59 | uniq -c
  66 Digest(2D DD E2 21 B2 74 7B B5 18 C3 E7 3D C0 70 E7 FD 99 C7 A6 24)
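For longer logs, the same tally can be done without the shell pipeline. Here is a small, hypothetical Java sketch (class and method names are mine, not from the thread) that counts occurrences of the first Digest(...) value on each "SubRec Open Failure" line, matching what the grep | cut | uniq -c commands above extract:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigestTally {
    // Matches a Digest(...) group; find() picks the first one on the line,
    // which is the sub-record digest (the parent digest comes second).
    private static final Pattern DIGEST = Pattern.compile("Digest\\(([^)]+)\\)");

    public static Map<String, Integer> tally(Iterable<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            if (!line.contains("SubRec Open Failure")) continue;
            Matcher m = DIGEST.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] sample = {
            "... SubRec Open Failure: Digest(2D DD E2 21 1E 07) Parent Digest(2D DD E2 21 B2 74)",
            "... SubRec Open Failure: Digest(2D DD E2 21 1E 07) Parent Digest(2D DD E2 21 B2 74)",
            "... unrelated log line ..."
        };
        tally(java.util.Arrays.asList(sample)).forEach(
            (digest, n) -> System.out.println(n + " Digest(" + digest + ")"));
    }
}
```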

One of the records seems to have corrupted LDT tree linkages in it. Can you share a little more of the logs? I am looking for the LDT GC stats, which are printed at regular intervals in the log.

That said, I will have to investigate further what could be causing this. For now, if this affects a bounded number of records, you may want to delete them and write them again to fix it.

– Raj


#6

Deepak,

It would help if you could share the typical operations you were performing when the cluster view change was triggered.

– Raj


#7

Hi Raj,

Attaching all the logs I have from the 2 Aerospike instances: https://dl.dropboxusercontent.com/u/10795357/aerospike%20logs%201.txt & https://dl.dropboxusercontent.com/u/10795357/aerospike%20logs%202.txt

There were a lot of writes (and also reads) to the LDT during the cluster view change; could that have corrupted the LDT tree linkages?

Do these logs help reveal anything interesting?

I have cleaned up all the records and re-written them. I’ll try to replicate the error again later today by bringing down an instance of Aerospike and writing to the LDT during the cluster view change.

Regards. Deepak.


#8

Not much info in the log. Will let you know when I have some insights…

– Raj


#9

Hi Raj,

I am Deepak’s colleague. Here is the link to a small example in Scala to reproduce the LDT Sub Record Open Error using a cluster of two asd nodes: https://dl.dropboxusercontent.com/u/10795357/example-ldt-error.tar. To run the example, run the following commands in three different shells.

./start-cluster
sbt run
./kill-loop # kills and restarts nodes

We can reproduce the LDT Sub Record Open Error easily on clusters with more than one node and less frequently when running the example with a single node.

When running the code snippet below alongside a second loop that kills and restarts a single asd instance, we often encounter the LDT Sub Record Open Error Deepak described earlier once asd comes back up.

Is this a bug? Are there some config settings we can use to safely use the large list, e.g. enable-osync or a similar option? Or should we find a way to handle the LDT-Sub Record Open Error in our application instead?

Thanks, Thomas

import com.aerospike.client.{AerospikeClient, Key, Value}
import com.aerospike.client.policy.{CommitLevel, ConsistencyLevel, WritePolicy}

val c = new AerospikeClient("192.168.99.100", 3000)
val k = new Key("test", "demo-ldt", "ldt-key")
val wp = new WritePolicy
wp.commitLevel = CommitLevel.COMMIT_ALL
wp.consistencyLevel = ConsistencyLevel.CONSISTENCY_ALL

var epoch = 1L
while (true) {
    val l = c.getLargeList(wp, k, "list-bin")
    l.add(Value.get(epoch))
    epoch += 1
}

#10

Hi Raj,

My colleague Thomas and I have a small test case to reproduce the LDT Sub-Record error. Please let us know if you’ve had a chance to look at it.

Regards. Deepak.


#11

@deepakp and @thomas:

Thank you for posting about LDTs in our forum. Please see the LDT Feature Guide for current LDT recommendations and best practices.


#12

@deepakp, @thomas:

Effective immediately, we will no longer actively support the LDT feature and will eventually remove the API. The exact deprecation and removal timeline will depend on customer and community requirements. Instead of LDTs, we advise that you use our newer List and SortedMap APIs, which are now available in all Aerospike-supported clients at the General Availability level. Read our blog post for details.
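For readers migrating off LDTs, the recommended native List API can be used roughly as follows. This is a minimal, editor-added sketch using the Aerospike Java client's CDT list operations; it assumes a reachable node at 127.0.0.1:3000, and the namespace, set, key, and bin names are illustrative:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.ListOperation;
import com.aerospike.client.policy.WritePolicy;

public class NativeListExample {
    public static void main(String[] args) {
        // Assumes a running cluster node at 127.0.0.1:3000 (illustrative).
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        try {
            Key key = new Key("test", "demo", "list-key");
            WritePolicy wp = new WritePolicy();

            // Append to a native CDT list bin instead of an LDT large list.
            for (long epoch = 1L; epoch <= 5L; epoch++) {
                client.operate(wp, key, ListOperation.append("list-bin", Value.get(epoch)));
            }

            // Unlike LDTs, the whole list reads back with a plain get.
            Record record = client.get(null, key);
            System.out.println(record.getList("list-bin"));
        } finally {
            client.close();
        }
    }
}
```

Unlike a large list, a native list lives inside the record, so the record-size limits apply, but it participates in normal replication and migration rather than LDT sub-record linkage.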