LDT Data corruption after aerospike restart (AER-4539)


#1

Hi everybody, I have a namespace with sorted ldt. In config I have option os_sync = true in device section and I use /dev/sdX. All works fine while I don not restart AS. after restart in logs AS write data load progress, and load it successfully. But when I try to apply my UDF during scan I have an error.

Oct 12 2015 19:03:24 GMT: WARNING (scan): (scan.c::390) send error - fd 440 sz 498 rv 262
Oct 12 2015 19:03:24 GMT: INFO (scan): (scan.c::710) finished basic scan job 86 (1)
Oct 12 2015 19:03:24 GMT: WARNING (scan): (scan.c::384) send error - fd 441 rv -1 Broken pipe
Oct 12 2015 19:03:24 GMT: WARNING (scan): (scan.c::384) send error - fd 160 rv -1 Broken pipe
Oct 12 2015 19:03:24 GMT: INFO (scan): (scan.c::710) finished basic scan job 88 (1)
Oct 12 2015 19:03:24 GMT: INFO (scan): (scan.c::710) finished basic scan job 87 (1)
Oct 12 2015 19:03:24 GMT: INFO (scan): (scan.c::647) starting basic scan job 89 {crawler-finder:finder-queue} priority 2, sample-pct 100, metadata-only, fail-on-cluster-change
Oct 12 2015 19:03:24 GMT: INFO (scan): (scan.c::647) starting basic scan job 90 {crawler-finder:finder-queue} priority 2, sample-pct 100, metadata-only, fail-on-cluster-change
Oct 12 2015 19:03:24 GMT: WARNING (scan): (scan.c::296) scan msg has unrecognized set updater-queue
Oct 12 2015 19:03:24 GMT: INFO (ldt): (ldt_aerospike.c::752) 3D 8D 0A BF AA DB A9 21 52 20 98 60 A2 CC 00 00 00 00 00 00 Failed to open Sub Record rv=-3 225646161227342<Digest>:0x3d8d0abfaadba92152209860a2cccd395748ee4e
Oct 12 2015 19:03:24 GMT: WARNING (udf): (/opt/aerospike/sys/udf/lua/ldt/lib_llist.lua::2567) [ERROR]<ldt_common_2014_12_20.A:openSubRec()> SubRec Open Failure: Digest(3D 8D 0A BF AA DB A9 21 52 20 98 60 A2 CC 00 00 00 00 00 00) Parent Digest(3D 8D 0A BF F1 DA AF 19 52 70 A2 67 E5 47 A5 56 91 CB 10 85)
Oct 12 2015 19:03:24 GMT: INFO (ldt): (ldt_aerospike.c::752) 3D 8D 0A BF AA DB A9 21 52 20 98 60 A2 CC 00 00 00 00 00 00 Failed to open Sub Record rv=-3 225646161227342<Digest>:0x3d8d0abfaadba92152209860a2cccd395748ee4e
Oct 12 2015 19:03:24 GMT: WARNING (udf): (/opt/aerospike/sys/udf/lua/ldt/lib_llist.lua::2567) [ERROR]<ldt_common_2014_12_20.A:openSubRec()> SubRec Open Failure: Digest(3D 8D 0A BF AA DB A9 21 52 20 98 60 A2 CC 00 00 00 00 00 00) Parent Digest(3D 8D 0A BF F1 DA AF 19 52 70 A2 67 E5 47 A5 56 91 CB 10 85)

.........
.........
Oct 12 2015 19:05:33 GMT: INFO (ldt): (ldt_aerospike.c::752) BB 50 52 52 11 F6 AC 21 E1 A5 6A 15 D7 D7 00 00 00 00 00 00 Failed to open Sub Record rv=-3 225647073254128<Digest>:0xbb50525211f6ac21e1a56a15d7d7cd398da55af0
Oct 12 2015 19:05:33 GMT: WARNING (udf): (/opt/aerospike/sys/udf/lua/ldt/lib_llist.lua::2567) [ERROR]<ldt_common_2014_12_20.A:openSubRec()> SubRec Open Failure: Digest(BB 50 52 52 11 F6 AC 21 E1 A5 6A 15 D7 D7 00 00 00 00 00 00) Parent Digest(BB 50 52 52 D2 54 E6 5B E1 C3 D9 37 89 A4 C5 53 C5 5C 8C DE)
Oct 12 2015 19:05:33 GMT: INFO (ldt): (ldt_aerospike.c::752) BB 50 52 52 11 F6 AC 21 E1 A5 6A 15 D7 D7 00 00 00 00 00 00 Failed to open Sub Record rv=-3 225647073254128<Digest>:0xbb50525211f6ac21e1a56a15d7d7cd398da55af0
Oct 12 2015 19:05:33 GMT: WARNING (udf): (/opt/aerospike/sys/udf/lua/ldt/lib_llist.lua::2567) [ERROR]<ldt_common_2014_12_20.A:openSubRec()> SubRec Open Failure: Digest(BB 50 52 52 11 F6 AC 21 E1 A5 6A 15 D7 D7 00 00 00 00 00 00) Parent Digest(BB 50 52 52 D2 54 E6 5B E1 C3 D9 37 89 A4 C5 53 C5 5C 8C DE)

And here a part of log from my scan
time="2015-10-12T22:10:34+03:00" level=warning msg="failed to execute llist.add operation" error="/opt/aerospike/sys/udf/lua/ldt/ldt_common.lua:751: 1422:LDT-Sub Record Open Error" ns=crawler-finder set=finder-queue urlInfo=map[url:http://aksmarket.ru/accessories/battery/Sony-Ericsson/J220/?change_city=%D0%AF%D1%80%D0%BE%D1%81%D0%BB%D0%B0%D0%B2%D0%BB%D1%8C prio:AZ target:linker_finder key:AZ-c2afa84ec026acc9981e4f797ed68d2b]
time="2015-10-12T22:10:34+03:00" level=warning msg="failed to execute llist.add operation" error="/opt/aerospike/sys/udf/lua/ldt/ldt_common.lua:751: 1422:LDT-Sub Record Open Error" ns=crawler-finder set=finder-queue urlInfo=map[url:http://aksmarket.ru/accessories/battery/Sony-Ericsson/J220/?change_city=%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D1%8F prio:AZ target:linker_finder key:AZ-755c513b20e60b3b797a6e80176c1a70]

before stop AS i create a back via as backup utility. But I can’t restore them.

root@spike:/back# asrestore --host 127.0.0.1 -d /back/linker

2015-10-12 21:00:16 GMT [INF] [ 2905] Starting restore to 127.0.0.1:3000 (bins: [all], sets: [all]) from /back/linker
2015-10-12 21:00:16 GMT [INF] [ 2905] Processing 1 node(s)
2015-10-12 21:00:16 GMT [INF] [ 2905] Found 98 backup file(s) in /back/linker
.....
2015-10-12 21:00:16 GMT [INF] [ 2905] Restoring 2 UDF file(s)
2015-10-12 21:00:16 GMT [INF] [ 2905] Restoring 0 secondary index(es)
2015-10-12 21:00:16 GMT [INF] [ 2905] Restoring records
2015-10-12 21:00:16 GMT [INF] [ 2925] Restoring /back/linker/BB9CD392E290C00_00054.asb
2015-10-12 21:00:16 GMT [INF] [ 2926] Restoring /back/linker/BB9CD392E290C00_00033.asb
2015-10-12 21:00:16 GMT [INF] [ 2924] Restoring /back/linker/BB9CD392E290C00_00028.asb
2015-10-12 21:00:16 GMT [INF] [ 2926] Opened backup file /back/linker/BB9CD392E290C00_00033.asb
2015-10-12 21:00:16 GMT [INF] [ 2927] Restoring /back/linker/BB9CD392E290C00_00026.asb
2015-10-12 21:00:16 GMT [INF] [ 2925] Opened backup file /back/linker/BB9CD392E290C00_00054.asb
2015-10-12 21:00:16 GMT [INF] [ 2929] Restoring /back/linker/BB9CD392E290C00_00067.asb
2015-10-12 21:00:16 GMT [INF] [ 2924] Opened backup file /back/linker/BB9CD392E290C00_00028.asb
2015-10-12 21:00:16 GMT [INF] [ 2928] Restoring /back/linker/BB9CD392E290C00_00029.asb
2015-10-12 21:00:16 GMT [INF] [ 2929] Opened backup file /back/linker/BB9CD392E290C00_00067.asb
2015-10-12 21:00:16 GMT [INF] [ 2930] Restoring /back/linker/BB9CD392E290C00_00068.asb
2015-10-12 21:00:16 GMT [INF] [ 2927] Opened backup file /back/linker/BB9CD392E290C00_00026.asb
2015-10-12 21:00:16 GMT [INF] [ 2932] Restoring /back/linker/BB9CD392E290C00_00074.asb
2015-10-12 21:00:16 GMT [INF] [ 2931] Restoring /back/linker/BB9CD392E290C00_00078.asb
2015-10-12 21:00:16 GMT [INF] [ 2932] Opened backup file /back/linker/BB9CD392E290C00_00074.asb
2015-10-12 21:00:16 GMT [INF] [ 2930] Opened backup file /back/linker/BB9CD392E290C00_00068.asb
2015-10-12 21:00:16 GMT [INF] [ 2928] Opened backup file /back/linker/BB9CD392E290C00_00029.asb
2015-10-12 21:00:16 GMT [INF] [ 2933] Restoring /back/linker/BB9CD392E290C00_00044.asb
2015-10-12 21:00:16 GMT [INF] [ 2931] Opened backup file /back/linker/BB9CD392E290C00_00078.asb
2015-10-12 21:00:16 GMT [INF] [ 2934] Restoring /back/linker/BB9CD392E290C00_00022.asb
2015-10-12 21:00:16 GMT [INF] [ 2933] Opened backup file /back/linker/BB9CD392E290C00_00044.asb
2015-10-12 21:00:16 GMT [INF] [ 2934] Opened backup file /back/linker/BB9CD392E290C00_00022.asb
2015-10-12 21:00:16 GMT [ERR] [ 2925] Encoded LDT size mismatch: 57032 vs. 5988 (line 18, col 6003)
2015-10-12 21:00:16 GMT [ERR] [ 2925] Error while parsing encoded LDT bin value
2015-10-12 21:00:16 GMT [ERR] [ 2925] Error while parsing record
2015-10-12 21:00:16 GMT [ERR] [ 2925] Error while restoring backup file /back/linker/BB9CD392E290C00_00054.asb (line 18)
2015-10-12 21:00:16 GMT [INF] [ 2936] Restoring /back/linker/BB9CD392E290C00_00061.asb
2015-10-12 21:00:16 GMT [INF] [ 2936] Opened backup file /back/linker/BB9CD392E290C00_00061.asb
2015-10-12 21:00:16 GMT [ERR] [ 2924] Encoded LDT size mismatch: 294448 vs. 31604 (line 10, col 31620)
2015-10-12 21:00:16 GMT [ERR] [ 2924] Error while parsing encoded LDT bin value
2015-10-12 21:00:16 GMT [ERR] [ 2924] Error while parsing record
2015-10-12 21:00:16 GMT [ERR] [ 2924] Error while restoring backup file /back/linker/BB9CD392E290C00_00028.asb (line 10)
2015-10-12 21:00:17 GMT [INF] [ 2923] 2 UDF file(s), 0 secondary index(es), 7 record(s) (38 KiB/s, 7 rec/s, 5603 B/rec, backed off: 0)
2015-10-12 21:00:17 GMT [INF] [ 2923] Expired 0 : skipped 0 : inserted 0 : failed 7 (existed 0, fresher 7)
2015-10-12 21:00:17 GMT [INF] [ 2923] 0% complete, ~8d10h9m57s remaining

here a part of my config:

namespace link {
    replication-factor 1
    memory-size 30G
    ldt-enabled true
    default-ttl 0

    storage-engine device {
        max-write-cache 128M
        device /dev/disk/by-partuuid/d3d9f76d-0543-4f38-b610-828b7633f04d
        enable-osync true
        scheduler-mode noop
        write-block-size 128K
        data-in-memory false
    }
}

Can anybody help me to understand where I made a mistake? and how I can to stop Aerospike correctly without data corruption?


#2

errors wich I have got when try to restore my asbackup on empty namespace

think my data was lost for ever :cry:


#3

The LDT type is not fully mature yet. We are working on improving those. Which version was this against? Make sure to update to the latest version (3.6.3) and use the latest backup/restore tool.


#4

I still have this troubles in 3.6.3 and also in 3.6.4. I have opened issue in your tracker with number AER-4539.


#5

Linking to @max83’s forum issue that originally led us to file JIRA ticket AER-4539: Asrestore: Restore data from .asb files (AER-4539)


#6

Again I found this error, My LDT corrupted, I did not stoped aersospike and did not use restore. It happen right in production server in working mode. Scan return this errors and I can`t access to the data. Please help to determine the cause of error. Again and again I lost my data, It is very critical for me.


#7

@max83,

Thank you for posting about LDTs in our forum. Please see the LDT Feature Guide for current LDT recommendations and best practices.


#8

@max83,

Effective immediately, we will no longer actively support the LDT feature and will eventually remove the API. The exact deprecation and removal timeline will depend on customer and community requirements. Instead of LDTs, we advise that you use our newer List and SortedMap APIs, which are now available in all Aerospike-supported clients at the General Availability level. Read our blog post for details.


#9

Large map (lmap) is deprecated too?