'SubRec Open Failure' for Large List & How to deal with old data returning after restart


#1

Hello,

I’m facing the following problems with aerospike. I’ve listed them all in one post in case they are related. I’d appreciate any help to solve these. Thanks!

Information on my current setup:

  • 2 node cluster running on KVM based cloud servers:
    Ubuntu 14.04 LTS
    8GB RAM
    6 CPU cores

  • Go and Ruby clients

  • Versions we’ve used, upgraded using rolling restart:
    2015-08-07 : aerospike-server-community-3.5.15
    2015-11-11 : aerospike-server-community-3.6.4
    2016-01-14 : aerospike-server-community-3.7.1

  • Usage
    Our application provides information on sports events all over the world, and we store most of our data in aerospike now. We do a lot of inserts, updates and delets as new events get created, scores update and then when the event is finished.
    In terms of data we’re storing around 10GB right now, which is not much at all.
    For read and writes we average over a day around 600 writes per second and 200 reads per second.
    During peak hours we average around 2000 writes and 2000 reads per second.

  • We use Large List LDTs

  • asadm info output and aerospike.conf are at the bottom


1) ‘SubRec Open Failure’ errors for Large List LDT’s

This is the highest priority for right now, as this is a data corruption/accessibility issue. The only LTD we use is Large Lists. We are using Large Lists only in the sportsbook and sportsbook_current namespaces.

We are seeing the following errors in the aerospike logs and we get an error: /opt/aerospike/sys/udf/lua/ldt/ldt_common.lua:751: 1422:LDT-Sub Record Open Error in the client, so we are not able to access some of our lists. These are the log entries:

Feb 22 2016 11:02:26 GMT: INFO (ldt): (ldt_aerospike.c::744) 88 3E 54 76 F9 03 00 2B 8A 8F 04 CA B6 97 00 00 00 00 00 00 Failed to open Sub Record rv=-3 123342317701247<Digest>:0x883e5476f903002b8a8f04cab697702ddf08487f
Feb 22 2016 11:02:26 GMT: WARNING (udf): (/opt/aerospike/sys/udf/lua/ldt/lib_llist.lua::2567) [ERROR]<ldt_common_2014_12_20.A:openSubRec()> SubRec Open Failure: Digest(88 3E 54 76 F9 03 00 2B 8A 8F 04 CA B6 97 00 00 00 00 00 00) Parent Digest(88 3E 54 76 48 C0 3B 0B 8A 8A 73 C7 A2 DA 02 3D 3E 9B 62 DA)
Feb 22 2016 11:02:26 GMT: INFO (ldt): (ldt_aerospike.c::744) 88 3E 54 76 F9 03 00 2B 8A 8F 04 CA B6 97 00 00 00 00 00 00 Failed to open Sub Record rv=-3 123342317701247<Digest>:0x883e5476f903002b8a8f04cab697702ddf08487f
Feb 22 2016 11:02:26 GMT: WARNING (udf): (/opt/aerospike/sys/udf/lua/ldt/lib_llist.lua::2567) [ERROR]<ldt_common_2014_12_20.A:openSubRec()> SubRec Open Failure: Digest(88 3E 54 76 F9 03 00 2B 8A 8F 04 CA B6 97 00 00 00 00 00 00) Parent Digest(88 3E 54 76 48 C0 3B 0B 8A 8A 73 C7 A2 DA 02 3D 3E 9B 62 DA)

Also slightly different error message of Parent Digest(NULL):

Feb 24 2016 04:03:33 GMT: INFO (ldt): (ldt_aerospike.c::744) CD E0 48 3E 8F 14 2C 25 E8 EA 4C 0E EE 9D 00 00 00 00 00 00 Failed to open Sub Record rv=-3 67299400250553<Digest>:0xcde0483e8f142c25e8ea4c0eee9d3d355cd874b9
Feb 24 2016 04:03:33 GMT: WARNING (udf): (/opt/aerospike/sys/udf/lua/ldt/lib_llist.lua::3840) [ERROR]<ldt_common_2014_12_20.A:openSubRec()> SubRec Open Failure: Digest(CD E0 48 3E 8F 14 2C 25 E8 EA 4C 0E EE 9D 00 00 00 00 00 00) Parent Digest(NULL)
Feb 24 2016 04:07:17 GMT: INFO (ldt): (ldt_aerospike.c::744) A4 51 2D 8C E6 85 7E 25 38 DA 35 BE 91 AE 00 00 00 00 00 00 Failed to open Sub Record rv=-3 67298039605702<Digest>:0xa4512d8ce6857e2538da35be91ae3d350bbea9c6
Feb 24 2016 04:07:17 GMT: WARNING (udf): (/opt/aerospike/sys/udf/lua/ldt/lib_llist.lua::3840) [ERROR]<ldt_common_2014_12_20.A:openSubRec()> SubRec Open Failure: Digest(A4 51 2D 8C E6 85 7E 25 38 DA 35 BE 91 AE 00 00 00 00 00 00) Parent Digest(NULL)

From the research I’ve done it might be the bug referred to in SubRec Open Failure affecting LDT in server <3.5.15, however we started with version 3.5.15. From the release notes I assume this is the bug being referred to in the post : (LDT) [AER-3988] - Fixed race conditions during migration and replication of LDT records.

This issue fits our behaviour because we noticed the log errors after node restarts so when we had migrations. I’ve not been able to find any more detail on the bug, so if you could please help to confirm if our case is due to the bug, and the best way to prevent it from occuring.


2) Old updated / deleted data returning after a restart

This is a well documented/discussed issue on this forum and I understand is a design decision not a bug. It is causing me problems though so I would like to know the best way to handle this. Two examples where I need to update/delete data:

  • We have counters stored in aerospike that are updated frequently, for example every time we create a new event.
  • We have a list structure that keeps track of current events, and we need to remove completed events from this list.

I understand that data is only deleted from the in-memory index, and the data on disk is only reclaimed on defrag. So if you restart before the block on disk is defragged, the old data might return depending on which block gets scanned first. This is clear enough.

I’d like to understand the defrag process better - from what I’ve read the write process feeds the defrag queue with the old blocks. This is fine if it takes a few minutes (or even a day) for the blocks to be defragged. What I’m seeing with our counters is that they revert back to values that were used MONTHS ago. So it looks like some blocks will never get defragged, the old data will only “go away” if that block is overridden with new data. My assumption was that the write process fed old blocks to the defrag queue regardless of how full they are, or do they also have to meet the defrag-lwm-pct? (which would be really strange)

I’ve seen mentions to cold-start-empty, but I’m not sure this is a long term solution.


3) “writing pickled failed” errors during migration

We are seeing the following in the logs, but only after server restart and while migrations are ocurring. Once the migrations are finished we do not see it in the logs anymore. The only reference I’ve found online indicates this occurs when the node is out of room to write, however we have ample of memory and disk free. This is not a big worry at the moment as we’re not seeing any application impact as a result:

Feb 16 2016 06:46:21 GMT: INFO (rw): (thr_rw.c::2864) [NOTICE] writing pickled failed(-1):<Digest>:0x5294da82bda7d15f2562f846fb9fd7b33a080355
Feb 16 2016 06:46:21 GMT: INFO (rw): (thr_rw.c::2864) [NOTICE] writing pickled failed(-1):<Digest>:0x5294da82bda7d15f2562f846fb9fd7b33a080355
Feb 16 2016 06:46:21 GMT: INFO (rw): (thr_rw.c::2864) [NOTICE] writing pickled failed(-1):<Digest>:0x5294da82bda7d15f2562f846fb9fd7b33a080355
Feb 16 2016 06:46:21 GMT: INFO (rw): (thr_rw.c::2864) [NOTICE] writing pickled failed(-1):<Digest>:0x5294da82bda7d15f2562f846fb9fd7b33a080355


asadm info output:

root@xxxxxxxx:~# asadm
Aerospike Interactive Shell, version 0.0.13
Found 2 nodes
Online:  xxx.xxx.xxx.xxx:3000, xxx.xxx.xxx.xxx:3000

Admin> info
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node   Build   Cluster      Cluster     Cluster    Free   Free    Migrates   Principal    Objects      Uptime
   .       .      Size   Visibility   Integrity   Disk%   Mem%   (tx,rx,q)           .          .           .
li1    3.7.1         2   True         True           56     74   (0,0,0)     li7         21.478 M   248:16:37
li7    3.7.1         2   True         True           57     74   (0,0,0)     li7         21.478 M   151:27:10
Number of rows: 2

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node               Node                                Fqdn                     Ip   Client     Current     HB         HB
   .                 Id                                   .                      .    Conns        Time   Self    Foreign
li1    BB95E26A8913CF2    xxxxxxxxx.xxxxxxx.xxxxxx.com:3000   xxx.xxx.xxx.xxx:3000      196   193838087      0   11736901
li7    *BB9702D67913CF2   xxxxxxxxx.xxxxxxx.xxxxxx.com:3000   xxx.xxx.xxx.xxx:3000      198   193838087      0    7167085
Number of rows: 2

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node             Namespace   Avail%   Evictions      Master     Replica     Repl     Stop         Disk    Disk     HWM          Mem     Mem    HWM      Stop
   .                     .        .           .     Objects     Objects   Factor   Writes         Used   Used%   Disk%         Used   Used%   Mem%   Writes%
li1    sportsbook                36           0    10.291 M    10.343 M        2   false      9.142 GB      46      50     1.236 GB      31     60        90
li7    sportsbook                39           0    10.343 M    10.291 M        2   false      8.867 GB      45      50     1.231 GB      31     60        90
li1    sportsbook_xxxxxxxx       68           0   344.768 K   347.628 K        2   false    156.915 MB      31      50    59.277 MB      24     60        90
li7    sportsbook_xxxxxxxx       68           0   347.628 K   344.768 K        2   false    156.915 MB      31      50    59.277 MB      24     60        90
li1    sportsbook_current        82           0    56.220 K    56.866 K        2   false    124.928 MB      13      50   111.801 MB      11     60        90
li7    sportsbook_current        82           0    56.866 K    56.220 K        2   false    124.928 MB      13      50   111.800 MB      11     60        90
li1    sportsbook_zzzzzzzz       96           0     2.443 K     2.229 K        2   false      2.042 MB       2      50   966.358 KB       1     60        90
li7    sportsbook_zzzzzzzz       96           0     2.229 K     2.443 K        2   false      2.042 MB       2      50   966.458 KB       1     60        90
li1    sportsbook_static         95           0     1.957 K     1.858 K        2   false      1.744 MB       3      50   959.506 KB       2     60        90
li7    sportsbook_static         94           0     1.858 K     1.957 K        2   false      1.744 MB       3      50   958.015 KB       2     60        90
li1    sportsbook_volatile      N/E           0    14.587 K    14.471 K        2   false           N/E     N/E      50     2.875 MB       5     60        90
li7    sportsbook_volatile      N/E           0    14.471 K    14.587 K        2   false           N/E     N/E      50     2.875 MB       5     60        90
Number of rows: 12

aerospike.conf:

# Aerospike database configuration file.

service {
        user aerospike
        group aerospike
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        service-threads 4
        transaction-queues 4
        transaction-threads-per-queue 4
        proto-fd-max 15000
}

logging {
        # Log file must be an absolute path.
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address any
                port 3000
                access-address xxx.xxx.xxx.xxx #public IP of the server
        }

        heartbeat {
                # mode multicast
                # address 239.1.99.222
                # port 9918

                # To use unicast-mesh heartbeats, remove the 3 lines above, and see
                # aerospike_mesh.conf for alternative.
                mode mesh
                address 192.168.204.68 # heartbeat IP. even though config ref says only applies to multicast
                # interface-address 192.168.204.68 # config ref says sets IP for intracluster comms. think it's only for multicast
                port 3002 # Heartbeat port for this node.

                # List one or more other nodes, one ip-address & port per line:
                mesh-seed-address-port 192.168.204.68  3002 #aero-stg-01
                mesh-seed-address-port 192.168.204.105 3002 #aero-stg-02

                interval 150
                timeout 10
        }

        fabric {
                address 192.168.204.68
                port 3001
        }

        info {
                address 192.168.204.68
                port 3003
        }
}

# frequently accessed data that only needs to be in memory
namespace sportsbook_volatile {
        replication-factor 2
        memory-size 64M
        default-ttl 24H           # Expire data after 24 hours
        #high-water-disk-pct 50   # How full may the disk become before the server begins eviction (expiring records early). Dont need it for in memory only
        high-water-memory-pct 60  # How full may the memory become before the server begins eviction (expiring records early)
        stop-writes-pct 90        # How full may the memory become before we disallow new writes

        #ldt-enabled true         # Don't need large data types

        storage-engine memory     # No disk persistence, only stored in memory

}

# API and state data for current events
namespace sportsbook_current {
        replication-factor 2
        memory-size 1G
        default-ttl 180d          # Expire after 6 months
        high-water-disk-pct 50    # How full may the disk become before the server begins eviction (expiring records early)
        high-water-memory-pct 60  # How full may the memory become before the server begins eviction (expiring records early)
        stop-writes-pct 90        # How full may the memory become before we disallow new writes

        ldt-enabled true

        storage-engine device {
                file /opt/aerospike/data/sportsbook_current.dat
                filesize 1G
                data-in-memory true   # Store data in memory in addition to file.
                write-block-size 512K # adjust block size to make it efficient for SSDs.
                defrag-lwm-pct 60     # Defrag blocks that are less than this percentage filled. Higher value means more blocks will be defrag'd
        }
}

# relatively static data like markets and competitions
namespace sportsbook_static {
        replication-factor 2
        memory-size 64M
        default-ttl 0             # Dont expire/evict any data
        high-water-disk-pct 50    # How full may the disk become before the server begins eviction (expiring records early)
        high-water-memory-pct 60  # How full may the memory become before the server begins eviction (expiring records early)
        stop-writes-pct 90        # How full may the memory become before we disallow new writes

        ldt-enabled true

        storage-engine device {
                file /opt/aerospike/data/sportsbook_static.dat
                filesize 64M
                data-in-memory true   # Store data in memory in addition to file.
                write-block-size 128K # adjust block size to make it efficient for SSDs.
                defrag-lwm-pct 60     # Defrag blocks that are less than this percentage filled. Higher value means more blocks will be defrag'd
        }
}

# main sportsbook data store
namespace sportsbook {
        replication-factor 2
        memory-size 4G
        default-ttl 0             # Dont expire/evict any data
        high-water-disk-pct 50    # How full may the disk become before the server begins eviction (expiring records early)
        high-water-memory-pct 60  # How full may the memory become before the server begins eviction (expiring records early)
        stop-writes-pct 90        # How full may the memory become before we disallow new writes

        ldt-enabled true

        storage-engine device {
                file /opt/aerospike/data/sportsbook.dat
                filesize 20G
                #data-in-memory true   # Store data in memory in addition to file.
                write-block-size 512K # adjust block size to make it efficient for SSDs.
                defrag-lwm-pct 60     # Defrag blocks that are less than this percentage filled. Higher value means more blocks will be defrag'd
        }
}

# xxxxxxxx lookup tables
namespace sportsbook_xxxxxxxx {
        replication-factor 2
        memory-size 256M
        default-ttl 0             # Dont expire/evict any data
        high-water-disk-pct 50    # How full may the disk become before the server begins eviction (expiring records early)
        high-water-memory-pct 60  # How full may the memory become before the server begins eviction (expiring records early)
        stop-writes-pct 90        # How full may the memory become before we disallow new writes

        #ldt-enabled true

        storage-engine device {
                file /opt/aerospike/data/sportsbook_xxxxxxxx.dat
                filesize 512M
                data-in-memory true   # Store data in memory in addition to file.
                write-block-size 128K # adjust block size to make it efficient for SSDs.
                defrag-lwm-pct 60     # Defrag blocks that are less than this percentage filled. Higher value means more blocks will be defrag'd
        }
}

# zzzzzzzz lookup tables
namespace sportsbook_zzzzzzzz {
        replication-factor 2
        memory-size 128M
        default-ttl 0             # Dont expire/evict any data
        high-water-disk-pct 50    # How full may the disk become before the server begins eviction (expiring records early)
        high-water-memory-pct 60  # How full may the memory become before the server begins eviction (expiring records early)
        stop-writes-pct 90        # How full may the memory become before we disallow new writes

        ldt-enabled true

        storage-engine device {
                file /opt/aerospike/data/sportsbook_zzzzzzzz.dat
                filesize 128M
                data-in-memory true   # Store data in memory in addition to file.
                write-block-size 128K # adjust block size to make it efficient for SSDs.
                defrag-lwm-pct 60     # Defrag blocks that are less than this percentage filled. Higher value means more blocks will be defrag'd
        }
}