Secondary indexes very slow in cold restart

secondary
index

#1
Aug 12 2015 00:41:14 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {ssd} loaded 352892668 records, 0 subrecords, /mnt/disk2/aerospike-data/ssd.dat 15%
Aug 12 2015 00:41:16 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {ssd} loaded 353195842 records, 0 subrecords, /mnt/disk2/aerospike-data/ssd.dat 15%
Aug 12 2015 00:41:18 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {ssd} loaded 353498033 records, 0 subrecords, /mnt/disk2/aerospike-data/ssd.dat 15%
Aug 12 2015 00:41:20 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {ssd} loaded 353792425 records, 0 subrecords, /mnt/disk2/aerospike-data/ssd.dat 15%
Aug 12 2015 00:41:22 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {ssd} loaded 354082077 records, 0 subrecords, /mnt/disk2/aerospike-data/ssd.dat 15%
Aug 12 2015 00:41:24 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {ssd} loaded 354381838 records, 0 subrecords, /mnt/disk2/aerospike-data/ssd.dat 15%
Aug 12 2015 00:41:26 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {ssd} loaded 354676621 records, 0 subrecords, /mnt/disk2/aerospike-data/ssd.dat 15%
Aug 12 2015 00:41:28 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {ssd} loaded 354978549 records, 0 subrecords, /mnt/disk2/aerospike-data/ssd.dat 15%
Aug 12 2015 00:41:30 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {ssd} loaded 355274348 records, 0 subrecords, /mnt/disk2/aerospike-data/ssd.dat 15%

2.3 million records in 16 secs

Aug 12 2015 00:43:34 GMT: INFO (sindex): (secondary_index.c::5087)  Sindex-ticker: ns=ssd si=<all> obj-scanned=500000 si-mem-used=13367686 progress=0% est-time=84800834 ms
Aug 12 2015 00:43:34 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /mnt/disk2/aerospike-data/ssd.dat: used 98481304832, contig-free 609628M (4877029 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 1019 (15.4/s), defrag-q 14791 defrag-tot 22259 (0.0/s)
Aug 12 2015 00:43:54 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /mnt/disk2/aerospike-data/ssd.dat: used 98481304832, contig-free 609737M (4877903 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 1393 (18.7/s), defrag-q 13543 defrag-tot 22259 (0.0/s)
Aug 12 2015 00:44:14 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /mnt/disk2/aerospike-data/ssd.dat: used 98481304832, contig-free 609842M (4878739 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 1797 (20.2/s), defrag-q 12303 defrag-tot 22259 (0.0/s)
Aug 12 2015 00:44:34 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /mnt/disk2/aerospike-data/ssd.dat: used 98481304832, contig-free 609937M (4879502 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 2206 (20.5/s), defrag-q 11131 defrag-tot 22259 (0.0/s)
Aug 12 2015 00:44:54 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /mnt/disk2/aerospike-data/ssd.dat: used 98481304832, contig-free 610020M (4880162 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 2600 (19.7/s), defrag-q 10077 defrag-tot 22259 (0.0/s)
Aug 12 2015 00:45:14 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /mnt/disk2/aerospike-data/ssd.dat: used 98481304832, contig-free 610107M (4880860 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 3061 (23.0/s), defrag-q 8919 defrag-tot 22259 (0.0/s)
Aug 12 2015 00:45:34 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /mnt/disk2/aerospike-data/ssd.dat: used 98481304832, contig-free 610192M (4881536 wblocks), swb-free 15, n-w 0, w-q 0 w-tot 3550 (24.5/s), defrag-q 7753 defrag-tot 22259 (0.0/s)
Aug 12 2015 00:45:34 GMT: INFO (sindex): (secondary_index.c::5087)  Sindex-ticker: ns=ssd si=<all> obj-scanned=1000000 si-mem-used=26554061 progress=0% est-time=85245171 ms

0.5 million in 120 secs

From what i understand, The first set of logs are for loading into memory, the primary index and the 2nd set of logs are for creating secondary indexes. The estimated time for completing the sec index step is 1 day and this is just unacceptable. Does this mean we have some bad configuration?

Also I would like to know what the percentages mean here? The primary index loading part is completed at 15%. Then it started the secondary index part.


Elapsed time to build secondary index in cold restart vs elapsed time to write its data
#2

The percentage of 15% in the primary index is actually percentage of disk read to load the index to memory. Completed at 15% would mean that it didn’t find any additional data to load past 15% of the disk.

We’d like to know more about your secondary index and your storage configuration. We see storage device configured as /mnt/disk2/aerospike-data/ssd.dat . Would that be for a storage in memory with persistence to disk? Could you share your storage config and type of disk being used? Possibly post an anonymized version of your aerospike.conf?


#3

Here is my config. We have 2 aerospike nodes with 700GB ssd storage and 50GB memory each. We have around 350M records. and secondary index on ~150 million of them. It took ~27 hours to restart each server. Most of the time is spent on secondary index loading.

service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /pathto/pid_file
    service-threads 8
    transaction-queues 8
    transaction-threads-per-queue 8
    proto-fd-max 15000
}

logging {
    # Log file must be an absolute path.
    file /pathto/log_file {
        context any info
    }
}

network {
    service {
        address any
        port 3000
    }

    heartbeat {
        mode mesh
        port 3002 # Heartbeat port for this node.

        # List one or more other nodes, one ip-address & port per line:
        mesh-seed-address-port other_node_ip 3002
        interval 250
        timeout 25
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}

namespace ssd {
    replication-factor 2
    memory-size 50G
    default-ttl 0 # 30 days, use 0 to never expire/evict.
    stop-writes-pct 95
    high-water-disk-pct 75


    # Warning - legacy data in defined raw partition devices will be erased.
    # These partitions must not be mounted by the file system.
    storage-engine device {
        # Use one or more lines like those below with actual device paths.
        file /pathto/data_file
        filesize 700G
        # The 2 lines below optimize for SSD.
        scheduler-mode noop
        write-block-size 128K

        # Use the line below to store data in memory in addition to devices.
        data-in-memory false
    }
}

#4

Can you clarify what type of device is /pathto/data_file? Is it a rotational disk?


#5

We have an SSD of size 800GB on each node. 700GB is allotted to aerospike in each node.


#6

Hi yestapea,

We are trying to reproduce this scenario.

Can you also tell us -

  1. How many secondary indexes do you have ?
  2. What is the average number of unique secondary index keys per secondary index ?

Thanks


#7

Hi pratyyy,

Around 90-150 M (M=million) of them have secondary indexes. This is across 6 sets with 3M,15M,20M,25M,32M,60M keys respectively. Im not sure if we had a secondary index for the 60M table (and hence the range of 90-150M). Each secondary key would map to around 2-10 records (with mean and median around 5)

I removed the secondary indexes for many of these and moving to new sets with reverse mappings. Nevertheless I would like to know why is this slow.


#8

Hi yestpea,

When you reboot aerospike, it issues a namespace scan to populate all the secondary indexes. This scan will do 1 random i/o per record. In your case it is 300M. Ideally it should not take 27 hours to scan the namespace and populate all the secondary indexes. So there is something wrong going on here. We are trying to figure out this. So far, I have failed to reproduce this locally.

From what I understand you have around 6 secondary indexes in 6 different sets, right ?

FYI : You can use following command in aql to know the number of sindexes you have.

aql > show indexes

To move further can you tell us -

  1. What is your average record size ?

  2. What type of data you were indexing (string or integer). If string, what was the average size of the string.


#9

Hi pratyyy,

All the sets with secondary indexes have string key, string values. Length of both key and value have length around 40-50


#10

Hi yesteapea,

Just to make sure if I am understanding it right, your record size is around 100 bytes.

Lets segregate the time taken to scan the namespace from time taken to update secondary index. So can you tell us the time taken to scan the namespace ?

Thanks


#11

From what i understand the first set of logs I mentioned in my original post is scanning namespace for creation of primary indexes for the sets and the second set of logs are while creation of secondary indexes. Can you check the logs and confirm if that is the case?

(above for primary index creation/namespace scan)

(above for sec index creation)

I dont know how to figure how much time in the secondary index creation is scanning and what part of it actual secondary index creaiton. Can you tell me how to find that out?

Or you meant to ask the total time for primary index creation and secondary index creation?

Total time for primary index creation : ~ 30 mins

Total time for sec index creation : ~ 26 hours


#12

Yes, that is correct.


#13

The first line represents the time taken to iterate the disk and then populate primary index. In this case server will read the disk serially. I am looking for the time taken to scan the namespace. In this process server will iterate through primary index and read the disk randomly.

To calculate this time you can issue a namespace scan. http://www.aerospike.com/docs/guide/scan.html The simplest way to run a scan is through ‘aql’

Run aql in mute mode to avoid printing the output on screen.

aql -M

aql> set timeout 1000000

aql> select * from namespace_name