How to handle large lists: use LDT or index?


#1

We are new to Aerospike - so far our experience has been very positive! We are going to be using aerospike for updating about 50M member profiles while the users visit the site. We also want to then sync that data to another data store for additional processing/reporting. On any one day we might update 1M records and potentially update each one several times throughout the user’s visit. We would like to create a queue/list of profiles that need to be updated and at some time interval query the updated records from aerospike and copy the data to the other store.

The two ways I could think to do this are to create a bin for each record called “was updated” (or similar) and then, when I want to run the sync, query the members where was_updated was set. Read the records (I assume in a scan), save to the new store, then update the was_updated bin for those records so I won’t process them again.

The second possibility is to create a Large Data Type (list) with the keys of the records to be updated and then scan through the list and remove entries after processing.

Basically what I’m not sure of is 1) can the LDT handle large numbers of small entries (eg 1M) well and 2) will scanning an indexed bin be performant at this size and 3) which would perform better?

We also use a solution that will work for a 10-100x growth - the site is growing and if this works well, we will start using it for other data structures (content, and logging)


#2

Yehosef,

LDT can handle large number of small entries but is NOT suited for your cases as it will create single record hotspot for your entire logic and also it cannot scale out.

I would suggest you use secondary index for this use case. It is better to have a bin with “touch_timestamp” instead of was_updated (was update is low cardinality index) … and use range query to fetch all the records between certain time interval for sync. You need not worry about resetting the was_update bit also.

Every write should simply set the “touch_timestamp”.

– R


#3

@raj - thanks for the tips! very helpful.

I want to be able to scan through the documents that need updating because I’m assuming I’ll want to distribute that. so I need some bin that I can figure out that it was updated so I don’t do it twice. Perhaps I could use another bin for saved_timestamp and then get all the records where the saved_timestamp is not set or is less than the touched timestamp.


#4

Yehosef,

Record update is actually Record updated at time T. And was record updated query actually if record got update after certain time T’.

So if you store T you know if the record got update after/before certain time T’.

But whatever suits you best :smile:

– R


#5

Hi Raj,

To clarify, there are two writing processes happening here. We have the write that’s happening to Aerospike - that will be happening often. Then we want to copy that to the other datastore (elasticsearch) approx. every hour and I want to know which records have not be synced yet so I can copy them. After they are synced, I don’t want them to get processed again until the AS record is updated again.

So that’s what I meant by “saved_timestamp”. I think for clarity I might call them “updated_timestamp” (written to AS) and “sync_timestamp” (written to elasticsearch). If sync_timestamp is not set, it’s never been copied over and needs to be. If sync_timestamp is less than updated_timestamp, it means that there was a new change to the aerospike record from the last time it was copied to elasticsearch and it needs to be copied. After processing each record I would set the sync_timestamp to the current time and then it shouldn’t get copied again until there is an update.