How to handle large lists: use LDT or index?

yehosef · June 23, 2015, 9:02am

We are new to Aerospike - so far our experience has been very positive! We are going to be using aerospike for updating about 50M member profiles while the users visit the site. We also want to then sync that data to another data store for additional processing/reporting. On any one day we might update 1M records and potentially update each one several times throughout the user’s visit. We would like to create a queue/list of profiles that need to be updated and at some time interval query the updated records from aerospike and copy the data to the other store.

The two ways I could think to do this are to create a bin for each record called “was updated” (or similar) and then, when I want to run the sync, query the members where was_updated was set. Read the records (I assume in a scan), save to the new store, then update the was_updated bin for those records so I won’t process them again.

The second possibility is to create a Large Data Type (list) with the keys of the records to be updated and then scan through the list and remove entries after processing.

Basically what I’m not sure of is 1) can the LDT handle large numbers of small entries (eg 1M) well and 2) will scanning an indexed bin be performant at this size and 3) which would perform better?

We also use a solution that will work for a 10-100x growth - the site is growing and if this works well, we will start using it for other data structures (content, and logging)

raj · June 24, 2015, 5:15am

Yehosef,

LDT can handle large number of small entries but is NOT suited for your cases as it will create single record hotspot for your entire logic and also it cannot scale out.

I would suggest you use secondary index for this use case. It is better to have a bin with “touch_timestamp” instead of was_updated (was update is low cardinality index) … and use range query to fetch all the records between certain time interval for sync. You need not worry about resetting the was_update bit also.

Every write should simply set the “touch_timestamp”.

– R

yehosef · June 24, 2015, 7:56am

@raj - thanks for the tips! very helpful.

I want to be able to scan through the documents that need updating because I’m assuming I’ll want to distribute that. so I need some bin that I can figure out that it was updated so I don’t do it twice. Perhaps I could use another bin for saved_timestamp and then get all the records where the saved_timestamp is not set or is less than the touched timestamp.

raj · June 24, 2015, 8:29am

Yehosef,

Record update is actually Record updated at time T. And was record updated query actually if record got update after certain time T’.

So if you store T you know if the record got update after/before certain time T’.

But whatever suits you best

– R

yehosef · June 24, 2015, 9:45am

Hi Raj,

To clarify, there are two writing processes happening here. We have the write that’s happening to Aerospike - that will be happening often. Then we want to copy that to the other datastore (elasticsearch) approx. every hour and I want to know which records have not be synced yet so I can copy them. After they are synced, I don’t want them to get processed again until the AS record is updated again.

So that’s what I meant by “saved_timestamp”. I think for clarity I might call them “updated_timestamp” (written to AS) and “sync_timestamp” (written to elasticsearch). If sync_timestamp is not set, it’s never been copied over and needs to be. If sync_timestamp is less than updated_timestamp, it means that there was a new change to the aerospike record from the last time it was copied to elasticsearch and it needs to be copied. After processing each record I would set the sync_timestamp to the current time and then it shouldn’t get copied again until there is an update.

Topic		Replies	Views
Modelling large data structures Data Modeling	1	1979	July 23, 2016
Aerospike for large objects - LDT & LLIST (looking for alternatives to MongoDB and S3) Aerospike and other Databases llist , ldt	13	5943	September 28, 2015
Timeseries Data structure question Data Modeling	4	3822	September 5, 2015
How do I store the equivalent of Redis? PHP Client Library	7	2704	October 22, 2015
Request: Java client LargeList async methods Rejected Requests ldt	2	1731	March 19, 2018

How to handle large lists: use LDT or index?

Related topics