Paging through the entirety of a large dataset

I’m working on a machine learning project doing Bayesian classification based on the text of tens of millions of documents. I need a high-performance cache of summarized documents so I can pull them en masse and pass them to the Bayesian classifier.

The document summaries score each ngram from the document and look something like this: { canonical_id: 123, terms: { 'Quick': 4.3212, 'Quick brown': 10.2039, 'brown': 6.2343, 'brown fox': 12.39343 } }

In many cases the documents only have several words (so maybe 10-30 entries in the terms dictionary), but some might be longer, with hundreds of entries in the dictionary.
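
For context, I’d expect to store each summary as a single record with the terms dictionary as a map bin, roughly like this (the namespace and set names are just placeholders):

    import aerospike

    # Placeholder cluster address - adjust for your deployment.
    config = {'hosts': [('127.0.0.1', 3000)]}
    client = aerospike.client(config).connect()

    # Key the record by canonical_id; the terms dict becomes a map bin.
    key = ('test', 'doc_summaries', 123)
    client.put(key, {
        'canonical_id': 123,
        'terms': {'Quick': 4.3212, 'Quick brown': 10.2039,
                  'brown': 6.2343, 'brown fox': 12.39343},
    })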

When querying the data I’ll be paging through the entire dataset in large batches (10,000 - 100,000 records) without any query filters, and my main concern is maximum performance when retrieving each batch. I’d be using the Python Aerospike client.

My questions are:

  • Has anyone used Aerospike for this type of application? At a glance it seems like the focus on hybrid RAM+SSD storage would be very useful for large data sets that can’t fit in RAM. Obviously benchmarking against my own data set is the only way to know for sure how it will perform, but I’m still interested in other folks’ experience.
  • What is the best approach for paging through an entire data set? At a glance it looks like either getting the max and min IDs and doing a series of range queries, or doing batch queries and passing in lists of IDs, could work, but I’m not sure about the pros and cons of each approach in practice.

Aerospike does not support range queries on the primary index, so to do lookups by range you have to create a secondary index. But given that you already know your IDs, as I understand from your problem statement, batch reads are also an option.
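
A rough sketch of the batch approach with the Python client (the namespace, set, and key range are placeholders, assuming sequential integer keys):

    import aerospike

    config = {'hosts': [('127.0.0.1', 3000)]}
    client = aerospike.client(config).connect()

    def batch_pages(namespace, set_name, min_id, max_id, page_size=10000):
        """Yield pages of records by issuing batch reads over known key ranges."""
        for start in range(min_id, max_id + 1, page_size):
            keys = [(namespace, set_name, doc_id)
                    for doc_id in range(start, min(start + page_size, max_id + 1))]
            # get_many returns (key, meta, bins) tuples; meta is None for missing keys.
            records = client.get_many(keys)
            yield [bins for _, meta, bins in records if meta is not None]

    for page in batch_pages('test', 'doc_summaries', 0, 99999):
        pass  # hand the page of summaries to the classifier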

A query (with a secondary index) would be much better at utilizing the I/O parallelism that SSDs provide, so in terms of latency it should beat batch reads. The flip side is that you need DRAM for the secondary index (which, like the primary index, is stored only in RAM).
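
For the query approach, a minimal sketch would be to index the canonical_id bin and page with between() predicates (the bin and index names are just illustrations):

    import aerospike
    from aerospike import predicates as p

    config = {'hosts': [('127.0.0.1', 3000)]}
    client = aerospike.client(config).connect()

    # One-time setup: secondary index on the integer canonical_id bin.
    client.index_integer_create('test', 'doc_summaries', 'canonical_id', 'idx_canonical_id')

    def range_pages(namespace, set_name, min_id, max_id, page_size=10000):
        """Yield pages of records via secondary-index range queries."""
        for start in range(min_id, max_id + 1, page_size):
            query = client.query(namespace, set_name)
            query.where(p.between('canonical_id', start,
                                  min(start + page_size, max_id + 1) - 1))
            # results() collects (key, meta, bins) tuples for matching records.
            yield [bins for _, _, bins in query.results()]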

– R