Paging through the entirety of a large dataset

I’m working on a machine learning project doing Bayesian classification based on the text of tens of millions of documents. I need a high-performance cache of summarized documents so I can pull them en masse and pass them to the Bayesian classifier.

The document summaries score each ngram from the document and look something like this: { canonical_id: 123, terms: { 'Quick': 4.3212, 'Quick brown': 10.2039, 'brown': 6.2343, 'brown fox': 12.39343 } }

In many cases the documents only have several words (so maybe 10-30 entries in the terms dictionary), but some might be longer, with hundreds of entries in the dictionary.
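
For context, I’d expect to store each summary as a single record with the terms dictionary as a map bin, roughly like this (the namespace and set names are just placeholders):

    import aerospike

    # Placeholder cluster address - adjust for your deployment.
    config = {'hosts': [('127.0.0.1', 3000)]}
    client = aerospike.client(config).connect()

    # Key the record by canonical_id; the terms dict becomes a map bin.
    key = ('test', 'doc_summaries', 123)
    client.put(key, {
        'canonical_id': 123,
        'terms': {'Quick': 4.3212, 'Quick brown': 10.2039,
                  'brown': 6.2343, 'brown fox': 12.39343},
    })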

When querying the data I’ll be paging through the entire dataset in large batches (10,000 - 100,000 records) without any query filters, and my main concern is maximum performance when retrieving each batch. I’d be using the Python Aerospike client.

My questions are:

  • Has anyone used Aerospike for this type of application? At a glance it seems like the focus on hybrid RAM+SSD storage would be very useful for large data sets that can’t fit in RAM. Obviously benchmarking against my own data set is the only way to know for sure how it will perform, but I’m still interested in other folks’ experience.
  • What is the best approach for paging through an entire data set? At a glance it looks like either getting the max and min IDs and doing a series of range queries, or doing batch queries and passing in lists of IDs, could work, but I’m not sure about the pros and cons of each approach in practice.

Aerospike does not support range queries on the primary index, so to do lookups by range you have to create a secondary index. But given that you already know your IDs, as I understand from your problem statement, batch reads are also an option.
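
A rough sketch of the batch approach with the Python client (the namespace, set, and key range are placeholders, assuming sequential integer keys):

    import aerospike

    config = {'hosts': [('127.0.0.1', 3000)]}
    client = aerospike.client(config).connect()

    def batch_pages(namespace, set_name, min_id, max_id, page_size=10000):
        """Yield pages of records by issuing batch reads over known key ranges."""
        for start in range(min_id, max_id + 1, page_size):
            keys = [(namespace, set_name, doc_id)
                    for doc_id in range(start, min(start + page_size, max_id + 1))]
            # get_many returns (key, meta, bins) tuples; meta is None for missing keys.
            records = client.get_many(keys)
            yield [bins for _, meta, bins in records if meta is not None]

    for page in batch_pages('test', 'doc_summaries', 0, 99999):
        pass  # hand the page of summaries to the classifier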

A query (with a secondary index) would be much better at utilizing the I/O parallelism that SSDs provide, so in terms of latency it should beat batch reads. The flip side is that you need DRAM for the secondary index (which, like the primary index, is stored only in RAM).
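
For the query approach, a minimal sketch would be to index the canonical_id bin and page with between() predicates (the bin and index names are just illustrations):

    import aerospike
    from aerospike import predicates as p

    config = {'hosts': [('127.0.0.1', 3000)]}
    client = aerospike.client(config).connect()

    # One-time setup: secondary index on the integer canonical_id bin.
    client.index_integer_create('test', 'doc_summaries', 'canonical_id', 'idx_canonical_id')

    def range_pages(namespace, set_name, min_id, max_id, page_size=10000):
        """Yield pages of records via secondary-index range queries."""
        for start in range(min_id, max_id + 1, page_size):
            query = client.query(namespace, set_name)
            query.where(p.between('canonical_id', start,
                                  min(start + page_size, max_id + 1) - 1))
            # results() collects (key, meta, bins) tuples for matching records.
            yield [bins for _, _, bins in query.results()]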

– R