Hi all Aerospikers ;-)
I’m in the process of choosing a NoSQL DB for a new project, and I’m looking for suggestions/ideas on how best to model my problem with Aerospike, which so far seems like the best solution.
I mainly have a large number (25 million) of small entries (around 150 GB in total) that reside in RAM (the complete scenario runs on a cluster).
I already have a working prototype where I index these entries via a secondary index, piped into a map-reduce algorithm (written in Lua + C) that filters the resulting entries further.
The filtered entries (say, less than 1%) are used to retrieve additional data recorded on an SSD, to narrow the search even more (applying an algorithm similar to the one used in the first step).
My scenario is 100% read (writes happen only at startup): the first step is a linear search on a secondary index in RAM plus map/reduce, and the second step is a (random) lookup on SSD of the data referenced by the first step’s results.
So far I’ve been able to model my first namespace as an in-memory (RAM) store and a second namespace as a device (SSD) store.
The first step seems fast enough for my goal.
Is there a good way to speed up the second step (randomly reading the additional data from SSD), maybe by sending more than one request in parallel? (Or maybe by having a single namespace with one set in RAM and another set on SSD; is that possible? It seems not!)
Right now I simply loop over the first step’s results, retrieving the primary key from RAM and using it to fetch the additional data, one record at a time, from the second set that resides on SSD.
Is there an entry point somewhere in the documentation that addresses my problem?
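For concreteness, here’s roughly what that loop looks like (I’m showing the Java client just for illustration; the namespace, set, and key values below are placeholders):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import java.util.List;

public class SerialLookup {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Primary keys produced by the first (in-memory) step; placeholders here.
        List<String> firstStepKeys = List.of("pk1", "pk2", "pk3");

        // One blocking read per key: each SSD lookup waits for the previous one.
        for (String pk : firstStepKeys) {
            Key key = new Key("ssd_ns", "details", pk);
            Record rec = client.get(null, key);
            if (rec != null) {
                // second-stage filtering on rec.bins would happen here
            }
        }
        client.close();
    }
}
```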
Thank you!
Angelo
Hi Angelo,
I think you are looking for batch reads, which parallelize reads for a set of known keys: http://www.aerospike.com/docs/guide/batch.html. This scales very well with the number of devices and nodes available, because record distribution in Aerospike is effectively random (no hot partitions) no matter what key you use. Enjoy!
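With the Java client, a batch read looks roughly like this (namespace, set, and key names are placeholders; `maxConcurrentThreads` controls how many nodes are contacted in parallel):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.BatchPolicy;

public class BatchReadExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Keys collected in the first (in-memory) step; placeholder values.
        String[] pks = {"pk1", "pk2", "pk3"};
        Key[] keys = new Key[pks.length];
        for (int i = 0; i < pks.length; i++) {
            keys[i] = new Key("ssd_ns", "details", pks[i]);
        }

        // 0 = issue the per-node batch requests all in parallel.
        BatchPolicy policy = new BatchPolicy();
        policy.maxConcurrentThreads = 0;

        // One call fans out across the cluster; results come back in key order,
        // with null entries for keys that don't exist.
        Record[] records = client.get(policy, keys);
        for (int i = 0; i < records.length; i++) {
            if (records[i] != null) {
                System.out.println(keys[i].userKey + " -> " + records[i].bins);
            }
        }
        client.close();
    }
}
```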
Cheers,
Manuel
Hi Manuel, yes, this is exactly what I was looking for!
Moreover (thanks for your patience!), is there any way to “transform” the results of this batch read into a stream, so I can use the same (or a similar) Lua map-reduce as in the first step?
As far as I know, a stream UDF can only be applied to a query on a secondary index.
Is there any way to hack around this step? Any pointer would be more than welcome.
Thanksssss!
Angelo
UPDATE: I forgot to mention that one way to push the computation to the server side might be parallel execution of record UDFs, but it seems that “Aerospike currently does not support Record UDF on Batch result”.
Hi Angelo,
I haven’t seen anything like that in code or documentation (defining a set of keys as stream input). However, the following might help you achieve that:
Typically you would model a secondary index to cover all the records you want to apply the UDF to. For example, for purchases you might have a numeric bin ‘customer id’ with a secondary index on it; you can then start a stream by querying for purchases with customer id = 123. That might not fit all cases, though. If you can express your ‘batch’ as a query on a secondary index (in-memory), you can explore those (advanced) features of Aerospike: one index can hold all customer ids, you define a single value or a range, and the query returns all records with a matching bin value (as stream input or returned raw to the client). A sketch of this pattern follows below.
If not, you could file a feature request. It might not help you in the short term, but in my experience the team has been very open to feature suggestions from developers, especially if you can provide a scenario description where other features can’t help you out.
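Roughly, with the Java client (the namespace, set, bin, and UDF names here are invented for illustration; the secondary index on `customer_id` is assumed to already exist, and API details vary slightly across client versions):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Value;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.ResultSet;
import com.aerospike.client.query.Statement;

public class StreamUdfExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Secondary-index query: all purchases for customer_id = 123.
        Statement stmt = new Statement();
        stmt.setNamespace("mem_ns");
        stmt.setSetName("purchases");
        stmt.setFilter(Filter.equal("customer_id", 123));

        // Pipe the matching records through a stream UDF; "filters" is the
        // registered Lua module and "aggregate_purchases" the stream function.
        ResultSet rs = client.queryAggregate(null, stmt,
                "filters", "aggregate_purchases", Value.get(42));
        try {
            while (rs.next()) {
                System.out.println(rs.getObject()); // reduced result(s)
            }
        } finally {
            rs.close();
        }
        client.close();
    }
}
```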
Cheers,
Manuel
Just to update the conversation: this has already been discussed as a feature request, “it” being applying UDFs to batch reads. It’s not on the roadmap, but there’s agreement that it’s a ‘nice to have’.