Complex big data modeling for low latency queries (huge throughput - over 20K read/write per second)

query
modeling

#1

In one set I’ve got over 1b items (key = uuid) which have some attributes (e.g. color, size, cost…).

On the other set I’ve got milions of orders (key = orderId) with some meta data (e.g. date, client…)

And on the final set I’ve got those 1b items (key = uuid) and bin = orders where I have comma separated orderId’s.

So far everything was ok - the only queries I made was: return all orders for given item.

But now I have to make complex quries like:

  • get count of items for all orders
  • get items that are part of order A and order B
  • get count of red items for all orders
  • get red items which cost is more than $1000 and that are part of order A and order B

All that queries cannot take more than few seconds.

Any ideas? Or maybe AS isn’t the best choice?


#2

Aerospike is extremely, consistently fast for key-value lookups which are the highest priority items in many use-cases it’s used for. Secondary indexes are also fast, with the information about the records being stored in RAM on the server nodes. However, the cardinality of the result set affects the round trip time to the client – obviously returning 1,000,000 records to the client will be slower than returning 10.

That being said, there are ways of using Aerospike to model complex relationships which allow complex queries to execute quickly. The way you model the data is key and makes a huge difference in performance, and really relies on your knowing your Use Case queries up front so yo can model the data in the best way. Denormalizing the data will often help, so you have redundancy which will give you the exact value you’re after.

For example, your first query: get count of items for all orders. If you mean “how many items does each order have?”, you could simply store the count of items associated with each row in the order set.

If you also stored the item keys as a list on the orders, then getting the items that were a part of order A and order B would be simple – just load the 2 orders and compare the list of items. Notice the redundancy here – for the above query we stored the count of items in the list, here we’re storing the list. This is for efficiency of lookup per the particular queries.

A secondary index on the color of the item would allow you to get the count of red items for all orders.

Your final query: get red items which cost more that $1000 and are part of order A and order B could be done by loading orders A and B and scanning through the items in the list.

This assumes the cardinalities of the items per order makes these feasible. If not, there are other ways of modeling the data. Aerospike can perform complex queries efficiently – assuming the data is modeled properly. Secondary indexes can help greatly, as can using aggregate UDF functions to perform additional filtering of data.

One further note about your implementation – it might be worth storing the order ids held against an item in a list rather than in a comma separated string. That way, if you want to use a secondary index to search for a item in the list it becomes feasible. I’m also curious why you have sets which contain the items – why not just have one set with the bins off both sets?

Hope this helps

Tim Faulkes, Senior Solutions Architect, Aerospike