Modelling large data structures


#1

Hi,

We are trying to use Aerospike with the following use-case, and are looking for feedback/recommendations on how we are thinking of modelling the data.

In our use-case, we have a large number of “entities” (initially several hundred thousand, but potentially millions). Each entity has an arbitrary number of “data-contexts” associated with it, and each data-context is essentially a binary blob (potentially quite large, perhaps 50-100 KB each).

The basic process is that an entity and all of its data-contexts are retrieved from Aerospike, processed by the application, and the entity and any modified data-contexts are stored back.

We expect a very high rate of queries/updates across the data set, but a small number of concurrent accesses to a single entity.

What would be the most efficient way to achieve this? We have experimented with passing the data-contexts and the record generation count as parameters to a record UDF. The UDF checks the record’s generation count to prevent concurrent modification, and uses an LDT bin to store chunks of each data-context in a map. We have run into an issue with this approach where the server’s Lua cache eventually uses all available memory and crashes the node, possibly because of the amount of data we are passing?
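To make the chunking part of this concrete, here is a minimal sketch (plain Python, no Aerospike API involved) of splitting one data-context blob into fixed-size pieces keyed by chunk index, as the UDF would store them in a map. The names `CHUNK_SIZE`, `chunk_blob`, and `reassemble` are illustrative, not part of any client or UDF API:

```python
CHUNK_SIZE = 8 * 1024  # 8 KB per chunk; tune to your payloads

def chunk_blob(blob: bytes, chunk_size: int = CHUNK_SIZE) -> dict:
    """Split a blob into a {chunk_index: bytes} map."""
    return {
        i // chunk_size: blob[i:i + chunk_size]
        for i in range(0, len(blob), chunk_size)
    }

def reassemble(chunks: dict) -> bytes:
    """Rebuild the original blob from its chunk map."""
    return b"".join(chunks[i] for i in sorted(chunks))

blob = b"x" * (50 * 1024)   # a 50 KB data-context
chunks = chunk_blob(blob)
assert reassemble(chunks) == blob
assert len(chunks) == 7     # ceil(50 KB / 8 KB)
```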

Some additional questions that have been raised are:

  • Are LDT bins suitable for frequent reads and updates?
  • Are there any performance guidelines or benchmarks available when passing large amounts of data to and from Aerospike?

#2

LDTs are not really recommended, and you don’t need them for data of that size.

Store a record for each entity, with any general information you need, and use a list or map bin that stores the primary keys of all the associated data-contexts. Then store the data-contexts as individual records, perhaps in another set for organization.

You can do one operation to look up the entity record and set a value on a bin as a lock, then read all the primary keys for the data-contexts and fetch them efficiently with a batch get. Do your processing, store them all again, and remove or unset the lock bin value on the entity record.
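A rough sketch of that flow, using an in-memory dict in place of the namespace so it is self-contained. The comments note which Aerospike client call each step corresponds to; all key/bin names (`lock`, `ctx_keys`, etc.) are illustrative, and in practice you would set the lock with a generation-check write policy so the check-and-set is atomic:

```python
# A dict stands in for the namespace: key -> record (a dict of bins).
store = {
    "entity:42": {"lock": 0, "ctx_keys": ["ctx:42:0", "ctx:42:1"]},
    "ctx:42:0": {"data": b"blob-0"},
    "ctx:42:1": {"data": b"blob-1"},
}

def process_entity(entity_key: str) -> bool:
    entity = store[entity_key]          # client.get(key)
    if entity["lock"]:                  # someone else holds the lock
        return False
    entity["lock"] = 1                  # lock bin set; use a generation-
                                        # check write policy for atomicity
    try:
        # fetch all data-contexts in one round trip (batch get)
        contexts = {k: store[k] for k in entity["ctx_keys"]}
        for k, ctx in contexts.items():
            ctx["data"] = ctx["data"].upper()  # "processing" placeholder
            store[k] = ctx              # client.put(k, ctx)
        return True
    finally:
        entity["lock"] = 0              # unset the lock bin
```

The point of the lock bin is only to guard against the small number of concurrent accesses to a single entity mentioned in the question; it is not a general distributed lock.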

You can also create a secondary index and store the entity’s primary key as a bin value on all of its data-contexts, so you can retrieve them through a query. This will scale better if you have thousands (or more) of data-contexts per entity, since records are limited in size (by default, the write block size).
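The secondary-index alternative looks roughly like this: every data-context record carries an `entity_id` bin, and fetching becomes an equality query on that bin (in the real Python client, a `client.query(...)` with `aerospike.predicates.equals("entity_id", ...)`). Here a plain list plus a filter stands in for the index; all record and bin names are made up for illustration:

```python
# Each data-context record stores the primary key of its owning entity.
records = [
    {"pk": "ctx:42:0", "entity_id": 42, "data": b"blob-0"},
    {"pk": "ctx:42:1", "entity_id": 42, "data": b"blob-1"},
    {"pk": "ctx:99:0", "entity_id": 99, "data": b"other"},
]

def contexts_for(entity_id: int) -> list:
    """Stand-in for a secondary-index equality query on entity_id."""
    return [r for r in records if r["entity_id"] == entity_id]

assert len(contexts_for(42)) == 2
```

The trade-off versus the list-of-keys approach is that the entity record no longer has to hold (and grow) a key list, so the number of data-contexts per entity is no longer bounded by the entity record's size.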