Handle time based queries

range
index
udf
query

#1

We came across a case, where we want to retrieve data from a time series. Let say we have time based data : [“t1-t2” : {data1}, “t2-t3” : {data2}, “t3-t4”:{dat3}]

With the above kind of data, we would want to look up exact data w.r.t time. For example, for a given time t1.5, the data has to come as data1, and for t2.6 it should come as data2.

To solve the above problem, we are planning to store the data in a sorted map in aerospike as mentioned below {“t1”:{data1}, “t2”:{dat2}, “t3”: {data3}}

When a client asks for t1.5, we must return data1. To achieve this, we implemented a UDF at the server level to do a binary search for the nearest and lowest value for the given input (i.e t1.5), which will return t1 's value ,i.e data1.

Is there a better way of achieving the same, as it incurs cost at server level for every request. Even UDF to do a binary search requires loading all the data in memory, can we avoid it?


#2

Thinking aloud… Storing t1-t2, t2-t3 is redundant on t2. Just store t1, t2 is inferred from next key:value. { t1:data, t2:data, …} - store key sorted You must know max difference between any ‘t1’ and ‘t2’ Build secondary index on MAPKEY and type numeric (this essentially does the bulk of the sort work for you upfront in the RAM) Search for records where t between t-maxdiff and t+maxdiff ==> a set of few records and pass these to your UDF. Invoke UDF on these few records subset to return the data. This will be a very simple UDF. Note: A particular UDFs invocations are efficient only up to 128 concurrent executions per node at any give time. Beyond 128 invocations of the same UDF on a given node, the performance will degrade.


#3

One more suggestion…

Seems you have a fixed timeslab (t1-t2) but storing the lower end (t1) as map key. If you store the upper end (t2) as the may key, you should be able to use our getByKeyRange() operation. If you are looking for t1.5, you can use getByKeyRange() and pass t1.5 as the lower bound. You can pass a conservative upper-bound to limit the number of returned values. Otherwise you can leave it as NULL. If you pass NULL, you will get everything above t1.5 In either case, the application can consume the first value and discard the result set.

You can store lower-bound of the timeslab also but the logic will become little bit more complex. Instead of passing your desired value as lower-bound, you have to pass it as upper-bound. Similar to the above, you can pass a conservative lower-bound or NULL. When you get the results, you need to skip all values except the last value.