Timeseries Data structure question


#1

Hello,

I have a question about the best way to organize my dataset.

Currently I have a set of timeseries data with a key like this:

  • 2015:8:15:appid, a record with 3 integer bins used as counters.

The way I query is to generate the whole key range (2015:8:1:appid, 2015:8:2:appid, …) and then batch-read from Aerospike.
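A sketch of that batch-read pattern, assuming the key scheme from the post; the `month_keys` helper name is mine, and the `client.get_many` call mentioned in the comment is only illustrative:

```python
import calendar

def month_keys(year, month, app_id):
    """Build the primary keys for one month of daily counter records.

    Key scheme (from the post): "<year>:<month>:<day>:<appid>".
    """
    days = calendar.monthrange(year, month)[1]
    return ["%d:%d:%d:%s" % (year, month, day, app_id)
            for day in range(1, days + 1)]

# These strings would then be wrapped as (namespace, set, key) tuples
# and handed to a batch read, e.g. the Python client's client.get_many(...).
keys = month_keys(2015, 8, "appid")
```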

A (theoretically) more efficient approach I considered was to have one record per year, keyed 2015:appid, with 365 * 3 bins, and then query it using a bin filter (get record 2015:appid1 and bins 3 through 40). That way no batch query would be needed.
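A sketch of how the bin selection could work under that one-record-per-year layout. The `"d<doy>_c<i>"` naming scheme is hypothetical (the post only says 365 * 3 bins), and the `client.select` call in the comment is illustrative:

```python
def bin_names(start_day, end_day, counters=3):
    """Bin names for a day-of-year range under the one-record-per-year
    layout: one bin per (day, counter) pair.
    """
    return ["d%d_c%d" % (day, c)
            for day in range(start_day, end_day + 1)
            for c in range(counters)]

# A single read could then fetch just these bins, e.g.
# client.select((ns, set_name, "2015:appid1"), bin_names(3, 40))
names = bin_names(3, 40)
```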

Drawbacks: the whole year's timeseries would live on a single node, and the record would be rewritten a lot. I am using stream aggregation to compute the sums, which means the record would be updated on every event.

Is my current solution good? Is there a better way of doing time series in Aerospike?

Thank you very much!


#2

Data modeling in NoSQL is always about how you will retrieve the data.

The solution you have proposed will work so long as you can always formulate the primary keys to be read. In the latest release of Aerospike (3.6) you will actually get a performance boost, as the multiple reads will be interleaved.


#3

About your second approach (2015:appid1 with 365 bins):

Looks possible. A quick estimate: 365 * 3 * 8 bytes + overhead per record ≈ 10 KB, which should be below the block size and therefore not cause too many additional (theoretically wasted) IOPS, beyond wasting ~9.9 KB in the write cache when all you actually change is about 24 bytes.

However, I would say that a large ordered list seems the better choice for this use case. Simply add an LDT (Large Data Type) bin to your App_$AppId record; it can hold a practically limitless number of values. If you set the max entry size to 24 bytes you get a fairly compact way of storing the data (much smaller than your approach). There is built-in support for range queries on this data structure. Its biggest drawback, however, is that it does not allow more than a few thousand updates/sec when deployed on SSDs rather than in memory (benchmark that yourself, on the hardware you plan to deploy with).
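This is not the LDT API itself, just a toy in-memory model of the ordered-list semantics (entries kept sorted by a timestamp-like key, with cheap range queries), to illustrate why the structure fits time series:

```python
import bisect

class OrderedListModel:
    """Toy model of a large ordered list: entries kept sorted by key so
    range queries are cheap. Illustrative only; the real LDT lives
    server-side and is accessed through the client API."""

    def __init__(self):
        self._keys = []   # sorted day indices / timestamps
        self._vals = []

    def add(self, key, value):
        # Insert while keeping both lists sorted by key.
        i = bisect.bisect_left(self._keys, key)
        self._keys.insert(i, key)
        self._vals.insert(i, value)

    def range(self, lo, hi):
        # Return all (key, value) pairs with lo <= key <= hi.
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_right(self._keys, hi)
        return list(zip(self._keys[i:j], self._vals[i:j]))

ts = OrderedListModel()
for day in (1, 5, 9, 20):
    ts.add(day, {"count": day * 10})
window = ts.range(4, 10)   # entries for days 5 and 9
```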

@helipilot50:
I hope that in the future LDTs will be enhanced, e.g. with an 'append only' insert operation that takes a shortcut from the root node straight to the last leaf (useful for a lot of things, especially time-series data; it would save a huge number of SSD read ops, with the benefit of linear runtime as well), and/or with generally higher write and read throughput. I wonder whether anything in the implementation would militate against such an operation, or whether that optimisation is already applied by the normal insert op.

Cheers, Manuel


#4

Thank you for the responses :smile:

@ManuelSchmidt Thank you for the suggestion. When we designed our schema, LDTs couldn't be backed up yet, so we didn't consider them at the time; it would have been a risk. Our solution expects a lot of updates (real-time data), so I will benchmark it. For storing data without updates, though, it seems great! Updating an LDT doesn't require a full record rewrite on SSD, correct? If so, partitioning may be a good solution.

As this is core to our business, the lack of XDR support might bite us in the future. Are there any plans for it?

Best, Oxy.


#5

First off, I'm not an Aerospike developer/employee, so I'm not really an expert on how they actually implemented things. I just googled everything I could find about LDTs when I had to decide on using them, and I hope this helps a bit with your data model.

Also, forget the 10 KB I mentioned; that was wrong. According to the sizing guide it should be closer to 40 KB of record size for a full year: 365 * 3 * (28 bytes bin metadata + 8 bytes data) + overhead per record ≈ 40 KB.
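The arithmetic spelled out, using the per-bin figures quoted above from the sizing guide (record overhead left out):

```python
DAYS = 365
BINS_PER_DAY = 3
BIN_META = 28   # per-bin metadata bytes, per the sizing-guide figure above
INT_DATA = 8    # one 64-bit integer counter

record_bytes = DAYS * BINS_PER_DAY * (BIN_META + INT_DATA)
# 365 * 3 * 36 = 39420 bytes, i.e. roughly 40 KB before record overhead
```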

If my understanding of the LDT sub-record mechanism is correct, it will not update a block on the SSD in place, but will write the whole sub-record (containing a bunch of entries, according to the docs "ranging from 2kb-200kb" in size) through the log-structured file system, which tries to minimise the block-writing cost using the write cache (not confirmed that it actually does so for LDTs). I assume it also has to (persistently) update the super structure, but without looking at the code it's hard to tell whether it's implemented like that. BTW, that super structure seems to be ~220 bytes. What I can't tell you is how to keep your sub-records minimal in size; I would ask the Aerospike engineers that, via their offer of a free 30-minute modeling consultation. To me, 40 KB vs. 2 KB (if that can be enforced) sounds worth the extra IOPS, if there even are any in your case: small LDTs fall back to a kind of compact mode, so there should be no tree traversal causing additional read ops, and the super structure and the compact list can be stored together in the LDT bin.
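To make the 40 KB vs. 2 KB comparison concrete: assuming (hypothetically) that every update rewrites either the whole year record or one minimal sub-record, the write amplification difference is a factor of 20:

```python
YEAR_RECORD = 40 * 1024   # flat layout: whole ~40 KB record rewritten per update
SUB_RECORD = 2 * 1024     # LDT layout: smallest documented sub-record size
UPDATES = 1000            # hypothetical number of counter updates

flat_bytes_written = UPDATES * YEAR_RECORD  # bytes written, flat layout
ldt_bytes_written = UPDATES * SUB_RECORD    # bytes written, LDT layout
amplification_ratio = flat_bytes_written // ldt_bytes_written
```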

That's as far as my knowledge goes. I haven't dug deeper into the code; the only reason I didn't use LDTs in the beginning was that backup issue, but it is fixed by now. As for XDR: yes, there is no support, but that isn't an issue with the community edition, which lacks XDR anyway. I haven't heard of plans to support it, but I'd guess that if you are on Enterprise and really want the feature, they would develop it for you.

Cheers, Manuel

P.S.: If you make any findings on this whole topic, I'd be interested to hear about them too. Maybe an Aerospike engineer will reply here as well at some point.