Data modeling for expiration

pk

#1

Hi everyone,

we’re experimenting with Aerospike – more specifically, we’re trying to move our user data from Cassandra to Aerospike. This data is specific in a way that there are categories associated with users that have to eventually expire. I came up with two possible patterns of doing this in Aerospike but since we’re not experienced with it yet, it’s hard to evaluate which one performs better in the long run. Our user data consists of user ID and categories that describe a user and each category should have TTL associated with it.

These are the patterns we’re evaluating:

  1. Have user ID as PK and categories as list. But since we have to expire separate categories after N days, we would have to store expiration dates for each category with them and I suspect also the earliest expiration date in a separate indexed bin so that a job tasked with a deletion of expired categories would have easier time filtering them out.

This is how schema would look like in this case: PK (user id), categories, expirations, earliest_expiration_date (indexed).

Selecting all the categories for a user would be quite easy this way since it’s only a single get by key but on the other hand this requires more work with expiration and more opportunities for bugs.

  1. Have user ID and category ID as PK with only one category bin and user ID bin. What this means is that we’d have several records for each user, each record having TTL. There can possible be hundreds of records associated with a single user. We would create an index for user ID and then use it to perform a query for all user categories. This pattern would be easiest for us to implement but I have an impression that it is not the most suitable for Aerospike.

The schema would look like this: PK (user_id + category), category, user_id (indexed).

Here, we would store several records for each user so we would always have to query for all of them but the advantage is that all the expirations would be handled by Aerospike.

Which pattern offers the best performance and is more suitable for Aerospike?

If you have any question, please let me know.

Thanks for your help!


#2

This is a great question. Short answer is that it would depend what you are trying to optimize for and how many total records you would end up having in each case (which would inform the amount of RAM needed).

  • Write/Update performance would be better for the second option, as records would be smaller, and, if you have to update the records, you can use the replace flag and always pass in the whole record (category + user id) to avoid reading the record first. In the first option, adding a category or removing one from the list would require reading the whole record first before rewriting it (Aerospike takes care of this for you if you don’t provide the whole record and pass in the replace flag).

  • Read performance should be better for the first option as you pointed out since reading a single record will always be faster then doing a secondary index query that would have to go to all nodes in the cluster.

  • Obviously, again as you pointed out the second option is much easier for handling expirations. The second option would also allow Aerospike’s eviction mechanism to be efficient (if it comes to that).

I’ll let other folks share their opinion, but if your top priority is performance to get all categories for a user ID then you have to go with option 1. That will be for sure the best performing. And will use less RAM.

If you can sacrifice a bit on performance, then the second option should work pretty well but you will have to do some benchmark to really quantify how big a sacrifice this is…