I'm looking for a database solution to an existing problem with our current dataset.
I have 90 million users, and each user has a collection of attributes. There are 1,000+ distinct attributes, but each user probably has 40-80 assigned. Each attribute has a set TTL, some lasting hours, some months. I'm looking for a solution that returns all attributes for a given user, filtering out expired data.
It seems the ideal pattern would be to create a record for each user and attach the attributes as bins. However, TTL applies only at the record level, so this method will not work. I see two options:
- Create a unique set for each user, with one record per attribute, each record having a single bin named after the attribute. This would create 90 million sets with 40-80 records each, about 4.5 billion rows total (see the first sketch after this list).
- Create records with two bins: a unique userID and a single attribute. Then place a secondary index on the userID bin. This would create one set and an index with 4.5 billion rows (second sketch below).
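For concreteness, here's roughly what I have in mind for option 1 with the Python client. Everything here (the "users" namespace, host, helper names) is just an illustrative sketch, not something I've tested at scale:

```python
import aerospike

# Sketch of option 1. The "users" namespace, host, and helper names
# are illustrative assumptions, not a working config.
config = {'hosts': [('127.0.0.1', 3000)]}
client = aerospike.client(config).connect()

def put_attribute(user_id, attr_name, attr_value, ttl_seconds):
    # The set name IS the user ID: one set per user, one record per attribute.
    key = ('users', user_id, attr_name)
    client.put(key, {attr_name: attr_value}, meta={'ttl': ttl_seconds})

def get_attributes(user_id):
    # Scan the per-user set; the server skips records whose TTL has
    # expired, so only live attributes come back.
    scan = client.scan('users', user_id)
    attrs = {}
    for _key, _meta, bins in scan.results():
        attrs.update(bins)
    return attrs
```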
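And option 2 would look something like this; again just a sketch, with the index name and the composite primary-key scheme being assumptions on my part:

```python
import aerospike
from aerospike import predicates as p
from aerospike import exception as ex

config = {'hosts': [('127.0.0.1', 3000)]}
client = aerospike.client(config).connect()

# One-time setup: secondary index on the userID bin (index name is assumed).
try:
    client.index_string_create('users', 'attributes', 'userID', 'attr_user_idx')
except ex.IndexFoundError:
    pass  # index already exists

def put_attribute(user_id, attr_name, attr_value, ttl_seconds):
    # One shared set; the primary key combines user and attribute so each
    # (user, attribute) pair is its own record with its own TTL.
    key = ('users', 'attributes', '{}:{}'.format(user_id, attr_name))
    client.put(
        key,
        {'userID': user_id, 'name': attr_name, 'value': attr_value},
        meta={'ttl': ttl_seconds},
    )

def get_attributes(user_id):
    # Secondary-index query on userID; expired records are filtered out
    # server-side, so only live attributes are returned.
    query = client.query('users', 'attributes')
    query.where(p.equals('userID', user_id))
    return {bins['name']: bins['value'] for _k, _m, bins in query.results()}
```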
Which is better for performance? I cannot find any details on query performance when specifying a set alone, on whether something like 90 million sets would cripple the system, or on whether a secondary index over 4.5 billion rows is too big.
Some background, if it helps: we are currently using a 5-node Cassandra cluster, and we do have it working, sort of, but the queries to get the data are complicated. Our biggest issue is that to make it work we had to convert everything to strings, so we lost all typing on the data.