Storing queryable tags against a user profile


#1

Hi all,

I trying to evaluate aerospike for the following scenario (which I think is pretty common in the adtech world). I have a large number of user profiles (several hundred million) and one of those profiles has tags associated to this user, such as country, city, DoB, last_login, coat_size, times_seen etc etc (min 2-3 tags, max 40-50 tags).

I am trying to create the model such that we can run queries like: Get all users that (live in the UK and are at least 30 years of age and wear an X-L coat) OR have logged in today (this is a silly query but you get the point).

The way I m thinking about it is to make a user_profiles table that holds all of that information as separate bins and its accessible by the users unique id, and also introduce a user_tags table where there key is tag name, a bin called tag value (secondary index) and another one called user_id.

So I make a call (or many) to retrieve the individual keys before going to the user_profiles table to get their profiles.

Is this the best way of going about it?

Thanks


#2

Kinda broad topic, best approach depends on quite a few key questions IMHO:

  • What kind of throughput on user attribute changes are you looking for? What throughput on queries?
  • What is your memory budget for this a.k.a can you afford the secondary indices (40-64 bytes per value)?
  • What’s the selectivity of queries? And for multiple filters: can you find 1 filter that reduces result set by far?
  • What latency requirements do you have (offline/online)? Is the output cacheable? If so, can you pre-compute?
  • Could you spend development resources on application-logic to implement it (higher efficiency)
  • Do you need sorting or grouping features.

We are thinking on concepts for something quite similar over here. However, I think it’s fair to say that such queries can be considered a very challenging task for any NoSQL-DB.

Cheers from Germany, Manuel


#3

Thanks for your response. To answer your questions

What kind of throughput on user attribute changes are you looking for? What throughput on queries?

Let me put it simply. The user makes this query and expects to see a number back (the segment size based on his query). I can give it 2-4 seconds and would need to be able to respond back. A large client would have 4-5 million profiles and each with 15 attributes (but in extreme cases can be 40). The overall pool of user profiles will be 200-300 million.

What is your memory budget for this a.k.a can you afford the secondary indices (40-64 bytes per value)?

Yes-ish. If it’s the last resort we can consider it.

What’s the selectivity of queries? And for multiple filters: can you find 1 filter that reducing result set by far?

Yes - the client id (reduces the size from 200 mil to 3-4 mil as explained above)

What latency requirements do you have (offline/online)? Is the output cacheable? If so, can you pre-compute?

No pre-computing unfortunately cause the queries are unknown. We dont have to answer with consistent data right this millisecond but this doesnt help given we cannot pre-compute.

Could you spend development resources on application-logic to implement it (higher efficiency)

Sure.

Do you need sorting or grouping features.

No - Not needed.

Thank you!


#4

Your requirements seem to be quite relaxed.

The simpler approach and the one you should first evaluate/benchmark IMHO is to use secondary indices with built-in query() feature. Queries only benefit from 1 secondary index filter - therefore highest possible selectivity is desirable. You can apply more filters later on but initial stream of documents that are read from storage is determined only by set and up to one secondary index filter. (Source: http://www.aerospike.com/launchpad/query_multiple_filters.html Abstract “Discussion”).

From what you said it might be possible that you have one “set” (a.k.a table) per client (up to 1024 sets allowed with AS) and one further drilled down index filter like last_seen. If you can get from 3M down to 500k it’ll be a huge gain (partitioning ftw). I’d tend to go with a time value for the secondary index thing but only you know what’s best for your app.

Assuming you are running on SSDs, you might be able to speed this approach further up by configuring “data-in-memory” which will keep a read-only copy of your data in memory that could be used on server-side for that secondary filtering stage. With 200M records of 1K size (?) you’ll end up at 200 GiBs. If you keep secondary indices for the 2-3 most commonly used filters, that will add another 25 GiBs or so for secondary indices. Doable. Depends on how big your customer records are.

However, you’ll have to benchmark this approach with the different configurations and see whether this can fit your time window of 2-4 secs. I remember I once heard an rule of thumbs that queries can evaluate 200k entries/sec on bare metal per node, but please don’t trust this number at all.

To sum it up. Your records would look like this (pseudo names - might be too long!):

Set: ClientX_CustomerSet  Key: Customer_1234567
Bins: {
  "lastseenbin": 123456789,		// Each set has an own secondary index tree on this value.
  "agebin": 30,				// .. and maybe another one on this one too. Should have atleast 1 useful index for any possible query.
  "locationbin": "UK",
  "coat_sizebin": "XL",
  // ... dynamic set of tags and values as bins (per-customer)

  // and your normal data for customer (or lookup from another record by a 'FK' here):
  "name ": "John Doe",
  "email": "johndoe@example.org",			
} 

Pseudoquery:

Aggregate yourFurtherFilterFunction(queryparams like age=30,location=UK) on yourNamespace.ClientX_CustomerSet where lastseenbin between 120000000 and 199999999

Don’t be scared by the lua code examples in the article linked above. You’ll see it is a perfect fit once you socialize with the philosophy behind this powerful feature.


An alternative approach is to keep your own kind of index with large ordered lists (“LList”) but this is a lot more work though it might perform better in the end. It requires your app to manually do the bookkeeping. You have records like “ClientX_CustomersAge30” and inside 1 bin that contains a list of all customer IDs (record digest or self-autoincremented integer uids) with that value. When you want to perform a query, you get (“scan” or “find_N”) such a list and e.g. merge it with entries from a list representing all customers with coatsize XL. You get the idea. After filtering took place, you batch_get details for all customers you want to display on page 1. Ranging() is possible too on lists, but you need unique keys for the sorted large ordered lists so you might have to become creative (like fusing an age value (8 bits) with random 56 bits).

This approach is kinda limited on the amount of throughput since you have write amplification (might have to touch 50 lists in your case). That’s why I asked for the expected throughput on changes… This works better for static data sets. It’ll be a pain to manage from the app-logic but it might allow for drastically faster responses and higher query throughput. Not sure when we will have a chance to implement this and comment on it’s performance…

But the general answer is: it’s possible to do such things but it’ll require some effort / creativity with any NoSQL-DB that hasn’t been fully optimized for these queries in their designs.

Hope this input is helpful. The team might be able to tell you more about expected throughput with the first approach or whether I got something wrong from all the doc materials out there. With your approach I see a problem because you are mixing up secondary index values from all bins/dimensions (?). Biggest problem is that an secondary index can either contain integer or string values, not both at the same time. You’d have to create a record for every value per customer as well, resulting in wasting primary index memory (64 bytes per record!). This way you’d easily use 512B-1K per customer just for indices. Or did I get your idea wrong here?

Cheers, Manuel


#5

Hi Manuel - Thanks for your very detailed and helpful answer. My idea was pretty much what you have explained in your alternative solution but I may have explained it wrong or not enough.

However - the problem with this approach thinking about it now is that you cannot do things like WHERE Last_Seen < 3 months ago (range queries). Am I wrong?

I will look into the main solution you recommend - didnt even think it could perform well enough to consider


#6

With both approaches you can. For the LList approach, see http://www.aerospike.com/docs/guide/llist.html. Instead of scan() you use range() with parameters ( start = time.Now()-90days and end = time.Now() or better use far future value like 2^52 - consider that to be your MAX_UINT64 with lua limitations). The only issue is that by default you are only allowed to insert unique values - so epoch/unix timestamp is a problem you have to workaround. But e.g. if you use 64-bit nanosecond timestamps you are unique 99.9999999999999999…99% of the time (trust me, it works).

Record ClientX_CustomersByLastSeen {
  "llist-bin": [
    { "key": "1239812398122389123", "foreign_key": "Customer_12345" },
    { "key": "1239812398124449199", "foreign_key": "Customer_67899" },
    { "key": "1393199391939102001", "foreign_key": "Customer_1093" },
  ]
}

llist.range( 12398123981240000000, 14000000000000000000) gets you 2nd and 3rd entries.


#7

Whats the algorithmic complexity of llist range? Should I be afraid to use it? Am I losing the edge AS provides? thx


#8

But actually Manuel this is not gonna work. How do you store the Coat Size? You got S, M, L, XL. You can’t have unique value therefore.


#9

It’s tricky, but possible. Which approach is better depends on how randomly spread your own values are. I’ll use strings below but it would be better to use integers / bit masks:

List values can be like “SizeX_CustomerId123”. This way if two Customers have SizeX their value is still unique. If you don’t have a customer id or something similiar to use, you can use random or timestamp-like values. Modeling with NoSQL is not so clean like one might be used to from relational world, but it works in real world! You have to design around the limitations of whatever NoSQL-DB you are evaluating.

(EDIT) Note: If the values would be distinct - like a favorite color: black, blue, green, red, … - you HAVE to seperate indexes per value. It’s smarter to query a list of customers liking black and one liking red instead of getting all customers (a range would give you blue and green). Might be smart with coat size too.

About the runtime: ofc you end up with atleast linear complexity for this (most significant: largest item count with a given attribute value). I see no way around that. With the LList-approach, I would be confident to hit the 2-4 second deadline if you use something like Go or C to do the client-side filtering. With secondary indices and aggregate queries, I’m not sure about the runtime in your case because it depends a lot on the data set itself. I should mention that if you deploy this approach, you use higher throughput because like I said, you have write amplification. You can hide that “index maintenance stuff” from the user, though so he won’t notice.

All together, I see no other way than these 2 approaches if you want to stick with AS for everything.

BTW: Feel free to contact me by private message or email (manu@zendri.com) if you’d like to talk in depth about this solution. Like I said, we are working on something similar which is scheduled to be deployed in the first half of this year. We have harder requirements, though.

Cheers, Manuel