Using Aerospike to make a top X query (pageviews or scoreboard)

llist
query

#1

There seem to be no way to make a a top X query (top pageviews, a scoreboard, etc).

A simple use case is the following: I receive beacons from browsers, and I need to compute the top browsers (or countries) by pageviews for a particular day.

I thought about using a Large Ordered List. They record key would be “browsers-20150121” and it would be contain a LLIST. This LLIST would contain objects with two fields, the browser name and the pageviews for this browser.

Two problems here.

  1. Inserting: Apparently I can’t update my LLIST efficiently when a beacon arrives. I would have to scan the whole list to find the object (with find()) matching the browser, make a +1, and then update() the object.

  2. Querying: I can’t ask for a top X from 1 to 10, then 11 to 20, etc.

Did I miss something ? Is there another ways to tackle this kind of problem ?


#2

I found another way. It should work well if the cardinality of values is not too high.

Storing records with the key “20150121-browsers-chrome”. For the bins: “pageviews”(int) and “value”(string). In this case the value is “20150121-browsers”. A secondary index is created on the “value” field.

  1. Inserting: The bin “pageviews” is incremented easily with the multiops API.

  2. Querying: to get the top browsers for the January 21st, I query on the secondary index, and then sort by pageviews via the Stream UDF.

So this solution is ok for a low-cardinality segment (like browsers). A lot less for high cardinality segments like URLs or cities.

In the end, I just wanted to post a feature request about adding more functions to the LLIST API. It’s “Ordered”, but the current API does not seem to put this feature forward.


#3

Hi nfo —

You’re on the right track with LLIST, but the API is “cryptic at best”. We’re refreshing our doc on this as we speak — but I’ll set you on the right track.

In order to efficiently insert in the case you’re using, you’ll insert a map object (the native Map type of the language, like a Map in Java or a dict() in Python).

The server will look inside the map and use the key “key” as the indexed item. When you add() you will efficiently insert into the llist. The supported types currently integer and string (you’ll want integer for a scoreboard).

In pseudocode: m = { “key” : 420, /* myscore */ “browser_name” : “foobrowser” } cluster.add(“namespace”, “set”, “browsers-20150121”, m )

Perhaps someone else will respond with a better example.

Regarding your second way: Yes, that works too. Secondary indexes will always be far less efficient than accessing a single key. In that case, you’ll ask different servers for their ranges (since we’re a distributed system), and do a storage op for each row (if using SSD).

Accessing a single key per scoreboard is more efficient in overall work because there is one network event that goes to one server on a read, and replicated on a write. Using a secondary index is quite efficient on writes in Aerospike because you insert a new record and update the secondary index on that server, but reads will be a little less efficient because you query multiple servers and merge on the client.

But you can do one or the other.


#4

We have this setup in a general namespace used for some basic analytics:

Set: “browsers”

Key: “2015-01-21”

Bins: “chrome” => 1223, “firefox” => 453, “ie” => 323 (bin names are the choices, values are the counts)

For more custom stuff, we add prefixes to the keys “somesite.com-2015-01-21” or just create a new set to store different type of metrics.

128k record size can hold a fair bit of values like this and is small enough that we just pull back entire records and do the ordering of top values client-side.