Using Aerospike to make a top X query (pageviews or scoreboard)

nfo · January 21, 2015, 2:10pm

There seem to be no way to make a a top X query (top pageviews, a scoreboard, etc).

A simple use case is the following: I receive beacons from browsers, and I need to compute the top browsers (or countries) by pageviews for a particular day.

I thought about using a Large Ordered List. They record key would be “browsers-20150121” and it would be contain a LLIST. This LLIST would contain objects with two fields, the browser name and the pageviews for this browser.

Two problems here.

Inserting: Apparently I can’t update my LLIST efficiently when a beacon arrives. I would have to scan the whole list to find the object (with find()) matching the browser, make a +1, and then update() the object.
Querying: I can’t ask for a top X from 1 to 10, then 11 to 20, etc.

Did I miss something ? Is there another ways to tackle this kind of problem ?

nfo · January 21, 2015, 6:24pm

I found another way. It should work well if the cardinality of values is not too high.

Storing records with the key “20150121-browsers-chrome”. For the bins: “pageviews”(int) and “value”(string). In this case the value is “20150121-browsers”. A secondary index is created on the “value” field.

Inserting: The bin “pageviews” is incremented easily with the multiops API.
Querying: to get the top browsers for the January 21st, I query on the secondary index, and then sort by pageviews via the Stream UDF.

So this solution is ok for a low-cardinality segment (like browsers). A lot less for high cardinality segments like URLs or cities.

In the end, I just wanted to post a feature request about adding more functions to the LLIST API. It’s “Ordered”, but the current API does not seem to put this feature forward.

bbulkow · January 21, 2015, 6:33pm

Hi nfo —

You’re on the right track with LLIST, but the API is “cryptic at best”. We’re refreshing our doc on this as we speak — but I’ll set you on the right track.

In order to efficiently insert in the case you’re using, you’ll insert a map object (the native Map type of the language, like a Map in Java or a dict() in Python).

The server will look inside the map and use the key “key” as the indexed item. When you add() you will efficiently insert into the llist. The supported types currently integer and string (you’ll want integer for a scoreboard).

In pseudocode: m = { “key” : 420, /* myscore */ “browser_name” : “foobrowser” } cluster.add(“namespace”, “set”, “browsers-20150121”, m )

Perhaps someone else will respond with a better example.

Regarding your second way: Yes, that works too. Secondary indexes will always be far less efficient than accessing a single key. In that case, you’ll ask different servers for their ranges (since we’re a distributed system), and do a storage op for each row (if using SSD).

Accessing a single key per scoreboard is more efficient in overall work because there is one network event that goes to one server on a read, and replicated on a write. Using a secondary index is quite efficient on writes in Aerospike because you insert a new record and update the secondary index on that server, but reads will be a little less efficient because you query multiple servers and merge on the client.

But you can do one or the other.

manigandham · January 22, 2015, 3:45am

We have this setup in a general namespace used for some basic analytics:

Set: “browsers”

Key: “2015-01-21”

Bins: “chrome” => 1223, “firefox” => 453, “ie” => 323 (bin names are the choices, values are the counts)

For more custom stuff, we add prefixes to the keys “somesite.com-2015-01-21” or just create a new set to store different type of metrics.

128k record size can hold a fair bit of values like this and is small enough that we just pull back entire records and do the ordering of top values client-side.

Topic		Replies	Views
Data Modeling 101! (Data modeling for specific use-cases) How Developers Are Using Aerospike secondary , index , sort	1	2387	December 13, 2014
Find Top N record from stream User Defined Functions (UDF)	3	2639	January 23, 2015
Storing queryable tags against a user profile Data Modeling	8	3692	January 14, 2016
How would one implement a sorted list like Redis Node.js Client	6	4218	March 13, 2015
How to compute top X in last Y days? How Developers Are Using Aerospike query , java , aggregation , index	1	1113	June 8, 2020

Using Aerospike to make a top X query (pageviews or scoreboard)

Related topics