Realtime aggregation of click data


#1

We’re planning to use Aerospike for storing our clickstream data. As we operate similar fashion to ad tech companies, we must scale our data quite fast with low latency.

Suppose, think our clickstream set has the following bins

user id, country, platform, campaign, cost, timestamp

And we’re inserting 1000 records/sec. (billions/month)

What we should be aware to make queries like these in realtime and on a large scale:

count clicks where country = US

count clicks, sum cost where platform = mobile and campaign = 1209

sum cost where timestamp > last month

Also, any advice on database design for ad tech, big or small are welcome


#2

This is a complex problem. You either have the effort with every insert/update/delete operation or with every query. First approach can be achieved with inverted indexes. Most big companies use asynchronous logic to maintain these indexes because write amplification can be quite high if you have many bins/dimensions that get indexed and most importantly, it’s easier to design in a failsafe way than synchronous indexing. This can be implemented with your own logic with AS. Dynamic GROUP BYs or SORT BYs on dynamic columns on a large scale remain a hard to solve challenge.

The second approach is supported with out-of-the-box features in AS with the secondary index and query-features. You basically select records to check with 1 dimensional filtering (use most selective!) and AS will apply a UDF against every record that can check for dimension 2 till n. This, however, is not hard real time. It can take dozens of seconds or even longer to finish (O(N) where N is amount of matching in dim1). Performance can be scaled with amount of nodes, as all nodes do this work in parallel.

Cheers, Manuel


#3

This is what druid from metamartket does best.