Data Model for Average Session Duration

Use Case

We use Aerospike as the data source behind our tracking system. When a user opens our application, we create a non-expiring session. For reporting purposes, we generate in a separate system an overview of the average timespan for which a session was active.

For the other trackings we do, we create and update pre-stored aggregation sets, always bucketed per hour. Whenever a client requests those aggregations, we sum them up before we hand over the result.
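Conceptually, serving such a request boils down to reading the hourly aggregation records for the requested window and summing them. Below is a minimal sketch with the Aerospike Python client; the key scheme agg:&lt;YYYYMMDDHH&gt;, the count bin, and the test/aggregations namespace/set are assumptions for illustration only.

```python
# Minimal sketch (Python client): sum pre-stored hourly aggregation records.
# Key scheme 'agg:<YYYYMMDDHH>', bin 'count', and the 'test'/'aggregations'
# namespace/set are assumptions for illustration.
import aerospike
from aerospike import exception as ex

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

def sum_hourly_aggregates(hours):
    """Sum the hourly aggregation records for the requested hours."""
    total = 0
    for hour in hours:  # e.g. ['2020121610', '2020121611', ...]
        key = ('test', 'aggregations', 'agg:' + hour)
        try:
            _, _, bins = client.get(key)
            total += bins.get('count', 0)
        except ex.RecordNotFound:
            pass  # no trackings were recorded for that hour
    return total
```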

Data Model

To store a tracking for the mentioned report, we make use of the following (simplified) data model.

Key

  • session-id: string (from the raw tracking data)

Bins

  • first-action: integer (timestamp of the first action in the session)
  • last-action: integer (timestamp of the last action in the session)
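
Creating a record with this layout is a plain put keyed by the session id. A minimal sketch with the Aerospike Python client follows; the namespace test and set sessions are assumptions for illustration.

```python
# Minimal sketch (Python client): write a session record with this layout.
# Namespace 'test' and set 'sessions' are assumptions for illustration.
import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

def create_session(session_id, timestamp_ms):
    key = ('test', 'sessions', session_id)   # key: session-id
    client.put(key, {
        'first-action': timestamp_ms,        # timestamp of the first action
        'last-action': timestamp_ms,         # timestamp of the last action
    })
```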

Data Model Sequence

Due to how the infrastructure is set up, trackings might arrive out of order. This means that a tracking our systems received first might only get written to the database second or third.

first tracking received at 1608101020176

  • session-id: ‘some-session-id’ (create new record)
  • first-action: 1608101020176
  • last-action: 1608101020176

second tracking received at 1608108020176

  • session-id: ‘some-session-id’ (update existing record)
  • first-action: 1608101020176
  • last-action: 1608108020176

third tracking received at 1604108020176

  • session-id: ‘some-session-id’ (update existing record)
  • first-action: 1604108020176 (updated, since this tracking is earlier than the stored first action)
  • last-action: 1608108020176
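
To keep both bins correct regardless of arrival order, the write path effectively applies a minimum on first-action and a maximum on last-action. Below is a minimal read-modify-write sketch with the Aerospike Python client; a production version would guard against concurrent updates (e.g. with a generation check), and the namespace/set names are assumptions for illustration.

```python
# Minimal sketch (Python client): order-independent upsert where first-action
# keeps the minimum and last-action the maximum timestamp seen so far.
# Namespace 'test' and set 'sessions' are assumptions for illustration.
import aerospike
from aerospike import exception as ex

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

def record_tracking(session_id, timestamp_ms):
    key = ('test', 'sessions', session_id)
    try:
        _, _, bins = client.get(key)                    # update existing record
        first = min(bins['first-action'], timestamp_ms)
        last = max(bins['last-action'], timestamp_ms)
    except ex.RecordNotFound:
        first = last = timestamp_ms                     # create new record
    client.put(key, {'first-action': first, 'last-action': last})
```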

Report Calculation

To generate a report from the created records, we run a query stream and compute the average of (last-action - first-action) over all existing records.
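
In code, this is roughly a full stream over the set with the average computed on the client side. A minimal sketch with the Aerospike Python client, with namespace and set names assumed for illustration:

```python
# Minimal sketch (Python client): stream all session records and average
# (last-action - first-action) on the client side.
# Namespace 'test' and set 'sessions' are assumptions for illustration.
import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

total_duration_ms = 0
record_count = 0

def collect(record):
    global total_duration_ms, record_count
    _, _, bins = record                              # (key, metadata, bins)
    total_duration_ms += bins['last-action'] - bins['first-action']
    record_count += 1

query = client.query('test', 'sessions')             # no predicate: full stream
query.foreach(collect)

average_ms = total_duration_ms / record_count if record_count else 0
print(average_ms)
```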

Issue

While this system works fine for ordinary amounts of trackings (~3 million raw trackings take 2-3 seconds), we're experiencing long query times or even timeouts when there are far more trackings (75 million raw trackings take ~150 seconds).

Question

Our question now is how to solve a situation like this. We assume that the data model is not a perfect fit for the use case, but we also don't know how to solve the issue otherwise. This leads us to the main question.

What is the best practice design pattern for a use case like this?

Here are some suggestions:

  • Try tweaking the config parameters scan-max-udf-transactions and single-scan-threads, which affect scan and UDF performance.

  • If your storage device is slow, performance over a large data set will suffer. Improving device speed will help.

  • Improving cluster resources (at the per-node level and/or by increasing the number of nodes) will also help.

  • An alternative data/computational model is to create a separate aggregation record (with session-duration-sum and session-count bins) and update it with each session. The average is then computed by simply reading that record and taking the ratio session-duration-sum / session-count. This entails more writes, but may be acceptable in your situation. If the aggregation record becomes a write bottleneck (a "hot key"), techniques such as distributing it over multiple records and reading all of them to compute the average can be used. See the sketch after this list.
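
Here is a minimal sketch of that alternative with the Aerospike Python client; the record key, the (shortened) bin names duration-sum / session-count, and the point at which a session's duration gets added are assumptions for illustration.

```python
# Minimal sketch (Python client): aggregation-record alternative that keeps a
# running sum and count and increments both atomically per session.
# Key, bin names, and the 'test'/'aggregations' namespace/set are assumptions.
import aerospike
from aerospike_helpers.operations import operations as ops

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()
AGG_KEY = ('test', 'aggregations', 'session-duration-agg')

def add_session(duration_ms):
    # Both counters are updated in one atomic operate() call on the record.
    client.operate(AGG_KEY, [
        ops.increment('duration-sum', duration_ms),
        ops.increment('session-count', 1),
    ])

def average_duration_ms():
    _, _, bins = client.get(AGG_KEY)
    return bins['duration-sum'] / bins['session-count']
```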

Hope this helps.

I think we're going to give the suggested aggregation-record data model a shot, as we already did some test drives of the other alternatives.

OK, the pre-processing prior to the final aggregation ought to relieve the bottleneck in the final step.
