Data Model for Average Session Duration

Use Case

We use Aerospike as the data source behind our tracking system. When a user opens our application, we create a non-expiring session. For reporting purposes, we generate in a separate system an overview of the average timespan for which a session was active.

For the other trackings we do, we create and update pre-stored aggregation sets, always bucketed per hour. Whenever a client requests those aggregations, we sum them up before we hand over the result.
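Conceptually, serving such a request boils down to reading the hourly aggregation records for the requested window and summing them. Below is a minimal sketch with the Aerospike Python client; the key scheme agg:&lt;YYYYMMDDHH&gt;, the count bin, and the test/aggregations namespace/set are assumptions for illustration only.

```python
# Minimal sketch (Python client): sum pre-stored hourly aggregation records.
# Key scheme 'agg:<YYYYMMDDHH>', bin 'count', and the 'test'/'aggregations'
# namespace/set are assumptions for illustration.
import aerospike
from aerospike import exception as ex

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

def sum_hourly_aggregates(hours):
    """Sum the hourly aggregation records for the requested hours."""
    total = 0
    for hour in hours:  # e.g. ['2020121610', '2020121611', ...]
        key = ('test', 'aggregations', 'agg:' + hour)
        try:
            _, _, bins = client.get(key)
            total += bins.get('count', 0)
        except ex.RecordNotFound:
            pass  # no trackings were recorded for that hour
    return total
```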

Data Model

To store a tracking for the mentioned report, we make use of the following (simplified) data model.

Key

  • session-id: string (from the raw tracking data)

Bins

  • first-action: integer (timestamp of the first action in the session)
  • last-action: integer (timestamp of the last action in the session)
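
Creating a record with this layout is a plain put keyed by the session id. A minimal sketch with the Aerospike Python client follows; the namespace test and set sessions are assumptions for illustration.

```python
# Minimal sketch (Python client): write a session record with this layout.
# Namespace 'test' and set 'sessions' are assumptions for illustration.
import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

def create_session(session_id, timestamp_ms):
    key = ('test', 'sessions', session_id)   # key: session-id
    client.put(key, {
        'first-action': timestamp_ms,        # timestamp of the first action
        'last-action': timestamp_ms,         # timestamp of the last action
    })
```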

Data Model Sequence

Due to how the infrastructure is set up, trackings might arrive out of order. This means that a tracking our systems received first might only get written to the database second or third.

first tracking received at 1608101020176

  • session-id: ‘some-session-id’ (create new record)
  • first-action: 1608101020176
  • last-action: 1608101020176

second tracking received at 1608108020176

  • session-id: ‘some-session-id’ (update existing record)
  • first-action: 1608101020176
  • last-action: 1608108020176

third tracking received at 1604108020176

  • session-id: ‘some-session-id’ (update existing record)
  • first-action: 1604108020176 (updated, since this tracking is earlier than the stored first action)
  • last-action: 1608108020176
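
To keep both bins correct regardless of arrival order, the write path effectively applies a minimum on first-action and a maximum on last-action. Below is a minimal read-modify-write sketch with the Aerospike Python client; a production version would guard against concurrent updates (e.g. with a generation check), and the namespace/set names are assumptions for illustration.

```python
# Minimal sketch (Python client): order-independent upsert where first-action
# keeps the minimum and last-action the maximum timestamp seen so far.
# Namespace 'test' and set 'sessions' are assumptions for illustration.
import aerospike
from aerospike import exception as ex

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

def record_tracking(session_id, timestamp_ms):
    key = ('test', 'sessions', session_id)
    try:
        _, _, bins = client.get(key)                    # update existing record
        first = min(bins['first-action'], timestamp_ms)
        last = max(bins['last-action'], timestamp_ms)
    except ex.RecordNotFound:
        first = last = timestamp_ms                     # create new record
    client.put(key, {'first-action': first, 'last-action': last})
```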

Report Calculation

To generate a report from the created records, we run a query stream and compute the average of (last-action - first-action) over all existing records.
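
In code, this is roughly a full stream over the set with the average computed on the client side. A minimal sketch with the Aerospike Python client, with namespace and set names assumed for illustration:

```python
# Minimal sketch (Python client): stream all session records and average
# (last-action - first-action) on the client side.
# Namespace 'test' and set 'sessions' are assumptions for illustration.
import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

total_duration_ms = 0
record_count = 0

def collect(record):
    global total_duration_ms, record_count
    _, _, bins = record                              # (key, metadata, bins)
    total_duration_ms += bins['last-action'] - bins['first-action']
    record_count += 1

query = client.query('test', 'sessions')             # no predicate: full stream
query.foreach(collect)

average_ms = total_duration_ms / record_count if record_count else 0
print(average_ms)
```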

Issue

While this system works fine for ordinary amounts of trackings (~3 million raw trackings take 2-3 seconds), we're experiencing long query times or even timeouts when there are far more trackings (75 million raw trackings take ~150 seconds).

Question

Our question now is how to solve a situation like this. We assume that the data model is not a perfect fit for the use case, but we also don't know how to solve the issue otherwise. This leads us to the main question.

What is the best practice design pattern for a use case like this?

Here are some suggestions:

  • Try tweaking the config parameters scan-max-udf-transactions and single-scan-threads, which affect scan and UDF performance.

  • If your storage device is slow, performance over a large data set will suffer. Improving device speed will help.

  • Improving cluster resources (at the per-node level and/or by increasing the number of nodes) will also help.

  • An alternative data/computational model is to create a separate aggregation record (with session-duration-sum and session-count bins) and update it with each session. The average is then computed by simply reading that record and taking the ratio session-duration-sum / session-count. This entails more writes, but may be acceptable in your situation. If the aggregation record becomes a write bottleneck (a "hot key"), techniques such as distributing it over multiple records and reading all of them to compute the average can be used. See the sketch after this list.
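
Here is a minimal sketch of that alternative with the Aerospike Python client; the record key, the (shortened) bin names duration-sum / session-count, and the point at which a session's duration gets added are assumptions for illustration.

```python
# Minimal sketch (Python client): aggregation-record alternative that keeps a
# running sum and count and increments both atomically per session.
# Key, bin names, and the 'test'/'aggregations' namespace/set are assumptions.
import aerospike
from aerospike_helpers.operations import operations as ops

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()
AGG_KEY = ('test', 'aggregations', 'session-duration-agg')

def add_session(duration_ms):
    # Both counters are updated in one atomic operate() call on the record.
    client.operate(AGG_KEY, [
        ops.increment('duration-sum', duration_ms),
        ops.increment('session-count', 1),
    ])

def average_duration_ms():
    _, _, bins = client.get(AGG_KEY)
    return bins['duration-sum'] / bins['session-count']
```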

Hope this helps.

I think we're going to give the suggested aggregation-record data model a shot, as we already did some test drives of the other alternatives.

OK, the pre-processing prior to the final aggregation ought to relieve the bottleneck in the final step.
