Use Case
We use Aerospike as the data source behind our tracking system. When a user visits our application, we create a non-expiring session record. For reporting purposes, a separate system builds an overview of the average timespan for which a session was active.
For other trackings, we create and update pre-stored aggregation sets, always bucketed per hour. Whenever a client requests those aggregations, we sum the hourly buckets before handing over the result, as sketched below.
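A minimal sketch of this aggregation pattern with the Aerospike Python client; the namespace tracking, set hourly-agg, and bin count are placeholder names, not our real schema:

```python
import aerospike
from aerospike import exception as ex
from aerospike_helpers.operations import operations

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

HOUR_MS = 3600 * 1000

def record_tracking(event_ts_ms):
    # One aggregation record per hour; the hour start is encoded in the key.
    hour = event_ts_ms - (event_ts_ms % HOUR_MS)
    key = ('tracking', 'hourly-agg', 'hour-%d' % hour)
    # increment() is applied atomically on the server and creates the
    # record/bin if it does not exist yet.
    client.operate(key, [operations.increment('count', 1)])

def sum_aggregations(from_ts_ms, to_ts_ms):
    # Sum the pre-stored hourly buckets before handing the result back.
    total = 0
    hour = from_ts_ms - (from_ts_ms % HOUR_MS)
    while hour <= to_ts_ms:
        try:
            _, _, bins = client.get(('tracking', 'hourly-agg', 'hour-%d' % hour))
            total += bins.get('count', 0)
        except ex.RecordNotFound:
            pass  # no trackings were recorded in this hour
        hour += HOUR_MS
    return total
```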
Data Model
To store a tracking for the mentioned report, we use the following (simplified) data model; a minimal write sketch follows the bin list.
Key

- session-id: string (from tracking raw data)

Bins

- first-action: integer (timestamp of the first action of the session)
- last-action: integer (timestamp of the last action of the session)
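A sketch of writing such a record with the Python client; the namespace and set names are placeholders, and TTL_NEVER_EXPIRE keeps the record non-expiring, matching the use case above:

```python
import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

def create_session(session_id, action_ts_ms):
    # The session id is the user key; both bins start at the first action.
    key = ('tracking', 'session', session_id)
    bins = {'first-action': action_ts_ms, 'last-action': action_ts_ms}
    client.put(key, bins, meta={'ttl': aerospike.TTL_NEVER_EXPIRE})
```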
Data Model Sequence
Due to how the infrastructure is set up, trackings can arrive out of order: a tracking received by our systems first might get written to the database second or third. The following sequence illustrates this; a sketch of an update that tolerates such reordering follows the third example.
First tracking received at 1608101020176

- session-id: ‘some-session-id’ (create new record)
- first-action: 1608101020176
- last-action: 1608101020176
Second tracking received at 1608108020176

- session-id: ‘some-session-id’ (update existing record)
- first-action: 1608101020176
- last-action: 1608108020176
Third tracking received at 1608104020176
- session-id: ‘some-session-id’ (update existing record)
- first-action: 1608101020176
- last-action: 1608108020176
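One way to keep the record correct under this ordering is to store first-action as the minimum and last-action as the maximum of all timestamps seen so far, guarded by an optimistic generation check. A sketch of that idea with the Python client (an illustration under the assumed names above, not our exact production code):

```python
import aerospike
from aerospike import exception as ex

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

def apply_tracking(session_id, action_ts_ms, retries=5):
    key = ('tracking', 'session', session_id)
    for _ in range(retries):
        try:
            _, meta, bins = client.get(key)
        except ex.RecordNotFound:
            try:
                # First tracking for this session: create the record.
                client.put(key,
                           {'first-action': action_ts_ms,
                            'last-action': action_ts_ms},
                           meta={'ttl': aerospike.TTL_NEVER_EXPIRE},
                           policy={'exists': aerospike.POLICY_EXISTS_CREATE})
                return
            except ex.RecordExistsError:
                continue  # another writer created it first; re-read and merge
        new_bins = {
            'first-action': min(bins['first-action'], action_ts_ms),
            'last-action': max(bins['last-action'], action_ts_ms),
        }
        try:
            # Write only if the record is unchanged since we read it.
            client.put(key, new_bins,
                       meta={'gen': meta['gen']},
                       policy={'gen': aerospike.POLICY_GEN_EQ})
            return
        except ex.RecordGenerationError:
            continue  # concurrent update; retry
    raise RuntimeError('could not apply tracking after %d retries' % retries)
```

With this pattern, the third tracking above leaves both bins untouched, because its timestamp falls between the stored first-action and last-action.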
Report Calculation
To generate a report over the created records, we run a query stream and average (last-action - first-action) across all existing records, as sketched below.
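A sketch of that calculation with the Python client; a query without a predicate streams every record of the set back to the client (namespace, set, and bin names as assumed above):

```python
import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

def average_session_timespan_ms():
    query = client.query('tracking', 'session')
    query.select('first-action', 'last-action')
    state = {'sum': 0, 'count': 0}

    def collect(record_tuple):
        # foreach() passes (key, metadata, bins) for every streamed record.
        _, _, bins = record_tuple
        state['sum'] += bins['last-action'] - bins['first-action']
        state['count'] += 1

    query.foreach(collect)
    return state['sum'] / state['count'] if state['count'] else 0.0
```

Every record crosses the network and is aggregated client-side, so the runtime of this report grows with the total number of sessions.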
Issue
While this system works fine for ordinary amounts of trackings (~3 million raw trackings take 2-3 seconds), we experience long query times or even timeouts when there are far more trackings (75 million raw trackings take ~150 seconds).
Question
Our question now is how to solve a situation like this. We assume the data model is not a perfect fit for the use case, but we also don’t know how to solve the issue otherwise. This leads us to the main question:
What is the best practice design pattern for a use case like this?