We’re using Aerospike to store user data: the key of each record is the user_id and the value is a record with multiple bins.
One of the bins, named Aud, is of type list and stores the audience segments this user belongs to. The Aud bin holds substantial data (currently in the KBs per record) and could grow in the future.
For serving an impression to a user, we:
- Fetch the user’s record, i.e., all bins for that user, into AdServer (sketched below).
- Using the segments in the Aud bin, evaluate the Deals applicable to the user, i.e., the user_id => deals mapping.
- Do further processing.
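For concreteness, the current hot-path fetch looks roughly like this with the Go client (a minimal sketch; the v6 import path, namespace "test", set "users", and the fetchProfile helper are illustrative, not our actual names):

    import (
        as "github.com/aerospike/aerospike-client-go/v6"
    )

    // fetchProfile is the current hot-path read: every bin, including the
    // heavy Aud list, crosses the network on each impression.
    func fetchProfile(client *as.Client, userID string) (as.BinMap, error) {
        key, err := as.NewKey("test", "users", userID) // namespace/set are placeholders
        if err != nil {
            return nil, err
        }
        rec, err := client.Get(nil, key) // nil policy = client defaults; fetches all bins
        if err != nil {
            return nil, err
        }
        return rec.Bins, nil // rec.Bins["Aud"] holds the heavy segment list
    }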
Evaluating the
user_id => deals
mapping is expensive because:
1. The Aud bin carries a heavy payload, so fetching it incurs a network cost.
2. The algorithm needs substantial compute to evaluate the mapping, i.e., a compute cost at AdServer.
Both 1 and 2 happen in the hot path, i.e., during the impression-serving path, making this a hotspot.
Hence we think this is a good candidate for offline evaluation: we would create a new bin called Deals which would “somehow” be populated, i.e., we would evaluate the user_id => deals mapping offline and store it in the new Deals bin. To keep the mapping fresh, we also need a Deals TTL bin holding the expiry time of this mapping.
So the new schema would look like this:

    Key: user_id
    Bins:
      Aud               - list of audience segments (existing)
      Deals             - applicable deals for the user (new)
      Deals TTL         - expiry time of the Deals mapping (new)
      Bin(3) ... Bin(N) - other existing bins, unchanged

where Deals and Deals TTL are the new bins being populated.
We are planning that after evaluating the mapping, AdServer would write it to the Deals bin and set the Deals TTL bin to a future time, say one hour from now. The Deals TTL bin acts as an expiry time for the Deals bin. This optimizes (2) above; a sketch of the write-back follows.
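A minimal sketch of that write-back with the Go client, under the same placeholder names as above (the one-hour window is the freshness value from the text):

    import (
        "time"

        as "github.com/aerospike/aerospike-client-go/v6"
    )

    // writeDeals persists the freshly evaluated mapping plus its expiry.
    func writeDeals(client *as.Client, userID string, deals []string) error {
        key, err := as.NewKey("test", "users", userID)
        if err != nil {
            return err
        }
        bins := as.BinMap{
            "Deals":     deals,
            "Deals TTL": time.Now().Add(time.Hour).Unix(), // mapping expiry as epoch seconds
        }
        // Put touches only these two bins; Aud and the rest are left untouched.
        return client.Put(nil, key, bins)
    }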
To optimize the network cost as well, i.e., fetch the Aud bin only when the Deals mapping has expired, we are thinking of using a UDF. The UDF would implement something like the following (sketched here as a Lua record UDF; the function name and the now argument, passed in by the caller, are illustrative):
function read_profile(rec, now)
    local out = map()
    local expiry = rec['Deals TTL']
    if expiry == nil or now >= expiry then
        -- Mapping is stale: return the heavy Aud bin so the caller can re-evaluate deals.
        out['Aud'] = rec['Aud']
    else
        -- Mapping is fresh: return the precomputed deals and skip the Aud payload.
        out['Deals'] = rec['Deals']
    end
    -- Pass the remaining bins, Bin(3)..Bin(N), through as-is.
    for i, name in ipairs(record.bin_names(rec)) do
        if name ~= 'Aud' and name ~= 'Deals' and name ~= 'Deals TTL' then
            out[name] = rec[name]
        end
    end
    return out
end
Note that either the Aud or the Deals bin is sent, conditioned on Deals TTL versus the current time; the remaining bins are sent “as is”.
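Invoking this from the Go module would look roughly like this (a sketch assuming the UDF is registered under a user_profile Lua package; the C client has the equivalent aerospike_key_apply call):

    import (
        "time"

        as "github.com/aerospike/aerospike-client-go/v6"
    )

    // readProfile applies the record UDF server-side; only the returned map
    // (Aud or Deals, plus the pass-through bins) crosses the network.
    func readProfile(client *as.Client, key *as.Key) (interface{}, error) {
        return client.Execute(nil, key, "user_profile", "read_profile",
            as.NewValue(time.Now().Unix())) // current time for the Deals TTL comparison
    }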
Questions:
- Will UDFs scale for this use-case? It looks like the perfect tool to me. We aren’t using any UDFs currently; AdServer uses the C client library, and another module (not part of AdServer) uses the Golang client library. Is getting a record as a map via a UDF more expensive than simply fetching the bins? If yes, why?
- Can we do this using Aerospike Operation Expressions instead of a UDF? (A sketch of what we have in mind appears after these questions.)
- How can we reliably measure the performance impact, if any, on a test cluster? I believe Aerospike does some level of sandboxing of these UDFs for safety; will that adversely affect performance? By impact we mean the additional overhead in CPU/memory and latency imposed on Aerospike by this UDF versus the existing approach, not just the client-side latency (which we can already measure).
- Consider this alternate approach → what if we write a background UDF that scans the entire set and implements the logic to deduce the
user => deal
mapping? Is a UDF the right tool for such a job, scheduled, say, every hour? Note that we have billions of user records. (A sketch of launching such a background UDF also follows.)
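For the Operation Expressions question, here is a rough sketch of what we mean, assuming the v6 Go client’s expression-operation API and a server new enough for conditional expressions (5.6+); the Result bin name and the Bin3 read are illustrative:

    import (
        "time"

        as "github.com/aerospike/aerospike-client-go/v6"
    )

    // readProfileExpr picks Aud or Deals server-side with a conditional
    // read expression instead of a UDF.
    func readProfileExpr(client *as.Client, key *as.Key) (*as.Record, error) {
        now := as.ExpIntVal(time.Now().Unix())
        // If Deals TTL has passed, yield the Aud list; otherwise yield Deals.
        pick := as.ExpCond(
            as.ExpLess(as.ExpIntBin("Deals TTL"), now), as.ExpListBin("Aud"),
            as.ExpListBin("Deals"),
        )
        return client.Operate(nil, key,
            as.ExpReadOp("Result", pick, as.ExpReadFlagDefault), // computed bin in the reply
            as.GetBinOp("Bin3"), // remaining bins, Bin(3)..Bin(N), read as-is
        )
    }

One wrinkle we already see: the reply alone doesn’t say which branch produced Result, so we would probably also read the Deals TTL bin.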
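For the last question, launching such a background job from Go might look like this (a sketch: refresh_deals is a hypothetical record UDF that recomputes and rewrites the Deals/Deals TTL bins, which presumes the deal rules are somehow available to it; package, namespace, and set names are placeholders):

    import (
        as "github.com/aerospike/aerospike-client-go/v6"
    )

    // refreshAll kicks off a server-side background job that applies the
    // record UDF to every record in the set.
    func refreshAll(client *as.Client) error {
        stmt := as.NewStatement("test", "users") // no filter => full scan of the set
        task, err := client.ExecuteUDF(nil, stmt, "deals", "refresh_deals")
        if err != nil {
            return err
        }
        return <-task.OnComplete() // block until the job completes or fails
    }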