Which solution is the fastest at large scale?


#1

I don’t know exactly which operations are faster than others, so I’m not sure which way to go here…

Here’s my case: I need to store events to be sent several hours later to a number of users. I don’t need to store any data for these events, only a time and a recipient, and I never need more than one event per user. I also occasionally need to delete these events to prevent them from being sent. A cron script runs every minute, retrieves from Aerospike the messages to send for the current timestamp / 60 (which only changes once per minute), and sends the messages to the retrieved “recipient” values.
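The bucketing described above can be sketched in a few lines (an illustration only; the function name and the sample timestamp are mine, the 60-second slot key is from the post):

```python
def slot_key(ts: float, slot_seconds: int = 60) -> int:
    """Bucket a Unix timestamp into a per-minute slot key (timestamp // 60)."""
    return int(ts) // slot_seconds

# Every call within the same minute yields the same key, so a cron job
# running once per minute reads exactly one slot.
now = 1700000042
assert slot_key(now) == slot_key(now + 17)   # same minute, same slot
assert slot_key(now) != slot_key(now + 60)   # next minute, next slot
```

Events are stored under the slot key computed at insert time, and the cron job queries `slot_key(time.time())` when it fires.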

Adding and removing messages for users will happen much more often than reading the messages for a given timestamp, but reads will cause bigger activity spikes because all the messages for a slot are retrieved at once.

I have 2 solutions in mind:

  • The 1st solution is to save an in-memory record with the “user_id” as PK, a bin named “recipient” containing that same user id, and a bin named “time_to_send” containing the divided timestamp, with a secondary index on “time_to_send”. To add or delete a message, I just put a new record or delete the one at the user_id key. But I’m concerned about the speed of the secondary-index query when I read the messages…

  • The 2nd solution is the opposite: save an in-memory record with the “time_to_send” as PK and a list of “recipients”. I would use a UDF to append messages to and remove them from the list. But I’m concerned about the execution speed of the Lua script.
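For concreteness, here is a minimal in-memory Python sketch of the two layouts, with plain dicts standing in for Aerospike records (all names are illustrative; the secondary index of solution 1 is simulated by a linear scan, and deleting by user in solution 2 assumes you know the user’s slot, which holds here since there is at most one event per user):

```python
# Solution 1: one record per user, keyed by user_id.
by_user = {}  # user_id -> {"recipient": ..., "time_to_send": ...}

def add_event_v1(user_id, slot):
    by_user[user_id] = {"recipient": user_id, "time_to_send": slot}

def del_event_v1(user_id):
    by_user.pop(user_id, None)

def query_v1(slot):
    # Stands in for the secondary-index query on "time_to_send".
    return [r["recipient"] for r in by_user.values() if r["time_to_send"] == slot]

# Solution 2: one record per time slot, holding all recipients.
by_slot = {}  # slot -> set of user_ids

def add_event_v2(user_id, slot):
    by_slot.setdefault(slot, set()).add(user_id)

def del_event_v2(user_id, slot):
    s = by_slot.get(slot)
    if s is not None:
        s.discard(user_id)
        if not s:
            del by_slot[slot]  # drop the slot record when it empties
```

Solution 1 makes add/delete a single keyed write; solution 2 makes the per-minute read a single keyed fetch but turns every add/delete into a read-modify-write of the slot record.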

Thanks in advance.


#2

PK operations are always the fastest and scale the best. Depending on how you access and use the data, and on its volume, it may need to be modeled differently.

What is the maximum number of users you expect to be tied to a given time slot at any one time, and roughly how many bytes would that represent?

Also are there any other use cases you have for this data, such as other ways to access or view it later down the line?

I’m thinking you could also use the time slot as the PK and store a Map in that record with all the users: http://www.aerospike.com/docs/guide/cdt-map.html No UDF required. You could then batch get the users out of that map and perform inserts/deletes as necessary. This of course assumes the data in that record doesn’t exceed 1MB though.
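To get a feel for where that 1MB figure bites, a rough back-of-the-envelope check (the per-entry size is an assumption for illustration, not a measured figure; the 1MB limit applies to storage-backed namespaces, as the thread goes on to note):

```python
def slot_record_bytes(n_users: int, bytes_per_entry: int = 24) -> int:
    """Rough size of a map with n_users entries, assuming ~24 bytes per
    user-id entry including map overhead (illustrative, not measured)."""
    return n_users * bytes_per_entry

WRITE_BLOCK = 1024 * 1024  # 1 MiB write-block limit for on-disk namespaces

# At ~24 bytes/entry, roughly 43k users fill a 1 MiB block:
assert slot_record_bytes(43_000) < WRITE_BLOCK
assert slot_record_bytes(44_000) > WRITE_BLOCK
```

With tens of thousands of users per minute slot, the single-record layout gets close to the limit quickly, which is why the in-memory caveat below matters.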


#3

I expect several megabytes per time slot. What do you mean by “This of course assumes the data in the PK doesn’t exceed 1MB though”? I thought the data in memory was not subject to block limitations…


#4

I also found this in another topic:

This is interesting… Above what size is it better to split the record in two for performance? As soon as I hit 1MB? 5MB? 20MB?

Also, I tried to find documentation for Map operations in the PHP client… I think it’s due for an update: it is a year old and doesn’t include anything about maps, except this: https://github.com/aerospike/aerospike-client-php/blob/master/doc/aerospike_ldt.md

Which is full of broken links…


#5

Ah, I didn’t realize you were in-memory only. In that case, I think you should give it a shot.


#6

Please don’t use LDTs, as they are deprecated and being removed. You can already store more than 1MB of data since you are in-memory only. As for the PHP documentation, I don’t have any experience with it.

Regarding your comment on splitting large records, I don’t think that applies here, since you consume all of the data at one time anyway.

At any rate, I’d suggest trying it several different ways and benchmarking the differences. For secondary indexes, the cost is memory and scalability.
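A minimal harness for the kind of comparison suggested here could look like the following (a pure-Python sketch of the methodology only: it times dict stand-ins for the two layouts, not Aerospike, so the absolute numbers say nothing about the server; all names are mine):

```python
import time

def bench(fn, n=100_000):
    """Time n calls of fn(i); returns elapsed seconds (machine-dependent)."""
    start = time.perf_counter()
    for i in range(n):
        fn(i)
    return time.perf_counter() - start

records = {}   # solution 1: one record per user
slots = {}     # solution 2: one map per time slot

def put_record(i):
    records[i] = {"recipient": i, "time_to_send": i % 60}

def map_append(i):
    slots.setdefault(i % 60, {})[i] = i

t1 = bench(put_record)
t2 = bench(map_append)
print(f"record put: {t1:.3f}s, map append: {t2:.3f}s")
```

The real benchmark would replace the two inner functions with client calls against the two Aerospike layouts under the production mix of adds, deletes, and per-minute reads.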

Hoping someone else can weigh in on the data modeling and PHP here.


#7

I asked on the PHP client forum, but haven’t gotten any reply so far… So in the meantime, I made a UDF:

-- Insert or update one (key, value) entry in the map stored in `bin`.
function set(rec, bin, key, value)
    local m = rec[bin]
    if m == nil then
        m = map()                      -- first entry: create the map
    end
    m[key] = value
    rec[bin] = m
    if aerospike:exists(rec) then
        return aerospike:update(rec)
    else
        return aerospike:create(rec)   -- record did not exist yet
    end
end

-- Remove one entry from the map in `bin`; delete the whole record
-- once the map becomes empty.
function del(rec, bin, key)
    local m = rec[bin]
    if m == nil or m[key] == nil then
        return 1                       -- nothing to remove
    end
    map.remove(m, key)
    if map.size(m) == 0 then
        return aerospike:remove(rec)   -- last entry gone: drop the record
    else
        rec[bin] = m
        return aerospike:update(rec)
    end
end

In any case, this works, and I can benchmark it against the other solution.


#8

After benchmarking the two solutions:

  • Using a UDF to add and remove events in a map
  • Putting and removing one record per event, and querying them with a secondary index

I realized a few things:

  • Adding/removing a record is twice as fast as adding/removing a key/value pair in a map via a UDF.
  • Individual records take twice as much space in the DB as the map does.
  • The query on the secondary index is quite fast: on my test server, it takes about 100,000 entries before the search exceeds 1 second.

So since I need the add/remove operations to be as fast as possible, and I only read the event lists once in a while, I’ll go with secondary indexes… The additional space taken by the records isn’t that big.


#9

Again, map operations do not require a UDF, as far as I know. It is a fairly new feature, but I still think it’s worth exploring.


#10

I asked on the PHP client forum how to use map operations, and rbotzer told me they are not implemented yet. I will run the benchmark again once they are, to see if they do any better than UDFs.