Fastest way to count records returned by a query

query
aggregation

#1

Hi all,

We have just started looking at using Aerospike for some of our data storage and had a question that I can’t seem to find in the docs.

If I have a

SELECT * FROM user_profile.west WHERE last_activity BETWEEN 340 AND 345

How do I get the number of rows returned without actually getting the rows? Do I need to do a full aggregation query?

Thanks


#2

That’s an aggregation, so you’d have to do it with a stream UDF.


#3

Hey - This would be half impossible as the query is very dynamic and I won’t have the filters in the filter step.


#4

How would it be impossible? Let’s take an RDBMS for comparison - if the number of records in a query is changing rapidly, then the count is also changing with it. Each time you query it would be different. No difference there. A COUNT() aggregation function is executed server-side as well.

In the case of the secondary index we don’t yet have native aggregation functions, so you need to implement it via a stream UDF. For your case you don’t need a filter at all. You give it the BETWEEN predicate, and the records matched by it in the secondary index stream through the UDF. All you need to do is have a simple mapper that returns 1, and a reducer to further sum it up.

I have a set sp with a bin i, and a secondary index over it.

CREATE INDEX test_sp_i_idx ON test.sp(i) NUMERIC

I inserted 100 records with consecutive values for i.

The stream UDF is in a module named aggr.lua

local function counter(record)
  return 1
end

local function sum(v1, v2)
  if type(v1) == 'number' and type(v2) == 'number' then
    return v1 + v2
  else
    return 0
  end
end

function count(stream, password)
  return stream : map(counter) : reduce(sum)
end

I’ll use a simple Python script to call it

from __future__ import print_function
import aerospike
from aerospike import predicates as p

client = aerospike.client({ 'hosts': [('192.168.119.3', 3000)]}).connect()
#client.udf_put('aggr.lua')

s = client.query('test','sp')
s.where(p.between('i', 1, 50))
s.apply('aggr', 'count', [])
r = s.results()
print(r)

The result is

[50]

#5

There’s also an example of GROUP BY [HAVING] in the Python client: https://github.com/aerospike/aerospike-client-python/blob/master/examples/client/stream_example.lua


#6

Wow - I misunderstood and apparently its possible and quite easy to write as well!

Thank you.!


#7

Two and a half years after the fact, there’s definitely a faster way to do this now without using Lua.

The clients now have an option for the query to not return any bins, so just the metadata comes back. It would be faster to iterate over this very small result set on the application side and simply count them there.