Fastest way to count records returned by a query

yannispsarras · January 12, 2016, 5:58pm

Hi all,

We have just started looking at using Aerospike for some of our data storage and have a question whose answer I can’t seem to find in the docs.

If I have a query such as

SELECT * FROM user_profile.west WHERE last_activity BETWEEN 340 AND 345

How do I get the number of rows returned without actually getting the rows? Do I need to do a full aggregation query?

Thanks

rbotzer · January 12, 2016, 6:11pm

That’s an aggregation, so you’d have to do it with a stream UDF.

yannispsarras · January 12, 2016, 6:17pm

Hey, this would be nearly impossible, as the query is very dynamic and I won’t have the filters available in the filter step.

rbotzer · January 12, 2016, 6:42pm

How would it be impossible? Let’s take an RDBMS for comparison - if the number of records in a query is changing rapidly, then the count is also changing with it. Each time you query it would be different. No difference there. A COUNT() aggregation function is executed server-side as well.

In the case of the secondary index we don’t yet have native aggregation functions, so you need to implement it via a stream UDF. For your case you don’t need a filter at all. You give it the BETWEEN predicate, and the records matched by it in the secondary index stream through the UDF. All you need to do is have a simple mapper that returns 1, and a reducer to further sum it up.

I have a set sp with a bin i, and a secondary index over it.

CREATE INDEX test_sp_i_idx ON test.sp(i) NUMERIC

I inserted 100 records with consecutive values for i.

The stream UDF is in a module named aggr.lua

local function counter(record)
  return 1
end

local function sum(v1, v2)
  if type(v1) == 'number' and type(v2) == 'number' then
    return v1 + v2
  else
    return 0
  end
end

function count(stream)
  return stream : map(counter) : reduce(sum)
end

I’ll use a simple Python script to call it

from __future__ import print_function
import aerospike
from aerospike import predicates as p

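client = aerospike.client({ 'hosts': [('192.168.119.3', 3000)]}).connect()
# Register the UDF module with the cluster (only needed once)
#client.udf_put('aggr.lua')

s = client.query('test','sp')
s.where(p.between('i', 1, 50))  # range predicate on the secondary index
s.apply('aggr', 'count', [])    # run aggr.count as a stream UDF, no extra arguments
r = s.results()
print(r)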

The result is

[50]
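
For anyone who wants to reason about the map and reduce steps without a cluster, here is a rough local model in plain Python. It is only a sketch: the names add and records are made up for illustration, and it does not reproduce how the server splits the reduce work across nodes.

```python
from functools import reduce

def counter(record):
    # Mapper: every record that reaches the stream contributes 1
    return 1

def add(v1, v2):
    # Reducer: sum two partial counts, mirroring the type check in the Lua version
    if isinstance(v1, (int, float)) and isinstance(v2, (int, float)):
        return v1 + v2
    return 0

def count(stream):
    # Local stand-in for stream : map(counter) : reduce(sum)
    return reduce(add, map(counter, stream))

# Hypothetical stand-in for the 50 records matched by BETWEEN 1 AND 50
records = [{'i': i} for i in range(1, 51)]
print(count(records))
```

Because the reducer only ever combines two partial counts, it gives the same total whichever order the partial results are merged in.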

rbotzer · January 12, 2016, 6:48pm

There’s also an example of GROUP BY [HAVING] in the Python client: aerospike-client-python/stream_example.lua at master · aerospike/aerospike-client-python · GitHub

yannispsarras · January 12, 2016, 6:51pm

Wow, I misunderstood; apparently it’s possible and quite easy to write as well!

Thank you!

rbotzer · July 29, 2018, 5:04pm

Two and a half years after the fact, there’s definitely a faster way to do this now without using Lua.

The clients now have an option for the query to not return any bins, so just the metadata comes back. It would be faster to iterate over this very small result set on the application side and simply count them there.
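
A minimal sketch of that approach with the Python client, assuming your client version accepts a nobins key in the query options dict (check the documentation for the version you run). The helper name count_matches is made up for this example.

```python
def count_matches(query):
    """Count the records matched by `query` without fetching any bins."""
    matched = [0]  # mutable cell so the callback can update the total

    def on_record(record):
        # Each record is a (key, meta, bins) tuple; bins is empty with nobins
        matched[0] += 1

    # Assumption: 'nobins' asks the server to send back only key and metadata
    query.foreach(on_record, options={'nobins': True})
    return matched[0]

# Usage sketch (needs a running cluster and the secondary index from above):
#   import aerospike
#   from aerospike import predicates as p
#   client = aerospike.client({'hosts': [('192.168.119.3', 3000)]}).connect()
#   q = client.query('test', 'sp')
#   q.where(p.between('i', 1, 50))
#   print(count_matches(q))
```

The count happens on the application side, so no UDF module has to be registered, but every matching record still sends its key and metadata over the network.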
