How to Count Values in a Bin


#1

#Synopsis:

Aerospike does not provide a function to count the number of records in a bin.

You can write a simple lua module, and call the module in a client API to count the records in a bin.

For this exercise, we use the following data set:

aql> select * from DataStore
+------------+-------+---------+
| frequency  |  foo  | recency |
+------------+-------+---------+
| “eleven”   |       |         |
|            | “nnn” |         |
| “thirteen” |       |         |
| “four”     |       | “ddd”   |
| “ten”      |       | “jjj”   |
| “three”    |       | “ccc”   |
| “five”     |       | “eee”   |
| “two”      |       | “bbb”   |
|            |       | “mmm”   |
|            |       | “lll”   |
| “one”      |       | “aaa”   |
| “seven”    |       | “ggg”   |
| “nine”     |       | “iii”   |
| “twelve”   |       |         |
| “eight”    |       | “hhh”   |
| “six”      |       | “fff”   |
+------------+-------+---------+
16 rows in set (0.460 secs)

We want to count the number of values in the ‘frequency’ bin.

Create a secondary index on the ‘frequency’ bin:

aql> show indexes
+-------------+-------------+-----------+----------+-------+-----------+------------+--------+
|     ns      |    bins     |    set    | num_bins | state | indexname | sync_state |  type  |
+-------------+-------------+-----------+----------+-------+-----------+------------+--------+
| “DataStore” | “frequency” | “Profile” |        1 | “RW”  | “freq_1”  | “synced”   | “TEXT” |
+-------------+-------------+-----------+----------+-------+-----------+------------+--------+
1 row in set (0.000 secs)

You could query the secondary index directly, but that solution is only practical for a very small set of data:

aql> select frequency from DataStore.Profile
+------------+
| frequency  |
+------------+
| “eleven”   |
| “thirteen” |
| “four”     |
| “ten”      |
| “three”    |
| “five”     |
| “two”      |
| “seven”    |
| “one”      |
| “nine”     |
| “twelve”   |
| “eight”    |
| “six”      |
+------------+
13 rows in set (0.464 secs)

The solution is to write a simple lua module that performs the aggregation on the bin. Save the following lua code in a file named countNumOfBinsWithValues.lua. The lua code performs the aggregation:

local function accumulate(counter,rec)
  -- Initialze the counter; First time around
  if counter == nil then
    counter = 0
  end

  -- Check if 'frequency' bin is not nil (=has some value) -- if so, increment the counter
  if rec.frequency ~= nil then
    counter = counter + 1 
  end

  return counter
end

local function merge(a,b)
  a = a + b
info(tostring(a))
  return a
end

function count(stream)
  -- Process incoming tuples and pass it to aggregate function, then to reduce function
  --   NOTE: aggregate function 'accumulate' accepts two parameters: 
  --    1) counter  
  --    2) function name 'accumulate' -- which will be called for each record as it flows in
  -- Return reduced value of the counter generated by reduce function merge
  return stream : aggregate(counter,accumulate) : reduce(merge)
end

Use the following syntax to import the module:

$ aql -c “register package ‘./countNumOfBinsWithValues.lua’”

The client tools do not support calling lua modules, so you must call the module from one of the API clients. For this example, we use a Python application. Save the following code in a file called pycli.py:

import the module
from future import print_function
import aerospike
from aerospike import predicates as p

config = {‘hosts’: [ (‘127.0.0.1’, 3000) ]}
try:
    client = aerospike.client(config).connect()
except:
    print(“failed to connect to the cluster with”. config[‘hosts’])
    sys.exit(1)
    
#Query namespace DataStore, set Profile:
query = client.query( ‘DataStore’, ‘Profile’)
#Query bin frequency:
query.select( ‘frequency’ )
query.apply(‘countNumOfBinsWithValues’, ‘count’)

# callback function prints the records as they are read
def print_result(value):
    print(value)

#Execute the query and call print_result for each result
    query.foreach(print_result)

#Close the connection to the Aerospike cluster
    client.close()

The pycli.py application returns the proper 13 records from the ‘frequency’ bin: $ python pycli.py 13