How to implement "count" with "group by" and multiple filters efficiently

Hi! We have been using Aerospike for a long time, and now we are stuck implementing a complex count query.

What we have:

  1. 260,000,000 records (350 GB on SSD) with these bins: owner_id, end_date, message_type, is_read, message_params.
  2. 7,000,000 unique owner_id values, with a secondary index on owner_id.
  3. 3 nodes.

What we need:

  1. A few times per day we insert one "message" per owner_id and need to calculate counts like: SELECT COUNT(1), message_type FROM table WHERE owner_id = 'some-owner-id' AND end_date > NOW() AND is_read = 0 GROUP BY message_type

Right now, for each owner_id we query all items with end_date > NOW() and do the rest "in code". We run this across 200 threads, which leads to very high IO wait. In effect we read nearly all of the data.
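For reference, the client-side aggregation described above looks roughly like this (a simplified sketch; the message dicts stand in for records already fetched from Aerospike, and the field names match the bins listed earlier):

```python
from collections import Counter

def count_by_type(messages, now):
    """Client-side GROUP BY: filter unread, unexpired messages,
    then tally occurrences per message_type."""
    counts = Counter()
    for msg in messages:
        if msg["end_date"] > now and msg["is_read"] == 0:
            counts[msg["message_type"]] += 1
    return dict(counts)
```

The cost here is not the tallying itself but that every record for the owner must be read from disk just to be counted.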

So the question is: how can we do these count operations without a huge number of IO reads? Maybe we can configure indexes in a way that lets the count happen in memory, e.g. an index on owner_id + end_date + is_read + message_type.

We're out of ideas… any help is appreciated.

Creating separate counter records, whose primary key is the combination of interest, and keeping the counts up to date in parallel with your writes is certainly one way to do it, if you don't want to scan all the records and aggregate the counts later, either on the client or via a stream UDF.
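A minimal sketch of that counter-record idea, assuming you can hook the events where a message is created and where it is marked read: keep one counter per (owner_id, message_type) and adjust it atomically on each event. A plain dict stands in for the namespace here; with the real Aerospike client you would store one record per composite key and use an atomic increment operation on a counter bin instead. Expiry (end_date) is not handled in this sketch; you would need to decrement on expiry, or bucket the counters by expiry date.

```python
class CounterStore:
    """Simulated per-(owner_id, message_type) unread counters.
    In Aerospike this would be one record per composite key,
    updated atomically as messages are written and read."""

    def __init__(self):
        self.records = {}  # composite key -> unread count

    def _key(self, owner_id, message_type):
        # Composite primary key; in Aerospike this string would be
        # the user key of the counter record.
        return f"{owner_id}:{message_type}"

    def on_new_message(self, owner_id, message_type):
        k = self._key(owner_id, message_type)
        self.records[k] = self.records.get(k, 0) + 1

    def on_message_read(self, owner_id, message_type):
        k = self._key(owner_id, message_type)
        self.records[k] = self.records.get(k, 0) - 1

    def counts_for(self, owner_id):
        """Answers the GROUP BY query with key-value reads only."""
        prefix = f"{owner_id}:"
        return {k.split(":", 1)[1]: v
                for k, v in self.records.items()
                if k.startswith(prefix) and v > 0}
```

The trade-off is a small amount of extra work on every write in exchange for turning the count query into a handful of cheap primary-key reads instead of a scan.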