Query versus Scan


#1

Usage

The difference between the usage of scan and query is the usage of a predicate. A scan does not make use of a “WHERE” clause where as query makes use of it.

Scan

aql> select * from test.testset

Query

aql> SELECT a, b FROM test.testset WHERE b = 123

Internals

For a job of retrieving all the records in a set, a scan crawls (or reduces) all the partitions of the namespace for which the node owns the master copy and selects the records belonging to the set. The scan internally looks at the meta-data of each record and identifies whether a record belongs to the set or not. For a scan across a whole namespace, the records of all master partitions the node owns for the given namespace will be returned. But in the case of a secondary index query, the matching records are selected directly through the secondary index, in memory, before their values are looked up (either in memory or on persisted storage, based on the namespace’s storage engine configuration).

Influence of result set size on performance

As described above, regardless of a set size, when performing a scan, the whole namespace has to be crawled. Therefore, the time taken for scanning a short set is dependent of the size of the namespace it belongs to. In the case of a secondary index query, the time taken is proportional to the number of records returned.

In general, a secondary index query is at least as fast as a scan for a given result set size.

Scan or Query?

Consider the following scenario where a user considers introducing a secondary index bin into each record of a set, with all of its values equal, in order to query on this bin to retrieve the entire set. The interesting question is about the difference in performance between such query and a normal scan for the set.

Answer

Benchmarking the two scenarios on a individual cases basis would reveal the exact performance results but in general, even in this case, since the secondary index identification and corresponding primary index detection are both in memory, getting records via a secondary index query should be faster than scanning all the partitions to retrieve the records belonging to the set and the difference will be as dramatic as the namespace to which the set belongs is large.

It is also important to point out some of the current weaknesses of secondary indices:

  1. A fast restart would slow down as it would require rebuilding the secondary indexes again by scanning all the records values.
  2. Will use memory. It is also harder to size secondary index memory.
  3. Could slow down some write transactions (replace transactions, as well as prole write transactions would require the previous value to be looked up in order to clean up the secondary index).
  4. Potential inconsistencies during migrations as partitions may move around right during a secondary index query (this behavior holds true for scans). This has been filed as an improvement request AER-3291. The query-pre-reserve-partitions option can help reduce the inconsistency window for some use cases.
  5. Secondary Index Garbage Collector not clean up allocated memory if use-case involves deleting and re-adding records with different bin values (with data-in-memory false), before secondary index garbage collection runs through. This has been filed as an improvement request AER-1126.

Keywords

Secondary index sindex scan limitation memory leak

Timestamp

10/04/2017


FAQ - Scans in Aerospike
#2

Was this based on my article? :0


Aql select command encounters timeout issue when another scan job for different set is running
#3

Thanks so much, this article is very useful to me.