Geospatial

Hi everyone,

I came across Aerospike a week ago, read all the documentation, and tested it. Its approach is quite different from any other NoSQL implementation I’ve used, but I’m satisfied with this database. Great work (although, given its shared-nothing architecture, I wasn’t expecting it to be limited in the number of nodes).

We are designing an API that leverages geospatial indexes heavily, so we are considering Aerospike for its speed and for its geo indexes based on S2 (which I believe are very good). Since the documentation didn’t address my concerns, I’m asking for further information about the implementation.

From reading the docs, these are the conclusions I was able to draw:

  • They use S2 to produce a CellID (an integer) and store it in a secondary index.
  • When they query an area, they produce the required S2 cells and run a few range queries on the secondary index.
  • Secondary indexes are colocated with the data, and data is placed according to the hash of the key, so queries are sent to ALL nodes.

Here is my concern:

If the above is true, it means that a query for all points in a specific area results in multiple range queries (one per S2 cell produced to cover the area), and those hit EVERY SINGLE node in the cluster. I think that in a query-intensive scenario this will quickly become a bottleneck, and horizontal scaling will not help at all.

Am I wrong? Are there other optimizations, such as S2 cell partitioning, that help improve performance? Or can a geographic point (S2 cell) be used as the primary index, so that data is partitioned by CellID and range queries do not have to hit all nodes?

Thanks for your time.

The scatter-gather approach of sending the query to all nodes and running it on them in parallel is not inherently bad (nor is it always good). It depends on the query and its result set.

If the data and the query are such that the result contains many records coming from all the nodes, scatter-gather works in our favor. This is the sweet spot of secondary index queries in Aerospike in general, mainly because the query executes in parallel. Each node looks up its own secondary index (which covers its local data) and returns its results. The results from all the nodes in the cluster are merged at the client. In the specific case of geo area queries, an area consists of many S2 cell IDs, depending on the resolution, so searching for all of them with a scatter-gather approach is not necessarily bad.
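To make the flow above concrete, here is a minimal in-memory sketch of scatter-gather over per-node secondary indexes. It is not Aerospike code: the `Node` class, `node_for`, `scatter_gather`, the CRC32 placement hash, and the integer cell IDs are all assumptions made for illustration, and the loop over nodes stands in for what a real cluster would run in parallel.

```python
import bisect
import zlib


class Node:
    """Hypothetical cluster node with a local secondary index on cell ID."""

    def __init__(self):
        self.index = []  # sorted list of (cell_id, primary_key)

    def put(self, key, cell_id):
        bisect.insort(self.index, (cell_id, key))

    def range_query(self, lo, hi):
        # Return primary keys whose cell_id lies in the inclusive range [lo, hi].
        start = bisect.bisect_left(self.index, (lo, ""))
        end = bisect.bisect_left(self.index, (hi + 1, ""))
        return [key for _, key in self.index[start:end]]


def node_for(key, n_nodes):
    # Placement depends only on a hash of the primary key, not on location.
    # CRC32 is a stand-in for whatever hash the database really uses.
    return zlib.crc32(key.encode()) % n_nodes


def scatter_gather(nodes, cell_ranges):
    results = []
    for node in nodes:  # every node is asked, whatever the area
        for lo, hi in cell_ranges:
            results.extend(node.range_query(lo, hi))
    return sorted(results)  # merge step, done on the client


nodes = [Node() for _ in range(3)]
points = {"p1": 100, "p2": 105, "p3": 250, "p4": 251, "p5": 900}
for key, cell in points.items():
    nodes[node_for(key, len(nodes))].put(key, cell)

# Two cell ranges standing in for the covering of one query area.
hits = scatter_gather(nodes, [(100, 110), (250, 260)])
print(hits)
```

Because placement ignores location, the matching points can sit on any node, which is why each node has to check its own index.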

I guess you are alluding to the opposite case. It is true that the scatter-gather approach is not a good fit when the result set of the query is very small. If the selectivity of the query is very high, the overhead of the scatter-gather mechanism will outweigh the benefit of parallelism. But first I suggest you measure the time taken yourself, and then decide on further action.

We can consider alternate data models to avoid falling into this trap. For example, the application itself can maintain a pre-aggregated list (which will be the result), if that is possible. Say an S2 cell has very few points (1-10), and the application knows the cell ID (or has its own notion of area IDs). Then you may be able to use the S2 cell ID (or area ID) as the primary key itself and store the points in Aerospike’s native list data type.
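A rough sketch of that data model follows, with a plain dictionary standing in for the key-value store and its list type. Everything named here is hypothetical: `area_id` quantizes coordinates onto a fixed grid as a stand-in for a real S2 cell ID, and `CELLS_PER_DEGREE`, `add_point`, and `points_in_area` are illustrative names only.

```python
import math

CELLS_PER_DEGREE = 100  # hypothetical grid resolution, not an S2 level

# Stand-in for the database: primary key -> list of points.
store = {}


def area_id(lat, lon):
    """Hypothetical area ID: a fixed grid cell, standing in for an S2 cell ID."""
    return (math.floor(lat * CELLS_PER_DEGREE), math.floor(lon * CELLS_PER_DEGREE))


def add_point(point_id, lat, lon):
    # Write path: append the point to the list stored under its area's key.
    store.setdefault(area_id(lat, lon), []).append((point_id, lat, lon))


def points_in_area(lat, lon):
    # Read path: one key lookup, served by whichever node owns that key.
    return store.get(area_id(lat, lon), [])


add_point("a", 48.8566, 2.3522)
add_point("b", 48.8570, 2.3529)
add_point("c", 40.7128, -74.0060)

nearby = [point_id for point_id, _, _ in points_in_area(48.8561, 2.3525)]
print(nearby)
```

The trade-off is that the application now owns the aggregation: every write has to update the right list, and a cell that grows far beyond a handful of points would need a different layout.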

I would further state that if you are trying to do millions of lookups a second, with millisecond accuracy, you are better off running the S2 libraries in your application code and having them call Aerospike as a key-value store. This allows you to fine-tune the layers you wish to load, and to have different accuracy at different points on the globe.
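As an illustration of that split, the sketch below computes the covering in application code and then issues plain key lookups. It does not use the real S2 library: the fixed grid, the `level_for` policy with its hard-coded dense region, and the `covering` and `lookup` helpers are all assumptions, and the dictionary stands in for the key-value store.

```python
import math


def level_for(lat, lon):
    """Hypothetical policy: finer cells in one dense region, coarser elsewhere."""
    dense = 48.0 <= lat <= 49.5 and 1.5 <= lon <= 3.5
    return 100 if dense else 10  # cells per degree


def covering(lat_min, lon_min, lat_max, lon_max):
    # Choose the resolution from the centre of the box, then list every
    # grid cell the box touches. Each cell becomes one primary key.
    per_deg = level_for((lat_min + lat_max) / 2, (lon_min + lon_max) / 2)
    rows = range(math.floor(lat_min * per_deg), math.floor(lat_max * per_deg) + 1)
    cols = range(math.floor(lon_min * per_deg), math.floor(lon_max * per_deg) + 1)
    return [(per_deg, r, c) for r in rows for c in cols]


def lookup(kv, keys):
    # Stand-in for a batch get against the key-value store.
    found = []
    for key in keys:
        found.extend(kv.get(key, []))
    return found


fine_keys = covering(48.855, 2.355, 48.875, 2.365)    # inside the dense region
coarse_keys = covering(10.05, 20.05, 10.15, 20.15)    # outside it

kv = {(100, 4885, 235): ["a"], (100, 4887, 236): ["b"]}
print(len(fine_keys), len(coarse_keys), lookup(kv, fine_keys))
```

Each key in the covering maps to exactly one node, so the number of nodes touched is bounded by the number of cells rather than by the size of the cluster.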

If you are interested in tens of thousands of queries per second, or around 100k qps, we’ve found that the S2 library built into Aerospike works great.