How to find all records by multi keys in secondary index


#1

Hi

I’m using aerospike community 3.12.1 and spring-data-aerospike 1.0.2.RELEASE

I want to find all records by list of key in secondary index.

Recently I realized sping-data-aerospike does not support “IN” clause.

Is there any other way?


#2

This really depends on how many secondary index keys you are working with and the selectivity of those keys. If you have only a few keys and these each will return a relatively large number of records, use a query for each of the keys.

On the other hand, if you have a large number of keys and/or these map to a small number of records, you may be better off creating another set which uses this index field as the primary key. Then you could do batch reads from this set.

You can do this either using the Java client directly or using Spring Data.


#3

Thank you for reply.

To be ore specific, I made the following Set structure.

@Data public class Reaction implements Serializable { @Id private String mainKey; private String postMappingId; private String userId; private String reactionType; private LocalDateTime createDate; }

And I created secondary index for postMappingId and userId.

At first, find all ‘userId’ list for ‘postMappingId’ is ‘1111’, then I want to find all ‘postMappingId’ list corresponding to that ‘userId’ list.

As you mentioned I can create two sets for postMappingId and userId, I don’t have enough memorys.

Is there really no other way?


#4

Depending on the size of your data, is it possible to fit your data for a userid in a single record (1MB limit on persistent storage - SSD or HDD including overhead)? ie a userID and all its possible postMappingIDs and associated reactionType and createDate?

bin1 - userID bin2 - Map of Maps, as: {postMappingID : {reactionType: s1, createDate: ts1}, …} where postMappingID is its values such as ‘1111’ in your example above.

On bin2, create MAPKEY index (on postMappingID), find all records that match using a SI Query and the data you need is all in bin1 and bin2.


#5

Thank you @pgupta

In fact, all data is stored in the RDBMS(MySQL) as follows.

CREATE TABLE reaction ( post_mapping_id varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL , user_id varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL , type varchar(10) COLLATE utf8mb4_unicode_ci DEFAULT NULL , create_date datetime NOT NULL DEFAULT CURRENT_TIMESTAMP , PRIMARY KEY (post_mapping_id,user_mapping_id), KEY RCT_IDX_01 (user_mapping_id), KEY RCT_IDX_02 (create_date) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

And the table is separated by range partitions. Also there are a billions of data.

In this case, if I make sets as you suggest, Can I get a response in a few seconds? and how?

I’m sorry keep asking.


#6

To estimate the flattened data in the map, how many postMappinIDs will be associated with one useriID - what is the max number?


#7

Theoretically, it is as many as the number of postMappingId.

But in my sample data the max number of postMappingIds associated with on userIds is about millions. Roughly.


#8

One million or many million of postMappingID? Earlier you said total records are billions? How many billions? If one million postMappingID, and one billion records, implies same postMappingID is associated with up to 1000 userids max per your data structure. How many total userids actually? If postMappingIDs are in mllions, we have to design the data model differently. What I proposed will not work. We need some good bounds on these numbers to solve this problem.

Are you storing these records in RAM, SSD or HDD? What is the size of RAM and, if SSD or HDD, size of persistent storage you have? How many nodes in your cluster? The reason I ask is because you said you were limited in memory.

Finally how long do these records stay relevant? Is there a time to live -ie a time span after which they should be automatically flushed out of the database or they should be kept forever? If so, at what rate do you expect the data to grow? ie what data capacity should we shoot for?


#9

I can not tell directly which service it is (F***b***), I want to analyze all user reactions of the service. So we have a set structure with postMappingId and userId.

As you might expect, the number of postMappingId and userId will be very large, If the number of reactions itself has been around 60 days in the past, I expect that it will be between hundreds of millions and billions. There are not many NoSQL which is supported TTL and clustering for this size of data, so Aerospike is considered.

As you can see from the first question, the size of data, the number of nodes and so on are not a problem right now. (The service itself will be located in AWS and the instance type expects r3.2xlarge. Data storage will be stored in ram + ssd. data-in-memory true )

I am currently testing Aerospike, but the following two features are not directly supported:

  1. where “IN” clause

    • I want to retrieve all the data for several keys in the secondary index field. The IN method is not yet supported in spring-data-jpa, and the direct method is not supported in the aerospike client.
    • I wonder how to use it if lua module is available.
  2. Group By

    • There is no such thing as group by in aql.
    • Now we are calculating the group by directly using the java stream.

Except for the above two functions, the result of PoC to date is very satisfactory. Of course, it is different from RDBMS, but I hope that count (), group by, IN, NOT IN etc. are directly supported in aql etc.

In conclusion, what is most needed now In the secondary index field, how can I retrieve records for multiple keys at once?


#10
public Record[] batchGet(Key[] keys, String[] binNames) throws AerospikeException {
        Record[] records;
        records = client.get(client.getBatchPolicyDefault(), keys, binNames);
        return records;
    }

Above is java code, the client is AerospikeClient, if you can get the primary key, you don’t need to use secondary index field.


#11

In secondary index field Can I use aerspikeClient.batchGet() method?

Because The bin that I want to use can’t be primary key.


#12

for primary key, like that, records = client.get(client.getBatchPolicyDefault(), keys, binNames); the key is batch policy, for index field, you only can use query and filter instead of get, following is my implementation, not official, not cover all cases, just for suggestion.

   public RecordSet queryRecords(String namespace, String setName, String filterBin, String filterValue, String[] bins) {
        Statement statement = new Statement();
        if(bins != null) {
            statement.setBinNames(bins);
        }
        statement.setNamespace(namespace);
        statement.setSetName(setName);
        statement.setFilters(Filter.equal(filterBin, filterValue));
        return client.query(null, statement);
    }

    public RecordSet queryRecords(String namespace, String setName, String filterBin, long lhs, long rhs, String[] bins) {
        Statement statement = new Statement();
        if(bins != null) {
            statement.setBinNames(bins);
        }
        statement.setNamespace(namespace);
        statement.setSetName(setName);
        statement.setFilters(Filter.range(filterBin, lhs, rhs));
        return client.query(null, statement);
    }

#13

http://www.aerospike.com/apidocs/java/com/aerospike/client/AerospikeClient.html#query(com.aerospike.client.policy.QueryPolicy,%20com.aerospike.client.query.Statement) The official document, please check query, statement and filter


#14

This method seems to be able to retrieve a record for only one key. Is there any way to get all the records in a single call with multiple key conditions at once?


#15

it can get all bins or several bins , please check statement.setBinNames(bins); it seems not set that fields will return all bins, please check.

the limitation is currently only support filter one bin, in other words, like AQL or java client, that “where” only support one bin, but don’t support several columns, like where columnA = “XXXX” and columnB = 'YYYY"


#16

Thank you. I will try


#17

For example, it support in aql and client select * from ns.set where colA between 1 and 10 not support multiple conditions in different secondary indexes like following, select * from ns.set where colA = 5 and colB=7


#18

Sorry one more question.

The multiple key I mentioned is not column A and column B.

The multiple key is List<Column A> for secondary index.

In this case, is it possible to search by that method?

For example in SQL SELECT * FROM reaction WHERE userId IN (‘A’, ‘B’, ‘C’ … )


#19

http://www.aerospike.com/docs/guide/query.html if the bin is integer type, you can use range query aql> SELECT * FROM user_profile.west WHERE last_activity BETWEEN 340 AND 345

if the bin is String type, only support equal, the same as java client. http://www.aerospike.com/docs/client/java/usage/query/query.html


#20

Unfortunately, it is not an integer type and it is not sequential.

So now I’m querying one by one and processing it as a stream. The process is too slow because the size of the List is so large, even if processed asynchronously.