Aerospike for large objects - LDT & LLIST (looking for alternatives to MongoDB and S3)

llist
ldt

#1

Currently we use MongoDB and, mostly, S3 to store our large objects. Our data is structured in a key-value format, so I was looking for faster alternatives and came across Aerospike. I am satisfied with what I have read so far, but I still have some basic queries:

  1. How large an object does Aerospike support? (I have tried the Large List data type.) Does performance degrade as the object size increases?

  2. Will Aerospike be able to perform at ~50 million keys with these large objects (>15MB) as values, or are there precautions we will have to take? What is the largest production deployment of Aerospike? (How many nodes? How much data?)

  3. If I start an Aerospike node with, let’s say, a 100 GB disk and it is getting full, how does ‘adding a node’ (sharding) work in Aerospike?

  4. Will the backup utilities (asbackup, asrestore) work when the data is > 500GB?


#2

@ameykpatil,

Regarding your third question:

  1. There is no need for manual sharding. When you want to add capacity, simply add a node to the cluster; it will automatically rebalance to include the new node.

#3

If your namespace is configured to use SSD, each record is limited by the write-block-size (at most 1MB). If you’re using data-in-memory without persistence you can store larger objects. Be aware that such large objects will run into networking-related slowdowns, as they’re communicated between the client and the server cluster.

Using LList will not necessarily help, because a single object within the LList is still bound by the write-block-size. What happens with LList is that the list is implemented across multiple physical records (subrecords), but an object in the list can at most be contained in one physical record.

It makes more sense for you to chop your objects into multiple parts, each of which fits in a record, then use either batch reads or queries to fetch the components, combining them on the client-side. You can also chop them into Large Ordered List objects and assemble them in your application when you get the record.
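The chunking approach described above could be sketched roughly as follows. This is a pure-Python illustration, not Aerospike client code: the `store` dict stands in for a set of records, and the names `split_into_records`, `reassemble`, `CHUNK_SIZE`, the `object_id:n` key scheme and the manifest record are all assumptions made for the example.

```python
import hashlib

# Assumption: a 1 MiB write-block-size, with some headroom left for record overhead.
WRITE_BLOCK_SIZE = 1024 * 1024
CHUNK_SIZE = WRITE_BLOCK_SIZE - 4096  # hypothetical safety margin


def split_into_records(object_id, payload, chunk_size=CHUNK_SIZE):
    """Return {record_key: bins} holding the payload in chunks plus a manifest record."""
    chunks = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    # One record per chunk, keyed "<object_id>:<chunk number>".
    records = {f"{object_id}:{n}": {"data": chunk} for n, chunk in enumerate(chunks)}
    # The manifest tells the reader how many chunk records to fetch.
    records[object_id] = {
        "chunks": len(chunks),
        "size": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    return records


def reassemble(object_id, store):
    """Fetch the manifest, then all chunk records, and join them client-side."""
    manifest = store[object_id]
    keys = [f"{object_id}:{n}" for n in range(manifest["chunks"])]
    payload = b"".join(store[k]["data"] for k in keys)  # a batch read in practice
    if hashlib.sha256(payload).hexdigest() != manifest["sha256"]:
        raise ValueError("reassembled payload does not match manifest checksum")
    return payload
```

With a real cluster, the loop over `keys` would presumably become a single batch read, and the checksum guards against reading a mix of old and new chunks while the object is being rewritten.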

For (3), read the architecture and distribution article to understand how the cluster rebalances data automatically as it grows.

Regarding (4), asbackup can handle any namespace or set as long as you have enough disk space for the backup files it generates.

However, my real question is why you would store such incredibly large records in any database. Things such as files should live in a CDN or be served by webservers tuned to delivering files (stripped of scripting). Databases aren’t too efficient at this type of work.


#4

@Mnemaudsyne Thanks, I read about it on a weekend, now I am clear about Aerospike cluster & rebalancing.

@rbotzer I am aware of the write-block-size, but I think LList is best suited for my use case. Let me explain in brief. I want to store a user-id and the ids of that user’s followers, and the followers can number in the millions (as on Twitter). Currently I store them in S3, with some of the S3 files having size > 15MB. A single object within the LList will be just an id and not a large object, but the overall LList’s size might go beyond 15 MB. Also, with S3 I was fetching all the ids and then iterating over them for a lookup; with an Aerospike LList I guess I can look up directly without fetching all the ids. I have a feeling that I will greatly improve performance by replacing S3 with Aerospike LDTs for the above use case. Am I correct?

A follow-up question: some of the use cases in my application need a map-like structure. I have seen that the Java client supports LMap and LSet structures, but they are currently not as rich in functions as LList. I didn’t find any documentation about them on the Aerospike website.

  1. Are these structures we will get to see in future?
  2. Will they be enriched with all the functions like LList?
  3. Are these structures supposed to be of infinite length like LList?
  4. Why are they not documented if they are supported in server & java client?

#5

@ameykpatil,

Thanks for your post. Let me help clear up the confusion.

Back in February, we consolidated our Large Data Type (LDT) functionality. We used to have LLIST, LMAP, LSTACK and LSET. We kept LLIST, and the LLIST API remained as is. We decided to deprecate LMAP, LSTACK and LSET data types, and no longer support them, either in the server or in the clients.

Indeed, the functionality of LMAP, LSTACK and LSET can be achieved using only LLIST; thus, developers using any of these three types are urged to use LLIST instead.

In short, you can use the LLIST API for LMAP/ LSET.


#6

Hi. I still wouldn’t use LDT for this use case, or S3.

If you’re trying to track followers/following I would use a set named followers with two user ID bins in it, originid and followerid, and add as many denormalized bins as you want (username and follower name, for example). You then build a secondary index on originid, which lets you easily query for a user’s followers, and a secondary index on followerid to easily query which people a specific user is following.

Queries are going to be faster than getting an entire LDT’s worth of objects. If necessary you can then query for any user by their ID.
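To make the data model concrete, here is a small in-memory sketch of it. It does not use the Aerospike client: the two dicts of sets merely stand in for the secondary indexes on originid and followerid, and the class name, method names and the `originid:followerid` primary key are assumptions for illustration.

```python
from collections import defaultdict


class FollowerEdges:
    """One record per (originid, followerid) pair, with two lookup indexes."""

    def __init__(self):
        self.records = {}                    # primary key -> bins
        self.by_origin = defaultdict(set)    # stand-in for the index on originid
        self.by_follower = defaultdict(set)  # stand-in for the index on followerid

    def follow(self, originid, followerid, **extra_bins):
        key = f"{originid}:{followerid}"     # hypothetical composite primary key
        self.records[key] = {"originid": originid, "followerid": followerid, **extra_bins}
        self.by_origin[originid].add(key)
        self.by_follower[followerid].add(key)

    def unfollow(self, originid, followerid):
        key = f"{originid}:{followerid}"
        if self.records.pop(key, None) is not None:
            self.by_origin[originid].discard(key)
            self.by_follower[followerid].discard(key)

    def followers_of(self, originid):
        """Equivalent of a query on the originid index."""
        return sorted(self.records[k]["followerid"] for k in self.by_origin.get(originid, ()))

    def following(self, followerid):
        """Equivalent of a query on the followerid index."""
        return sorted(self.records[k]["originid"] for k in self.by_follower.get(followerid, ()))

    def is_follower(self, originid, followerid):
        """A single primary-key lookup, no index or scan needed."""
        return f"{originid}:{followerid}" in self.records
```

The composite key is what makes the "does X follow Y" check a plain key-value read; the trade-off is one record per follower relationship rather than one list per user.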


#7

This GitHub repo demonstrates how to use LLIST as a stack, map or queue: https://github.com/helipilot50/aerospike-LDT-techniques.git

But many LDT functions can be implemented with standard key value operations and composite keys, and/or queries. Be sure you are using the right tool for the right job.


#8

@ameykpatil,

We’ve made the request that the classes LSet, LStack, LMap be marked as deprecated in the Java client (see Issue #45 on the Java client repo). Thanks again for noticing the inconsistency.


#9

@Mnemaudsyne Thanks for clearing up all the doubts regarding the LMap/LSet structures. Going ahead with LList now.

@helipilot50 Thanks for the link, will check it out.

@rbotzer You gave another perspective on my problem. But with your solution there are going to be too many records or keys. (Imagine 50 million users with, on average, 1000 followers each.) Also, you missed the point that with LList I will never need to get all the followers in my application at once. I can simply maintain a cursor and fetch in batches as and when required, which was not possible with S3. If we say “getting an entire LDT” is not a requirement, is there any other reason for not using LDT?
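The cursor idea can be sketched like this. It is a pure-Python stand-in that pages through an ordinary sorted list with `bisect`; it does not call the LLIST API, and `fetch_page` and its parameters are hypothetical names.

```python
from bisect import bisect_right


def fetch_page(sorted_ids, cursor=None, page_size=1000):
    """Return (page, next_cursor) from a sorted list of follower ids.

    cursor is the last id returned by the previous call, or None to start.
    next_cursor is None once the list is exhausted.
    """
    # Resume just after the last id already seen; works even if that id was removed.
    start = 0 if cursor is None else bisect_right(sorted_ids, cursor)
    page = sorted_ids[start:start + page_size]
    next_cursor = page[-1] if start + page_size < len(sorted_ids) else None
    return page, next_cursor
```

Using the last seen value as the cursor, rather than a numeric offset, keeps paging stable when followers are added or removed between calls.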


#10

@rbotzer: Sorry for stealing the question, but maybe he meant something very simple. Seems to be worth a shot:

This sounds like you’ve got a perfect use case for an LList there! Simply have one LDT per user containing a list of all followers. LLists can grow to any size. Make sure to use as few bytes as possible (use integers and AS will automatically apply a compact format to them; read about MessagePack, or just assume a 32-bit int takes 5 bytes and a 64-bit integer takes 9 bytes). Then set up your LLists correctly! In your case, you can safely say that no item will ever be bigger than 9 bytes. You can read about max_key_size & max_object_size at http://www.aerospike.com/docs/guide/ldt_advanced.html .
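For reference, the 5-byte and 9-byte figures follow from the MessagePack integer formats (a one-byte type tag plus the value, with shorter forms for small numbers). A minimal sketch, assuming the encoder always picks the smallest format; the function names are made up for this example, and LDT bookkeeping overhead is not counted:

```python
def msgpack_int_size(n):
    """Bytes needed to encode integer n using the smallest MessagePack int format."""
    if n >= 0:
        if n <= 0x7F:
            return 1   # positive fixint
        if n <= 0xFF:
            return 2   # uint 8
        if n <= 0xFFFF:
            return 3   # uint 16
        if n <= 0xFFFFFFFF:
            return 5   # uint 32
        if n <= 0xFFFFFFFFFFFFFFFF:
            return 9   # uint 64
    else:
        if n >= -32:
            return 1   # negative fixint
        if n >= -0x80:
            return 2   # int 8
        if n >= -0x8000:
            return 3   # int 16
        if n >= -0x80000000:
            return 5   # int 32
        if n >= -0x8000000000000000:
            return 9   # int 64
    raise OverflowError("integer does not fit in 64 bits")


def estimate_list_bytes(ids):
    """Payload bytes for a list of ids, ignoring any per-list or per-subrecord overhead."""
    return sum(msgpack_int_size(i) for i in ids)
```

By this estimate, a million follower ids that each need the full 64-bit form come to about 9 MB of payload.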

LLists have no fixed maximum size.

I am assuming that every entry in the list will be just 1 user_id (not e.g. a map or list, which is more complex to store). You will still have a record per user and keep the metadata there (if possible), so when you iterate through the LList you will have to make batch reads to get e.g. the users’ profile picture URIs and such. That happens on the application side. If your common read will go and fetch that data for every entry in the LList, you might want to think about keeping a replica of that data in a map per LList entry (known as ‘denormalization’).

The main drawback of LLists seems to be that the current implementation needs multiple IOPS per read and write, and this gets worse the larger your list gets. This is why it’s very important to make the most of each “block”/“chunk”, alias “sub-record”, of the tree. The multiple IOPS lead to slightly worse latency and throughput characteristics (they are working on it), and yet I haven’t stumbled upon a use case that would hit the limit of a few thousand read/write ops/sec per LList. If I remember right, there was also a limit on how many clients could iterate through one instance of an LList in parallel (1? 2?). If you don’t need to do that in parallel you’ll be fine.

Cheers Manuel


#11

@ManuelSchmidt you can always steal my questions. :wink:


#12

Thanks a lot @ManuelSchmidt & @rbotzer :relieved:


#13

Guys, sorry to bug you one more time, but I have a small query. Can I have an LList kept in insertion order (the element inserted last will be at the last index) instead of a sorted list? Is there any option for that?


#14

Currently there is no such thing as an auto-incremented key. I had exactly the same problem, and ended up using a 64-bit timestamp as the key to satisfy the “unique key requirement”. It works very well, though it is not theoretically perfect (the time of insertion is used). The only alternative would be to have an “auto_id” bin alongside your list and use a UDF to insert into the list. The UDF can atomically increment the auto_id as well as insert into the large list. Performance is worse than plain inserts, however.
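A rough sketch of the timestamp-as-key idea, with one addition that is not part of the approach described above: a guard that bumps the key when two inserts land on the same nanosecond or the clock steps backwards. The class name and the injectable `clock` parameter are assumptions for illustration, and the guard only protects a single client process.

```python
import threading
import time


class InsertionOrderKeys:
    """Issue strictly increasing integer keys based on wall-clock nanoseconds."""

    def __init__(self, clock=time.time_ns):
        self._clock = clock          # injectable so the behaviour can be tested
        self._last = 0
        self._lock = threading.Lock()

    def next_key(self):
        with self._lock:
            candidate = self._clock()
            # On a collision or a backwards clock step, fall back to last + 1.
            self._last = candidate if candidate > self._last else self._last + 1
            return self._last
```

Because the list sorts by key, entries keyed this way come back in insertion order. Two separate clients can still produce the same timestamp, which is where the UDF with an auto_id bin would give a stronger guarantee at the cost of slower inserts.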

There used to be “large stacks” but they discontinued those data structures as they can be replaced with the large list.