Distributed version control system?

spark

#1

Hello.

Do you know of any “version control system” or sync tool that can work transparently on files and folders stored on a distributed file system or database (such as Aerospike)? Or is there a function like that already included with Aerospike?


#2

Aerospike isn’t a distributed file system, it’s primarily a key-value store.

It seems like you’re trying to combine lots of layers into 1 thing: what exactly are you trying to do?


#3

All right, I didn’t use the proper words; let’s say distributed database. I wanted a generic term that covered things such as Hadoop, Aerospike… but mainly Aerospike.


#4

What are you actually trying to do? It would help if you could describe a use-case as what you’re asking is still too vague/confusing.

Note: Hadoop isn’t a database either, it’s an entire framework for data.


#5

I need to find a framework for scientific computing. I’ve thought of Aerospike + Spark, and I thought it would be a good idea to have some kind of “version control system” for the data stored on that system. Maybe I was wrong in thinking of Aerospike as a file system.


#6

Would it be possible to get the format and size of the data you are trying to store? Also, how many previous versions of the data would need to be stored? Aerospike is an in-memory key-value / row-store database. It supports multi-bin or single-bin records. It’s possible that a system could be designed to use an Aerospike multi-bin record configuration to store the scientific data and reserve one bin for the version/created date. Please see the following for an overview of Aerospike:

http://www.aerospike.com/docs/architecture/data-model.html

and info on working with Spark:
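
Returning to the versioning-bin idea above, here is a minimal sketch of what "reserve one bin for the version/created date" could look like. It is an illustration only: a plain Python dictionary stands in for an Aerospike set, each record is a dict of bin name to value, and the names `store`, `put_versioned`, `version` and `created` are hypothetical, not part of any Aerospike API.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for a set of records:
# maps a key to a record, where a record is a dict of bin name -> value.
store = {}

def put_versioned(key, data_bins):
    """Write the data bins plus two reserved bins: 'version' and 'created'."""
    previous = store.get(key)
    # Start at version 1, otherwise bump the version stored in the reserved bin.
    version = previous["version"] + 1 if previous else 1
    record = dict(data_bins)
    record["version"] = version
    record["created"] = datetime.now(timezone.utc).isoformat()
    store[key] = record  # overwrites the previous version of the record
    return record

first = put_versioned("sample-001", {"temperature_c": 21.5, "station": "A"})
second = put_versioned("sample-001", {"temperature_c": 22.0, "station": "A"})
```

Note that this sketch only keeps the latest version of each record. If previous versions have to be retained, one option (again an assumption, not something Aerospike does for you) would be to include the version number in the key.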


#7

It depends on what you mean by “scientific computing.” For instance, a lot of scientific data is actually stored in raw form not in Hadoop HDFS, but in the scientific Hierarchical Data Format 5 (HDF5). (q.v. What is HDF5?)

You could then use Aerospike by itself, or Aerospike + Spark as a metadata store, using it as a key-value to swiftly analyze or look up data mapped to the HDF5 system. HDF5 is typically used for very large “data lakes” of raw data, similar to Hadoop. HDF5 has its origins in research and science (physics, biomedical BioHDF, etc.), whereas Hadoop is traditionally considered more of an enterprise data management system.

It is quite possible to use Aerospike as a fast metadata analytics system, whereas the “raw” data, if it consists of large datasets (as in petabytes), probably works best in a persisted-to-disk data system. Again, you can pick your poison in that regard: either HDFS as part of Hadoop, or HDF5 if you want to use a different set of tools and apps on top of it.
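
To make the data-store/metadata-store split a bit more concrete, here is a rough sketch of the lookup side. A plain Python dictionary plays the role of the key-value metadata store, and each entry only records where the raw data lives. The file paths, dataset paths, attribute names and the functions `register_dataset` and `find_datasets` are all made up for illustration; no Aerospike or HDF5 library call is used.

```python
# Hypothetical metadata index: key -> location of the raw data plus attributes.
metadata_index = {}

def register_dataset(key, hdf5_file, dataset_path, **attributes):
    """Record where a raw dataset lives and a few searchable attributes."""
    metadata_index[key] = {"file": hdf5_file, "dataset": dataset_path, **attributes}

def find_datasets(**criteria):
    """Return the sorted keys of entries whose attributes match all criteria."""
    return sorted(
        key
        for key, meta in metadata_index.items()
        if all(meta.get(name) == value for name, value in criteria.items())
    )

register_dataset("run-1", "/data/weather_2015.h5", "/stations/A/temp", kind="weather", year=2015)
register_dataset("run-2", "/data/genome.h5", "/chr1/reads", kind="genomics", year=2015)
register_dataset("run-3", "/data/weather_2016.h5", "/stations/A/temp", kind="weather", year=2016)

weather_keys = find_datasets(kind="weather")
recent_weather = find_datasets(kind="weather", year=2016)
```

The point of the design is that the large raw files are never copied into the key-value store; only the small, frequently queried description of each dataset is.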

What kind of dataset are you working with, if you don’t mind my asking?


#8

I’ll perform statistical analysis and data mining on bioinformatics and weather data. I thought Aerospike + Spark would be enough (or other suggestions such as SciDB, Hadoop or whatever). I didn’t know I would also need a file system like HDF5.


#9

A lot depends on the nature of your data. For instance, if there are large binaries (say, video or high-resolution images), you would do best to keep them in native format. Or would you be doing something like translating human chromosomes or weather sensor data into key-value pairs? The latter is doable with Aerospike. The former would lend itself to some sort of data-store/metadata store architecture.
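
As a rough illustration of the key-value route, one weather sensor reading might be turned into a key plus a set of bins along these lines. The key layout (station ID and timestamp joined with a colon), the bin names and the function `reading_to_record` are assumptions made up for this sketch; a real schema would depend on how you need to query the data.

```python
def reading_to_record(station_id, timestamp, measurements):
    """Turn one sensor reading into a (key, bins) pair."""
    # Hypothetical key layout: one record per station per timestamp.
    key = f"{station_id}:{timestamp}"
    bins = dict(measurements)      # one bin per measured quantity
    bins["station"] = station_id   # repeated in the bins so it can be filtered on
    bins["ts"] = timestamp
    return key, bins

key, bins = reading_to_record(
    "KSFO", "2015-06-01T12:00:00Z", {"temp_c": 18.2, "humidity": 0.71}
)
```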


#10

If you want, give me a call - 650-906-3134 (or email pcorless@aerospike.com). I’d be interested in chatting with you about the nature of your project.


#11

If I do end up starting the project at my university, I’ll let you know. Anyway, I’ve just found that Aerospike Community Edition only accepts up to 2 nodes. Is that right? In other docs you say “unlimited number of servers”.


#12

That doesn’t sound right. You should be able to have more than 2 nodes. Can you explain what you were doing (reading, trying to enter a command) when you came to that conclusion?


#13

Companies that want to experience the benefits of the Aerospike real-time database now have three options. The Aerospike Community Edition is a free unlimited license designed for environments that only need to support a single cluster of up to two nodes within one data center.

The Aerospike Enterprise Edition is an unlimited commercial license. It is designed for enterprises that want no limits on the amount of storage or number of data centers, clusters and nodes they can deploy.

Maybe I’m wrong and this refers to something different. Maybe the important concept for me is cluster. How many clusters can I deploy with the free edition?


#14

Ah! Good catch. The Community Edition is no longer node-limited. Or, I should say, it is limited the same way as the Enterprise Edition, which is limited to 128 - 1, or 127 nodes. I’ve made a note updating the press release.


#15

Also, with the free edition, you can deploy as many clusters as you want, but they would be independent clusters.

To do peer-to-peer cluster updates using Cross Datacenter Replication (XDR), you’d need the Enterprise Edition.

If you are a small startup and need the Enterprise Edition, you can get it for free under the Startup Special program.


#16

I just need the Community version to do some tests, without cluster replication.

Regards.