Aerospike connectors to ingest data from Aerospike into machine learning / statistical tools like R

r

#1

Hi,

I am not able to find any connectors to ingest data from Aerospike into R to run statistical and predictive analytics algorithms.

Is there a way to ingest data from Aerospike to perform predictive analytics and other statistical work?

Thanks & Regards, Samar


#2

Hi Samar,

The ingest will depend on which toolchain you’re using.

We have a repo with some Hadoop (and thus Spark) connectors, along with some basic operations built on them. This includes some Spark analytics examples on Aerospike.

This tooling will also allow you to easily get data from Aerospike to HDFS / HBase / Hadoop, as well as to run MapReduce jobs on Aerospike data without “ingest”.

A guy named Sasha published a nice Spark RDD example for Aerospike.

We have published a Storm client integration. It creates both spouts and bolts that read and write from Aerospike.

We have published an example real-time recommendation engine as a standalone example.

A gentleman was doing some predictive Caltrans / traffic analytics with a great tool called Dato (formerly GraphLab) and Aerospike, but I can’t find his DevWeek talk online. See http://dato.com

We have not done integration with R clients. I’m fond of the R language for similar small-data processing (limited to in-memory quick jobs), but we haven’t done an integration. As we have a C client and a Java client, both open source, I would expect that anyone who wanted to port/publish the connector would have a reasonably easy time of it.
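In the meantime, one low-tech route into R is to dump records to a CSV file and load that with read.csv. Here is a minimal Python sketch of the export step. It assumes records arrive as (key, meta, bins) tuples, which is the shape I would expect from a scan with the Aerospike Python client, but please check that against the client docs. The sample data and the records_to_csv helper are made up for illustration, and no server is contacted.

```python
import csv
import io


def records_to_csv(records, out):
    """Write (key, meta, bins) tuples as CSV rows and return the column names."""
    records = list(records)
    # Bins may differ between records, so collect the union of bin names.
    columns = sorted({name for _, _, bins in records for name in bins})
    # Missing bins are written as "NA", which R's read.csv treats as missing.
    writer = csv.DictWriter(out, fieldnames=["pk"] + columns, restval="NA")
    writer.writeheader()
    for key, _meta, bins in records:
        # Assumption: key is (namespace, set, user key, digest).
        row = {"pk": key[2]}
        row.update(bins)
        writer.writerow(row)
    return ["pk"] + columns


# Hypothetical sample data standing in for the results of a scan.
sample = [
    (("test", "users", "u1", None), {"ttl": 0, "gen": 1}, {"age": 31, "score": 0.82}),
    (("test", "users", "u2", None), {"ttl": 0, "gen": 1}, {"age": 45}),
]

buf = io.StringIO()
header = records_to_csv(sample, buf)
csv_text = buf.getvalue()
```

Writing to a real file instead of the in-memory buffer gives you something R can load with read.csv("users.csv"). This only suits data that fits in memory on the R side, which matches the small-data jobs mentioned above.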

Let me know what tooling you’re using, and perhaps I can be more specific.


#3

Hello,

Thanks for your detailed reply. We will not be using Spark. Instead, we will be using Storm with Aerospike, and it’s great that you have already made connectors available for Storm. Maybe we could feed the data from Aerospike into the Storm Trident ML library. Thanks for the dato.com tip. It was a very interesting read.

I think the speed of Aerospike and the availability of streaming UDFs could make for a use case of building some analytics on our own. But for hardcore machine learning algorithms, which perhaps require offline processing, we might use the Trident library.

Please let me know your thoughts on this.

Many thanks, Samar


#5

I strongly believe using Aerospike as the “temporary store” for streaming work is a great case. Many of those algorithms should use a shared store for temporary data instead of machine-local storage, because with machine-local storage, if a machine crashes you’ve lost a lot of state.
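To make the crash argument concrete, here is a small Python sketch. SharedStore is a hypothetical in-memory stand-in for a shared key-value store (it is not the Aerospike client API), and CountingWorker is an invented example of a streaming task that keeps its running totals there rather than in its own memory.

```python
class SharedStore:
    """In-memory stand-in for a shared key-value store (hypothetical interface)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class CountingWorker:
    """Counts events per key, keeping running totals in the shared store."""

    def __init__(self, store):
        self.store = store

    def process(self, event_key):
        # State is read from and written back to the store on every event,
        # so nothing important lives only in this worker's memory.
        total = self.store.get(event_key, 0) + 1
        self.store.put(event_key, total)
        return total


store = SharedStore()
worker = CountingWorker(store)
for key in ["a", "b", "a"]:
    worker.process(key)

# Simulate a crash: the worker object is lost, the store is not.
del worker
replacement = CountingWorker(store)
total_after_restart = replacement.process("a")  # continues from 2, giving 3
```

The replacement worker picks up the count where the crashed one left off. The read-then-write in process is not atomic, so with several workers updating the same key you would want an atomic increment on the store side.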

Which framework to use is a much harder question, and is determined by a lot of factors. The Trident library for doing exactly-once transactions has a number of pluses and minuses, and I’d suggest really testing it at your performance level. You might also want to look at Akka.
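For a feel of what exactly-once costs, here is a rough Python sketch of the general idea behind transactional state as I understand it: store the id of the last applied batch next to the value, and skip a batch whose id is already recorded. This is an illustration of the technique, not Trident's actual API. TransactionalCounter and its method names are invented, and it assumes batches are applied in order and that a replayed batch has the same contents as the original.

```python
class TransactionalCounter:
    """Applies each batch at most once by remembering the last batch id per key."""

    def __init__(self):
        self._state = {}  # key -> (last_txid, count)

    def apply_batch(self, txid, key, increment):
        last_txid, count = self._state.get(key, (None, 0))
        if last_txid == txid:
            # This batch was already applied; a replay must not double count.
            return count
        count += increment
        # Value and batch id are stored together in a single write.
        self._state[key] = (txid, count)
        return count


counter = TransactionalCounter()
counter.apply_batch(1, "clicks", 5)
counter.apply_batch(2, "clicks", 3)
# Batch 2 is replayed after a simulated failure; the total stays at 8.
replayed = counter.apply_batch(2, "clicks", 3)
```

Every update carries the extra bookkeeping of reading and writing the batch id, and batches have to be committed in order, which is part of why it is worth measuring at your own throughput.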