Aerospike set to CSV file

What's the best way to dump an Aerospike table/set to a CSV file? Assume the set has a very simple schema: PK, key (a duplicate of PK, since the PK may not be stored with the data), value.

Thanks. Ming

The key is always stored if the client writes with the sendKey policy.
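For example, with the Java client (usable from Scala as well), a minimal sketch; the host, namespace, set, key and bin are placeholders:

    import com.aerospike.client.{AerospikeClient, Bin, Key}
    import com.aerospike.client.policy.WritePolicy

    val client = new AerospikeClient("127.0.0.1", 3000)
    val policy = new WritePolicy()
    policy.sendKey = true // store the user key with the record, not just its digest

    // "id-1" is now retrievable as the key on reads and scans
    client.put(policy, new Key("test", "demo", "id-1"), new Bin("value", 42))
    client.close()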

There currently isn’t a tool to export a set to a CSV file. You would need to scan the set and have the client serialize to CSV.
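A minimal sketch of that with the Java client from Scala (host, namespace, set and bin name are assumptions, and real code should also handle CSV quoting/escaping):

    import java.io.PrintWriter
    import com.aerospike.client.AerospikeClient
    import com.aerospike.client.policy.ScanPolicy

    val client = new AerospikeClient("127.0.0.1", 3000)
    val out = new PrintWriter("demo.csv")

    // scanAll invokes the callback once per record in the set; the callback
    // may run on multiple scan threads, but PrintWriter.println is synchronized
    client.scanAll(new ScanPolicy(), "test", "demo",
      (key, record) => {
        // key.userKey is only populated if the record was written with sendKey
        val pk = Option(key.userKey).map(_.toString).getOrElse("")
        out.println(s"$pk,${record.getValue("value")}")
      })

    out.close()
    client.close()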

If you have the Aerospike Spark connector, you can do this by creating a DataFrame on top of the Aerospike set, reading the data into Spark, and exporting it in CSV format to a file in HDFS.

     spark.conf.set("aerospike.seedhost", dbHost)
     spark.conf.set("aerospike.port", dbPort)
     spark.conf.set("aerospike.namespace",namespace)
     spark.conf.set("aerospike.set", dbSet)
     spark.conf.set("aerospike.keyPath", "/etc/aerospike/features.conf")
     spark.conf.set("aerospike.user", dbConnection)
     spark.conf.set("aerospike.password", dbPassword)

     val sqlContext = spark.sqlContext

    // create DataFrame on top of Aerospike set
    val dfBatchRead = sqlContext.read.
      format("com.aerospike.spark.sql").
      option("aerospike.batchMax", 10000).
      load

This returns the following:

scala>    val dfBatchRead  = sqlContext.read.
     |       format("com.aerospike.spark.sql").
     |       option("aerospike.batchMax", 10000).
     |       load
dfBatchRead: org.apache.spark.sql.DataFrame = [__key: string, __digest: binary ... 7 more fields]

// Show the schema
scala> dfBatchRead.printSchema
root
 |-- __key: string (nullable = true)
 |-- __digest: binary (nullable = false)
 |-- __expiry: integer (nullable = false)
 |-- __generation: integer (nullable = false)
 |-- __ttl: integer (nullable = false)
 |-- price: double (nullable = true)
 |-- rowkey: string (nullable = true)
 |-- timeissued: string (nullable = true)
 |-- ticker: string (nullable = true)

// Select only the bins that you want and show the first two records as an example

scala> dfBatchRead.select('price, 'rowkey, 'timeissued, 'ticker).take(2).foreach(println)
[295.95,06a41fb4-5215-4dad-8f99-826c7ab34f6f,2019-07-02T23:00:02,MRW]
[266.1,12b085ea-1127-48fe-b5fa-0c7947b7d4a7,2019-07-04T09:48:52,VOD]

// Save it to a CSV file in HDFS. coalesce(1) creates a single partition (one output file)

dfBatchRead.select('price, 'rowkey, 'timeissued, 'ticker).coalesce(1).write.csv("/tmp/test.csv")
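If you want a header row in the output, Spark's CSV writer also accepts a header option:

    dfBatchRead.select('price, 'rowkey, 'timeissued, 'ticker).
      coalesce(1).
      write.option("header", "true").
      csv("/tmp/test.csv")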

Get it from the edge node. Spark creates a directory called test.csv in the file system, containing a part file with a long generated name:

hdfs dfs -get /tmp/test.csv

cd test.csv

mv part-00000-dc6b5d62-cad1-467e-9440-fc0ebfaa5467-c000.csv test.csv

head test.csv
38.23,dde599ee-6aaf-4621-bbd6-a79fd510b072,2019-07-02T17:51:08,SAP
489.54,ec247768-0c55-465b-8304-c96c5d3c7874,2019-07-02T09:32:29,BP
324.38,d012cffd-2f3a-418a-b8e2-2176f703c0d2,2019-07-02T15:03:53,MRW
22.36,fd7ca950-54cc-49dd-97a7-496d9cf7e84d,2019-07-02T19:05:15,MSFT
33.09,ef7d1fde-dc86-460f-8929-d3ec918c0364,2019-07-02T14:42:04,ORCL
274.5,2483f9ab-337b-4bc9-a8f8-797759db2a97,2019-07-02T19:31:41,SBRY

Here you are using Spark as an ETL tool. You can even migrate tables/collections from another database into an Aerospike set, as sketched below.
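For instance, writing a DataFrame back into an Aerospike set with the same connector could look like this (a sketch: aerospike.updateByKey names the column to use as the primary key, the target namespace/set come from the spark.conf settings above, and the column names are assumptions):

    import org.apache.spark.sql.SaveMode

    dfBatchRead.select('price, 'rowkey, 'timeissued, 'ticker).
      write.
      mode(SaveMode.Overwrite).
      format("com.aerospike.spark.sql").
      option("aerospike.updateByKey", "rowkey").
      save()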

HTH

Thanks kporter and Mich_Talebzadeh.

For now I am using scan, and maybe Spark later.

I checked the online documentation at: https://www.aerospike.com/docs/client/python/usage/scan/scan.html

I tried the following steps and it worked:

    scan = client.scan('test', 'demo')
    scan.select('name', 'age')
    scan.foreach(print_result)

Can I limit the number of records to scan, especially during testing and development? Something like:

    select * from test.demo limit 1000

Thanks.

I am not that familiar with Python here. However, in Scala you can do this using a DataFrame and functional programming:

DataFrame.take(1000).foreach(println)

For example:

DataFrame.select("name", "age").orderBy("age").take(1000).foreach(println)

Maybe there is something similar in Python?

HTH