What’s the best way to dump an Aerospike table/set to a CSV file? Assume the set has a very simple schema: PK, key (a duplicate of the PK, since the PK may not be stored with the data), and value.
Thanks. Ming
The key is always stored if the client writes with the sendKey policy enabled.
There currently isn’t a tool to export a set to a CSV file. You would need to scan the set and have the client serialize each record to CSV.
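For example, here is a minimal sketch using the Python client, assuming a namespace test, a set demo, and a single bin named value (the seed host, file name, and bin names are placeholders):

import csv
import aerospike

# Placeholder seed host; adjust for your cluster
config = {"hosts": [("127.0.0.1", 3000)]}
client = aerospike.client(config).connect()

with open("demo.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["key", "value"])  # header row

    def dump_record(record):
        # Each record is a (key, meta, bins) tuple; key is
        # (namespace, set, user_key, digest), and user_key is only
        # populated if the record was written with the sendKey policy
        key, meta, bins = record
        writer.writerow([key[2], bins.get("value")])

    scan = client.scan("test", "demo")
    scan.foreach(dump_record)

client.close()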
If you have the aerospike-spark connector, you can do that by creating a Spark DataFrame on top of the Aerospike set, reading the data into Spark, and exporting it in CSV format to a file in HDFS:
spark.conf.set("aerospike.seedhost", dbHost)
spark.conf.set("aerospike.port", dbPort)
spark.conf.set("aerospike.namespace",namespace)
spark.conf.set("aerospike.set", dbSet)
spark.conf.set("aerospike.keyPath", "/etc/aerospike/features.conf")
spark.conf.set("aerospike.user", dbConnection)
spark.conf.set("aerospike.password", dbPassword)
val sqlContext = spark.sqlContext

// Create a DataFrame on top of the Aerospike set
val dfBatchRead = sqlContext.read.
  format("com.aerospike.spark.sql").
  option("aerospike.batchMax", 10000).
  load
This returns the following:
scala> val dfBatchRead = sqlContext.read.
| format("com.aerospike.spark.sql").
| option("aerospike.batchMax", 10000).
| load
dfBatchRead: org.apache.spark.sql.DataFrame = [__key: string, __digest: binary ... 7 more fields]
// Show the schema
scala> dfBatchRead.printSchema
root
|-- __key: string (nullable = true)
|-- __digest: binary (nullable = false)
|-- __expiry: integer (nullable = false)
|-- __generation: integer (nullable = false)
|-- __ttl: integer (nullable = false)
|-- price: double (nullable = true)
|-- rowkey: string (nullable = true)
|-- timeissued: string (nullable = true)
|-- ticker: string (nullable = true)
// Select only the bins you want and show the first two records as an example
scala> dfBatchRead.select('price, 'rowkey, 'timeissued, 'ticker).take(2).foreach(println)
[295.95,06a41fb4-5215-4dad-8f99-826c7ab34f6f,2019-07-02T23:00:02,MRW]
[266.1,12b085ea-1127-48fe-b5fa-0c7947b7d4a7,2019-07-04T09:48:52,VOD]
// Save it to a CSV file in HDFS. coalesce(1) means create only one partition (a single output file)
dfBatchRead.select('price, 'rowkey, 'timeissued, 'ticker).coalesce(1).write.csv("/tmp/test.csv")
Get it from the edge node. Spark creates a directory called test.csv in HDFS containing a part file with a long generated name:
hdfs dfs -get /tmp/test.csv
cd test.csv
mv part-00000-dc6b5d62-cad1-467e-9440-fc0ebfaa5467-c000.csv test.csv
head test.csv
38.23,dde599ee-6aaf-4621-bbd6-a79fd510b072,2019-07-02T17:51:08,SAP
489.54,ec247768-0c55-465b-8304-c96c5d3c7874,2019-07-02T09:32:29,BP
324.38,d012cffd-2f3a-418a-b8e2-2176f703c0d2,2019-07-02T15:03:53,MRW
22.36,fd7ca950-54cc-49dd-97a7-496d9cf7e84d,2019-07-02T19:05:15,MSFT
33.09,ef7d1fde-dc86-460f-8929-d3ec918c0364,2019-07-02T14:42:04,ORCL
274.5,2483f9ab-337b-4bc9-a8f8-797759db2a97,2019-07-02T19:31:41,SBRY
Here you are using Spark as an ETL tool. You can even migrate tables/collections from another database into an Aerospike set, as sketched below.
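For instance, a rough PySpark sketch of loading a table from another database over JDBC and writing it into the Aerospike set (jdbcUrl, the table, and the key column are placeholders, and the aerospike.updateByKey option, which names the primary-key column, should be verified against your connector version):

# Assumes spark.conf is already set with aerospike.seedhost, aerospike.port,
# aerospike.namespace and aerospike.set as above.

# Load a table from another database over JDBC (jdbcUrl is a placeholder)
df = spark.read.format("jdbc") \
    .option("url", jdbcUrl) \
    .option("dbtable", "source_table") \
    .load()

# Write it into the Aerospike set, using the "rowkey" column as the primary key
df.write \
    .mode("overwrite") \
    .format("com.aerospike.spark.sql") \
    .option("aerospike.updateByKey", "rowkey") \
    .save()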
HTH
Thanks kporter and Mich_Talebzadeh.
For now I am using scan first, and maybe Spark later.
I checked the online documentation at: https://www.aerospike.com/docs/client/python/usage/scan/scan.html
I tried the following steps and it worked:

scan = client.scan('test', 'demo')
scan.select('name', 'age')
scan.foreach(print_result)
Can I limit the number of records scanned, especially during testing and development? Something like:

SELECT * FROM test.demo LIMIT 1000
Thanks.
I am not that familiar with Python here. However, in Scala you can do it with the DataFrame API and functional programming (take(1000) fetches only the first 1000 rows to the driver, so there is no need to call collect first):

df.take(1000).foreach(println)
For example:
df.select("name", "age").orderBy("age").take(1000).foreach(println)
Maybe there is something similar in Python?
HTH
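For reference, in the Python client one way to approximate a LIMIT is to return False from the foreach callback, which aborts the scan. A minimal sketch, assuming the same test/demo set and bins as above (the cutoff is approximate, since records stream in from multiple nodes in parallel):

import aerospike

config = {"hosts": [("127.0.0.1", 3000)]}  # placeholder seed host
client = aerospike.client(config).connect()

records = []
LIMIT = 1000

def collect_limited(record):
    # record is a (key, meta, bins) tuple
    key, meta, bins = record
    records.append(bins)
    if len(records) >= LIMIT:
        return False  # returning False from the callback aborts the scan

scan = client.scan('test', 'demo')
scan.select('name', 'age')
scan.foreach(collect_limited)

print(len(records))
client.close()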