I am using the latest AeroSpark connector to work with Spark ML. After inserting around 60M records into AeroSpike, read operations have become very slow. For example, fetching around 500K records from a set that contains 60M records takes AeroSpark ~30 minutes. In the htop output, AeroSpike uses only 7% of the CPU.
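To put those numbers in perspective, here is a rough back-of-the-envelope calculation of the throughput they imply. It is only a sketch: it assumes the ~500K records and ~30 minutes above are representative, and the constant and function names are made up for illustration.

```python
# Figures taken from the description above (approximate).
FETCHED_RECORDS = 500_000     # records returned by one read
ELAPSED_SECONDS = 30 * 60     # ~30 minutes
TOTAL_RECORDS = 60_000_000    # records in the set


def records_per_second(records: int, seconds: float) -> float:
    """Average read throughput over the whole operation."""
    return records / seconds


rate = records_per_second(FETCHED_RECORDS, ELAPSED_SECONDS)

# Hypothetical: how long reading the whole set would take at the same rate.
full_read_hours = TOTAL_RECORDS / rate / 3600

print(f"observed throughput: {rate:.0f} records/s")
print(f"full set at that rate: {full_read_hours:.0f} hours")
```

That works out to under 300 records per second, which together with the low CPU usage is why I suspect the read is not running in parallel.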
How can I speed up read operations? AeroSpark seems to be working on a single thread only; how can I parallelize this job? Any suggestions?
AeroSpike conf:
memory-size 8G # Maximum memory allocation for data and
default-ttl 30d
storage-engine device { # Configure the storage-engine to use
    file /vol/rmla.data # Location of data file on server.
    filesize 900G # Max size of each file in GiB.
}