It looks like restarting a node makes all partitions migrate

From old forum post by sunguck.lee » Tue Jul 29, 2014 7:41 pm

Hi Young.

At the bottom of your article (“How can I tell when a migration is finished?”, https://forums.aerospike.com/viewtopic.php?f=13&t=83), you said

“N is the number of partitions the server is currently sending”

Is that correct? Don’t you mean “receiving” rather than “sending”? Later you said “This is not the number of receives …”, so I was just wondering.

And one more question about the Aerospike migration process. I have a 10-node cluster, and each node has 90 GB of SSD data and 15 GB of memory data.

This Aerospike cluster (every node in the cluster, not only the restarted one) is still migrating data two hours after restarting one node. There are no put/get operations; the cluster is otherwise idle. During migration, each node’s disk read IOPS is over 5,000, with no disk writes. And the migration progress never changes (it never goes down; it looks stuck at 310 ~ 370). Is this normal?

And if I restart one node, will the Aerospike cluster redistribute all partitions (about 4096 partitions) across all nodes in the cluster? I would expect it to be enough to move just a few partitions from each node to the restarted node.

Monitor> info
===NODES===
2014-07-29 15:19:26.135428
Sorting by IP, in Ascending order:
ip:port Build Cluster Cluster Free Free Migrates Node Principal Replicated Sys
. Size Visibility Disk Mem . ID ID Objects Free
. . . pct pct . . . . Mem
test041 3.3.9 10 true 72 45 (340,0) BB9C246077AC40C BB9E044077AC40C 228,223,701 54
test042 3.3.9 10 true 71 42 (395,1) BB91045077AC40C BB9E044077AC40C 240,574,968 51
test043 3.3.9 10 true 73 46 (332,1) BB96846077AC40C BB9E044077AC40C 223,898,443 54
test044 3.3.9 10 true 71 42 (358,1) BB9E044077AC40C BB9E044077AC40C 240,589,286 50
test045 3.3.9 10 true 72 44 (350,1) BB96A46077AC40C BB9E044077AC40C 234,250,380 52
test046 3.3.9 10 true 71 42 (372,1) BB90446077AC40C BB9E044077AC40C 241,421,654 50
test047 3.3.9 10 true 73 46 (333,1) BB92807077AC40C BB9E044077AC40C 223,086,577 54
test048 3.3.9 10 true 69 39 (395,2) BB91646077AC40C BB9E044077AC40C 252,037,486 49
test049 3.3.9 10 true 73 46 (337,1) BB9B046077AC40C BB9E044077AC40C 224,203,341 54
test050 3.3.9 10 true 70 41 (350,1) BB99046077AC40C BB9E044077AC40C 243,440,002 50
Number of nodes displayed: 10


===NAMESPACE===
Total (unique) objects in cluster for perfdb : 1,175,862,919
Note: Total (unique) objects is an under estimate if migrations are in progress.


ip/namespace Avail Evicted Master Repl Stop Used Used Used Used hwm hwm
Pct Objects Objects Factor Writes Disk Disk Mem Mem Disk Mem
. . . . . . % . % . .
test048/perfdb 69 0 127,449,744 2 false 90.14 G 31 15.02 G 61 50 60
test050/perfdb 70 0 119,141,229 2 false 87.06 G 30 14.51 G 59 50 60
test042/perfdb 71 0 128,614,336 2 false 86.04 G 29 14.34 G 58 50 60
test044/perfdb 71 0 118,573,313 2 false 86.04 G 29 14.34 G 58 50 60
test046/perfdb 71 0 116,822,774 2 false 86.34 G 29 14.39 G 58 50 60
test041/perfdb 72 0 113,400,242 2 false 81.62 G 28 13.60 G 55 50 60
test045/perfdb 72 0 113,969,742 2 false 83.77 G 28 13.96 G 56 50 60
test047/perfdb 73 0 111,963,028 2 false 79.78 G 27 13.30 G 54 50 60
test043/perfdb 73 0 113,972,609 2 false 80.07 G 27 13.35 G 54 50 60
test049/perfdb 73 0 111,955,902 2 false 80.18 G 27 13.36 G 54 50 60
Number of rows displayed: 10

From old forum post by anshu » Tue Jul 29, 2014 11:58 pm

Hi sunguck.lee,

“N is the number of partitions the server is currently sending”

Is that correct? Don’t you mean “receiving” rather than “sending”? Later you said “This is not the number of receives …”, so I was just wondering.

Yes, you are correct. Thanks for catching it. We will update it.

This Aerospike cluster (every node in the cluster, not only the restarted one) is still migrating data two hours after restarting one node. There are no put/get operations; the cluster is otherwise idle. During migration, each node’s disk read IOPS is over 5,000, with no disk writes. And the migration progress never changes (it never goes down; it looks stuck at 310 ~ 370). Is this normal?

Yes, migrations can take a long time depending on the amount of data in your cluster. We deliberately throttle migrations so that they do not impact normal production operations. The disk reads happen as we read the data in order to migrate it over to the other nodes. There will be writes as well, but writes are batched, so you will see fewer writes than reads. You can increase the migration rate if you understand your production load:

http://www.aerospike.com/docs/operation … r-decrease

You should keep a close eye on your production stats if you are changing the migration speed.
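
As a rough sketch of what speeding migrations up looks like on this era of server (parameter names and safe values vary by version, so check the tuning page above against your own build first):

# List the current migration-related settings in the service context
asinfo -v "get-config:context=service" | tr ';' '\n' | grep migrate

# Example: raise the number of migrate threads dynamically (run this on every node)
asinfo -v "set-config:context=service;migrate-threads=4"

The other migrate-* knobs listed on the tuning page can be changed the same way; watch your read/write latencies while migrations run at the higher rate.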

And if I restart one node, will the Aerospike cluster redistribute all partitions (about 4096 partitions) across all nodes in the cluster? I would expect it to be enough to move just a few partitions from each node to the restarted node.

Yes, a restart of any one node will cause migrations on all nodes. This is to ensure that the cluster always has all the required data available. The overhead across the various scenarios (a node never comes back, a new node with no data is added, etc.) is lowest if we follow the same process in every case, i.e. a complete data migration on any cluster-state change. We do not know whether the data on the existing nodes is still intact, so we do a complete data migration to ensure the entire data set is available as expected, even if it is the same node that comes back and rejoins.
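
If you want to follow the same per-node progress numbers outside of asmonitor (the “(N,M)” pairs in the Migrates column), the underlying counters can be pulled with asinfo; the exact stat names can differ between server versions, so verify them on your build:

# Outgoing/incoming partition migrations still pending on one node (test041 taken from the output above)
asinfo -h test041 -v "statistics" | tr ';' '\n' | grep migrate_progress

Migrations are finished when these counters reach zero on every node.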

Let us know if you have any more questions.


Post by Hanson » Wed Jul 30, 2014 1:09 am

For a 10-node cluster: if one node is restarted, are all 4096 partitions redistributed, or only about 4096/10 ≈ 410 partitions? I expect the latter. Hanson


Post by sunguck.lee » Wed Jul 30, 2014 2:28 am

Hi Anshu.

According to my understanding, restarting just one node causes the whole cluster’s partitions (probably all 4096 of them) to be relocated (redistributed). Is that right?

And I am testing just a 10-node cluster, where each node has 90 GB of disk data and 15 GB of index data in memory. I restarted one node of the cluster 7 hours ago, but the migration process is still doing something (not completed yet). Each node is doing 5k reads/sec (50~60 MB/sec); actually, it repeats roughly 50 seconds of 5k reads/sec followed by about 30 seconds of sleep.

With a simple calculation, each node has read about 769 GB of disk data during the migration (there is no other load, only the migration): (50 seconds / 80 seconds) * 7 hours * 60 minutes * 60 seconds * 50 MB/s ≈ 769 GB.

I think that is too much disk reading, even if all partitions are being relocated. How can I reduce the migration time?

And one more thing: if I upgrade the hardware or the Aerospike engine, I think I have to do a rolling restart, one node at a time. But since restarting just one node takes 7~8 hours of migration, I would need about 7 * 10 (node count) hours. Do I have to wait for migrations to complete before upgrading the next node, or can I restart the next node as soon as I see the startup-completed message (I mean the “cake” message)?

Really thanks.


Post by anshu » Wed Aug 06, 2014 2:41 am

Hi sunguck.lee,

According to my understanding, restarting just one node causes the whole cluster’s partitions (probably all 4096 of them) to be relocated (redistributed). Is that right?

Yes, your understanding is correct presently.

And I am testing just a 10-node cluster, where each node has 90 GB of disk data and 15 GB of index data in memory. I restarted one node of the cluster 7 hours ago, but the migration process is still doing something (not completed yet). Each node is doing 5k reads/sec (50~60 MB/sec); actually, it repeats roughly 50 seconds of 5k reads/sec followed by about 30 seconds of sleep.

With a simple calculation, each node has read about 769 GB of disk data during the migration (there is no other load, only the migration): (50 seconds / 80 seconds) * 7 hours * 60 minutes * 60 seconds * 50 MB/s ≈ 769 GB.

I think that is too much disk reading, even if all partitions are being relocated.

How are you measuring the reads? iostat is not the right way to measure this, because the amount of migration reads depends on object count and size. Also, defrag causes reads as well.
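
For example, device-level numbers from iostat include defrag (and any other storage) reads alongside migration reads, so they over-count what migration alone is doing; the device path and log path below are placeholders for your own installation:

# Raw device reads, in MB, reported every 5 seconds
iostat -dxm 5 /dev/sdb

# Server-side migration counters on the same node (stat names vary slightly by version)
asinfo -v "statistics" | tr ';' '\n' | grep -i migrate

# Defragmentation activity appears in the server log (default path shown)
grep -i defrag /var/log/aerospike/aerospike.log | tail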

How can I reduce the migration time?

The page below describes how to slow down migrations; you can increase the same parameters to make them faster. http://www.aerospike.com/docs/operations/tune/migration/

And one more thing: if I upgrade the hardware or the Aerospike engine, I think I have to do a rolling restart, one node at a time. But since restarting just one node takes 7~8 hours of migration, I would need about 7 * 10 (node count) hours. Do I have to wait for migrations to complete before upgrading the next node, or can I restart the next node as soon as I see the startup-completed message (I mean the “cake” message)?

For a rolling restart, you only need to wait for the cake (startup-completed) message. You do not need to wait for migrations to finish before proceeding to the next node.
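
A rough per-node sketch of that procedure (the service name, log path, and the exact wording of the startup message are assumptions; adjust them for your installation):

# On the node being upgraded:
service aerospike stop
# ... upgrade the package / hardware here ...
service aerospike start

# Block until the startup-completed ("cake") line appears, then move on to the next node
tail -f /var/log/aerospike/aerospike.log | grep -m1 -i "cake"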


Post by anshu » Wed Aug 06, 2014 2:44 am

Hi Hanson

For a 10-node cluster: if one node is restarted, are all 4096 partitions redistributed, or only about 4096/10 ≈ 410 partitions? I expect the latter.

Presently, any cluster change will cause migration of all the partitions. We do not trust the sanity of the existing partitions, and verifying them would cost roughly as much as migrating them.