CPU unusually high on one node of 8 node cluster


#1

We are running version 3.9.3 on an 8-node cluster of i2.xlarge boxes, using a Memory + SSD configuration and replication factor 1 (1 master + 1 slave).

Recently we faced an issue where, on just one node, CPU went above 90% and disk IO was also above 90%. It lasted around 50 minutes and then returned to normal.

I checked the box at that time using iotop, and it showed the asd process doing the IO. Throughput was the same on all the boxes. I am pretty sure this has nothing to do with reads or writes coming from the client side, as that would have affected at least one more node.

What could be the possible cause of this spike on a single Aerospike node?


#2

Iotop shows IO, not CPU. Your thread subject suggests you think the bottleneck is CPU. I’m confused.

At any rate, everything should be distributed fairly evenly. Do you have proxy traffic on the cluster? Can you show us a few snapshots of “show latency” in asadm? The output of info, “show distribution” and “show config diff” would also be useful.


#3

Disk IO was high as well as CPU, and only on one node. We have all parameters (CPU, latency, etc.) monitored on New Relic, and latencies and throughput were similar on all nodes.

Output of show distribution:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~userdata - TTL Distribution in Seconds~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                    Percentage of records having ttl less than or equal to value measured in Seconds                                   
                                               Node      10%      20%       30%       40%       50%       60%       70%       80%        90%       100%   
ip-10-0-23-112.ap-southeast-1.compute.internal:3000   466560   622080    933120   1555200   1866240   2177280   2488320   5443200   11352960   15552000   
ip-10-0-23-154.ap-southeast-1.compute.internal:3000   466560   622080    933120   1555200   1866240   2177280   2488320   5443200   11352960   15552000   
ip-10-0-23-164.ap-southeast-1.compute.internal:3000   466560   622080    933120   1555200   1866240   2177280   2488320   5443200   11352960   15552000   
ip-10-0-23-190.ap-southeast-1.compute.internal:3000   466560   622080    933120   1555200   1866240   2177280   2488320   5443200   11352960   15552000   
ip-10-0-23-219.ap-southeast-1.compute.internal:3000   466560   622080    933120   1555200   1866240   2177280   2488320   5443200   11352960   15552000   
ip-10-0-23-89.ap-southeast-1.compute.internal:3000    466563   622084    933126   1555210   1866252   2177294   2488336   5443235   11353033   15552100   
ip-10-0-23-94.ap-southeast-1.compute.internal:3000    466560   622080    933120   1555200   1866240   2177280   2488320   5443200   11352960   15552000   
ip-10-0-23-95.ap-southeast-1.compute.internal:3000    359350   539025   1078050   1437400   1796750   2156100   2515450   5390250   11319525   17967500   
Number of rows: 8

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~user_config_data - TTL Distribution in Seconds~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                Percentage of records having ttl less than or equal to value measured in Seconds                
                                               Node   10%   20%   30%   40%   50%   60%   70%   80%   90%   100%   
ip-10-0-23-112.ap-southeast-1.compute.internal:3000     0     0     0     0     0     0     0     0     0      0   
ip-10-0-23-154.ap-southeast-1.compute.internal:3000     0     0     0     0     0     0     0     0     0      0   
ip-10-0-23-164.ap-southeast-1.compute.internal:3000     0     0     0     0     0     0     0     0     0      0   
ip-10-0-23-190.ap-southeast-1.compute.internal:3000     0     0     0     0     0     0     0     0     0      0   
ip-10-0-23-219.ap-southeast-1.compute.internal:3000     0     0     0     0     0     0     0     0     0      0   
ip-10-0-23-89.ap-southeast-1.compute.internal:3000      0     0     0     0     0     0     0     0     0      0   
ip-10-0-23-94.ap-southeast-1.compute.internal:3000      0     0     0     0     0     0     0     0     0      0   
ip-10-0-23-95.ap-southeast-1.compute.internal:3000      0     0     0     0     0     0     0     0     0      0   
Number of rows: 8

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~userdata - Object Size Distribution in Record Blocks~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Percentage of records having objsz less than or equal to value measured in Record Blocks            
                                               Node   10%   20%   30%   40%   50%   60%   70%   80%   90%   100%   
ip-10-0-23-112.ap-southeast-1.compute.internal:3000     2     2     3     6     7     7     8    10    10     99   
ip-10-0-23-154.ap-southeast-1.compute.internal:3000     2     2     3     6     7     7     8    10    10     99   
ip-10-0-23-164.ap-southeast-1.compute.internal:3000     2     2     3     6     7     7     8    10    10     99   
ip-10-0-23-190.ap-southeast-1.compute.internal:3000     2     2     3     6     7     7     8    10    10     99   
ip-10-0-23-219.ap-southeast-1.compute.internal:3000     2     2     3     6     7     7     8    10    10     99   
ip-10-0-23-89.ap-southeast-1.compute.internal:3000      2     2     3     6     7     7     8    10    10     99   
ip-10-0-23-94.ap-southeast-1.compute.internal:3000      2     2     3     6     7     7     8    10    10     99   
ip-10-0-23-95.ap-southeast-1.compute.internal:3000      2     2     3     6     7     7     8    10    10     99   
Number of rows: 8

~~~~~~~~~~~~~~~~~~~~~~~~~~user_config_data - Object Size Distribution in Record Blocks~~~~~~~~~~~~~~~~~~~~~~~~~~
            Percentage of records having objsz less than or equal to value measured in Record Blocks            
                                               Node   10%   20%   30%   40%   50%   60%   70%   80%   90%   100%   
ip-10-0-23-112.ap-southeast-1.compute.internal:3000     0     0     0     0     0     0     0     0     0      0   
ip-10-0-23-154.ap-southeast-1.compute.internal:3000     0     0     0     0     0     0     0     0     0      0   
ip-10-0-23-164.ap-southeast-1.compute.internal:3000     0     0     0     0     0     0     0     0     0      0   
ip-10-0-23-190.ap-southeast-1.compute.internal:3000     0     0     0     0     0     0     0     0     0      0   
ip-10-0-23-219.ap-southeast-1.compute.internal:3000     0     0     0     0     0     0     0     0     0      0   
ip-10-0-23-89.ap-southeast-1.compute.internal:3000      0     0     0     0     0     0     0     0     0      0   
ip-10-0-23-94.ap-southeast-1.compute.internal:3000      0     0     0     0     0     0     0     0     0      0   
ip-10-0-23-95.ap-southeast-1.compute.internal:3000      0     0     0     0     0     0     0     0     0      0   
Number of rows: 8

Output of show configuration diff

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE           :   10.0.23.145:3000   10.0.23.186:3000   10.0.23.230:3000   10.0.23.96:3000   ip-10-0-23-112.ap-southeast-1.compute.internal:3000   ip-10-0-23-154.ap-southeast-1.compute.internal:3000   ip-10-0-23-164.ap-southeast-1.compute.internal:3000   ip-10-0-23-190.ap-southeast-1.compute.internal:3000   ip-10-0-23-219.ap-southeast-1.compute.internal:3000   ip-10-0-23-89.ap-southeast-1.compute.internal:3000   ip-10-0-23-94.ap-southeast-1.compute.internal:3000   ip-10-0-23-95.ap-southeast-1.compute.internal:3000   
migrate-threads:   N/E                N/E                N/E                N/E               1                                                     1                                                     1                                                     1                                                     1                                                     1                                                    3                                                    1                                                    

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                            :   10.0.23.145:3000   10.0.23.186:3000   10.0.23.230:3000   10.0.23.96:3000   ip-10-0-23-112.ap-southeast-1.compute.internal:3000                                                                                    ip-10-0-23-154.ap-southeast-1.compute.internal:3000                                                                                    ip-10-0-23-164.ap-southeast-1.compute.internal:3000                                                                                    ip-10-0-23-190.ap-southeast-1.compute.internal:3000                                                                                    ip-10-0-23-219.ap-southeast-1.compute.internal:3000                                                                                    ip-10-0-23-89.ap-southeast-1.compute.internal:3000                                                                                      ip-10-0-23-94.ap-southeast-1.compute.internal:3000                                                                    ip-10-0-23-95.ap-southeast-1.compute.internal:3000                                                                                     
heartbeat.addresses             :   N/E                N/E                N/E                N/E               10.0.23.112:3002                                                                                                                       10.0.23.154:3002                                                                                                                       10.0.23.164:3002                                                                                                                       10.0.23.190:3002                                                                                                                       10.0.23.219:3002                                                                                                                       10.0.23.89:3002                                                                                                                         10.0.23.94:3002                                                                                                       10.0.23.95:3002                                                                                                                        
heartbeat.mesh-seed-address-port:   N/E                N/E                N/E                N/E               10.0.23.154:3002,10.0.23.164:3002,10.0.23.190:3002,10.0.23.219:3002,10.0.23.230:3002,10.0.23.89:3002,10.0.23.94:3002,10.0.23.95:3002   10.0.23.112:3002,10.0.23.164:3002,10.0.23.190:3002,10.0.23.219:3002,10.0.23.230:3002,10.0.23.89:3002,10.0.23.94:3002,10.0.23.95:3002   10.0.23.112:3002,10.0.23.154:3002,10.0.23.190:3002,10.0.23.219:3002,10.0.23.230:3002,10.0.23.89:3002,10.0.23.94:3002,10.0.23.95:3002   10.0.23.112:3002,10.0.23.154:3002,10.0.23.164:3002,10.0.23.219:3002,10.0.23.230:3002,10.0.23.89:3002,10.0.23.94:3002,10.0.23.95:3002   10.0.23.112:3002,10.0.23.154:3002,10.0.23.164:3002,10.0.23.190:3002,10.0.23.230:3002,10.0.23.89:3002,10.0.23.94:3002,10.0.23.95:3002   10.0.23.112:3002,10.0.23.154:3002,10.0.23.164:3002,10.0.23.190:3002,10.0.23.219:3002,10.0.23.230:3002,10.0.23.94:3002,10.0.23.95:3002   10.0.23.144:3002,10.0.23.154:3002,10.0.23.164:3002,10.0.23.219:3002,10.0.23.89:3002,10.0.23.95:3002,10.0.23.96:3002   10.0.23.154:3002,10.0.23.164:3002,10.0.23.186:3002,10.0.23.190:3002,10.0.23.219:3002,10.0.23.89:3002,10.0.23.94:3002,10.0.23.96:3002   

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~userdata Namespace Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                           :   ip-10-0-23-112.ap-southeast-1.compute.internal:3000   ip-10-0-23-154.ap-southeast-1.compute.internal:3000   ip-10-0-23-164.ap-southeast-1.compute.internal:3000   ip-10-0-23-190.ap-southeast-1.compute.internal:3000   ip-10-0-23-219.ap-southeast-1.compute.internal:3000   ip-10-0-23-89.ap-southeast-1.compute.internal:3000   ip-10-0-23-94.ap-southeast-1.compute.internal:3000   ip-10-0-23-95.ap-southeast-1.compute.internal:3000   
migrate-order                  :   5                                                     5                                                     5                                                     5                                                     5                                                     5                                                    1                                                    1                                                    
storage-engine.cold-start-empty:   false                                                 false                                                 false                                                 false                                                 false                                                 false                                                true                                                 true

#4

Please post the latency snapshots as well. I want to know whether there are proxy transactions.


#5

What are proxy transactions?

By latency snapshots, do you mean the output of asloglatency for that time period?


#6

Can you show us a few snapshots of “show latency” in asadm?


#7

Proxy: In rare cases during cluster re-configurations when the Client Layer may be briefly out of date, the Transaction Processing module transparently proxies the request to another node.

http://www.aerospike.com/docs/architecture


#8

Following are a couple of snapshots:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~read Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                               Node                 Time   Ops/Sec   >1Ms   >8Ms   >64Ms   
                                                  .                 Span         .      .      .       .   
ip-10-0-23-112.ap-southeast-1.compute.internal:3000   04:08:09->04:08:19    1163.1   0.64    0.0     0.0   
ip-10-0-23-154.ap-southeast-1.compute.internal:3000   04:08:18->04:08:28    1195.3   0.56    0.0     0.0   
ip-10-0-23-164.ap-southeast-1.compute.internal:3000   04:08:11->04:08:21    1158.1   0.92    0.0     0.0   
ip-10-0-23-190.ap-southeast-1.compute.internal:3000   04:08:16->04:08:26    1052.5   2.11   0.02     0.0   
ip-10-0-23-219.ap-southeast-1.compute.internal:3000   04:08:12->04:08:22    1102.3   1.14    0.0     0.0   
ip-10-0-23-89.ap-southeast-1.compute.internal:3000    04:08:08->04:08:18    1109.7   0.74    0.0     0.0   
ip-10-0-23-94.ap-southeast-1.compute.internal:3000    04:08:14->04:08:24    1051.5   0.68    0.0     0.0   
ip-10-0-23-95.ap-southeast-1.compute.internal:3000    04:08:10->04:08:20    1123.0   0.86    0.0     0.0   
Number of rows: 8

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~write Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                               Node                 Time   Ops/Sec   >1Ms   >8Ms   >64Ms   
                                                  .                 Span         .      .      .       .   
ip-10-0-23-112.ap-southeast-1.compute.internal:3000   04:08:09->04:08:19    1088.5   0.39    0.0     0.0   
ip-10-0-23-154.ap-southeast-1.compute.internal:3000   04:08:18->04:08:28    1275.7   0.29    0.0     0.0   
ip-10-0-23-164.ap-southeast-1.compute.internal:3000   04:08:11->04:08:21    1133.6   0.44    0.0     0.0   
ip-10-0-23-190.ap-southeast-1.compute.internal:3000   04:08:16->04:08:26    1087.0   0.72    0.0     0.0   
ip-10-0-23-219.ap-southeast-1.compute.internal:3000   04:08:12->04:08:22    1109.4   1.02   0.03     0.0   
ip-10-0-23-89.ap-southeast-1.compute.internal:3000    04:08:08->04:08:18    1079.5    0.4    0.0     0.0   
ip-10-0-23-94.ap-southeast-1.compute.internal:3000    04:08:14->04:08:24    1062.5   0.35    0.0     0.0   
ip-10-0-23-95.ap-southeast-1.compute.internal:3000    04:08:10->04:08:20    1122.4   0.57   0.01     0.0   
Number of rows: 8
[root@ip-10-0-23-154 aerospike]# asadm -e "show latency"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~read Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                               Node                 Time   Ops/Sec   >1Ms   >8Ms   >64Ms   
                                                  .                 Span         .      .      .       .   
ip-10-0-23-112.ap-southeast-1.compute.internal:3000   04:08:49->04:08:59    1117.3   0.95    0.0     0.0   
ip-10-0-23-154.ap-southeast-1.compute.internal:3000   04:08:48->04:08:58    1138.7   0.76    0.0     0.0   
ip-10-0-23-164.ap-southeast-1.compute.internal:3000   04:08:51->04:09:01    1178.1   0.76    0.0     0.0   
ip-10-0-23-190.ap-southeast-1.compute.internal:3000   04:08:56->04:09:06     994.2   2.57    0.0     0.0   
ip-10-0-23-219.ap-southeast-1.compute.internal:3000   04:08:52->04:09:02    1283.6    2.7   0.02     0.0   
ip-10-0-23-89.ap-southeast-1.compute.internal:3000    04:08:48->04:08:58    1119.9   0.79    0.0     0.0   
ip-10-0-23-94.ap-southeast-1.compute.internal:3000    04:08:54->04:09:04    1108.4   1.09    0.0     0.0   
ip-10-0-23-95.ap-southeast-1.compute.internal:3000    04:08:50->04:09:00    1157.5   1.14   0.01     0.0   
Number of rows: 8

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~write Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                               Node                 Time   Ops/Sec   >1Ms   >8Ms   >64Ms   
                                                  .                 Span         .      .      .       .   
ip-10-0-23-112.ap-southeast-1.compute.internal:3000   04:08:49->04:08:59    1292.6   0.63    0.0     0.0   
ip-10-0-23-154.ap-southeast-1.compute.internal:3000   04:08:48->04:08:58    1340.3    0.4    0.0     0.0   
ip-10-0-23-164.ap-southeast-1.compute.internal:3000   04:08:51->04:09:01    1318.3   0.37    0.0     0.0   
ip-10-0-23-190.ap-southeast-1.compute.internal:3000   04:08:56->04:09:06    1149.3   1.12    0.0     0.0   
ip-10-0-23-219.ap-southeast-1.compute.internal:3000   04:08:52->04:09:02    1371.6   1.99   0.15     0.0   
ip-10-0-23-89.ap-southeast-1.compute.internal:3000    04:08:48->04:08:58    1187.0   0.39    0.0     0.0   
ip-10-0-23-94.ap-southeast-1.compute.internal:3000    04:08:54->04:09:04    1327.3   0.56   0.01     0.0   
ip-10-0-23-95.ap-southeast-1.compute.internal:3000    04:08:50->04:09:00    1293.0   0.73   0.01     0.0   
Number of rows: 8

#9

Ah damn, asadm doesn’t show that. I thought it did, sorry. At any rate, things look evenly spread. Is the issue occurring at the time of the snapshot, or only during specific times?
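
To put a rough number on how evenly the load is spread, here is a minimal Python sketch. It assumes you copy the Ops/Sec column by hand from a “show latency” snapshot (the values below are from the first read Latency table in post #8); the function name `max_relative_deviation` is just an illustrative helper, not anything shipped with asadm.

```python
from statistics import mean

# Ops/Sec per node, copied by hand from the first "read Latency" snapshot in post #8.
read_ops = {
    "ip-10-0-23-112": 1163.1,
    "ip-10-0-23-154": 1195.3,
    "ip-10-0-23-164": 1158.1,
    "ip-10-0-23-190": 1052.5,
    "ip-10-0-23-219": 1102.3,
    "ip-10-0-23-89": 1109.7,
    "ip-10-0-23-94": 1051.5,
    "ip-10-0-23-95": 1123.0,
}


def max_relative_deviation(ops_by_node):
    """Return (node, deviation) for the node farthest from the cluster mean.

    The deviation is expressed as a fraction of the mean, e.g. 0.05 for 5%.
    """
    avg = mean(ops_by_node.values())
    node = max(ops_by_node, key=lambda n: abs(ops_by_node[n] - avg))
    return node, abs(ops_by_node[node] - avg) / avg


node, deviation = max_relative_deviation(read_ops)
print(f"{node} is farthest from the mean: {deviation:.1%}")
```

For this snapshot the farthest node is within about 7% of the mean, which fits the "evenly spread" reading. A snapshot taken during the spike would be the more interesting input.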

Also, I noticed in your config diff that nodes report 1, 3, and N/E for migrate-threads. I assume you’ve tried bringing these all down to 1?
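
If the diff gets hard to read by eye, a small Python sketch like the one below can flag the odd node out. It assumes the values are typed in by hand from the “show config diff” output in post #3; the four entries that report N/E are left out, and `config_outliers` is a hypothetical helper name used only for illustration.

```python
from collections import Counter

# migrate-threads per node, copied by hand from the Service Configuration diff in post #3.
# The four entries that report N/E are intentionally omitted.
migrate_threads = {
    "ip-10-0-23-112": "1",
    "ip-10-0-23-154": "1",
    "ip-10-0-23-164": "1",
    "ip-10-0-23-190": "1",
    "ip-10-0-23-219": "1",
    "ip-10-0-23-89": "1",
    "ip-10-0-23-94": "3",
    "ip-10-0-23-95": "1",
}


def config_outliers(values_by_node):
    """Return the most common value and the nodes whose value differs from it."""
    majority, _count = Counter(values_by_node.values()).most_common(1)[0]
    outliers = {n: v for n, v in values_by_node.items() if v != majority}
    return majority, outliers


majority, outliers = config_outliers(migrate_threads)
print(f"majority value: {majority}, outliers: {outliers}")
```

The same helper works for any other row of the diff, such as migrate-order or storage-engine.cold-start-empty, as long as the values are entered the same way.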

Have you seen any exceptions in the calling app? Wondering about hot keys.

Lastly, as part of 3.12 they released a cool asadm command called health. Try it out and let us know if it finds anything.


#10

This was the first instance of such an observation. It is not a hot key issue, as there were no such errors on the client side.

Could some background process have been running in Aerospike during that time?


#11

I’m really not sure. If this is important to resolve, it may be worth getting a contract with the Aerospike folks for support. http://www.aerospike.com/contact-us/ They always have come through for me :slight_smile: