Problem :
The nodes of our Hadoop cluster run Red Hat 5.3 with kernel 2.6.18-194.17.4 (an old kernel version). We found that some hosts are at 100% CPU utilization, and specifically that all CPU cores are at nearly 100% sy (system time):
top - 20:56:21 up 340 days, 22:28, 1 user, load average: 2297.16, 2298.69, 2298.88
Tasks: 17923 total, 132 running, 17753 sleeping, 0 stopped, 38 zombie
Cpu(s): 0.2%us, 99.7%sy, 0.1%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 35840000k total, 33995836k used, 1844164k free, 2432312k buffers
Swap: 0k total, 0k used, 0k free, 12193444k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3362 eo 18 0 0 0 0 Z 10.0 0.0 101:32.83 java <defunct>
12818 eo 22 0 3896m 1.2g 18m S 7.4 3.6 728:05.05 java
21396 qifei 19 0 26240 13m 812 R 6.1 0.0 9:48.80 top
1425 eo 18 0 3632m 1.0g 26m D 4.2 3.0 42:11.92 java
1398 eo 15 0 0 0 0 Z 4.2 0.0 41:09.95 java <defunct>
1595 eo 18 0 0 0 0 Z 3.8 0.0 41:11.94 java <defunct>
6079 root 25 0 93744 19m 3004 R 3.7 0.1 20:34.63 apolloHostComma
6254 root 25 0 8068 456 380 R 3.7 0.0 20:28.19 date
2671 root 25 0 25004 3996 1404 R 2.5 0.0 265:33.27 apolloHostComma
4573 root 25 0 23420 2352 1376 R 2.5 0.0 20:10.33 apolloHostComma
4710 root 25 0 25400 4436 1404 R 2.5 0.0 19:50.97 apolloHostComma
5047 root 25 0 174m 17m 5852 R 2.5 0.1 19:19.46 yum
5568 root 25 0 25136 4104 1404 R 2.5 0.0 19:36.23 apolloHostComma
5649 root 25 0 24344 3296 1400 R 2.5 0.0 19:54.40 apolloHostComma
6132 root 25 0 25004 4056 1404 R 2.5 0.0 19:26.55 apolloHostComma
7084 snitch 25 0 8708 252 112 R 2.5 0.0 20:06.13 sh
7201 root 25 0 8368 716 584 R 2.5 0.0 19:27.99 ps
7749 root 25 0 27808 2840 1484 R 2.5 0.0 19:58.13 auth-sync.pl
7975 root 25 0 31168 4000 1548 R 2.5 0.0 20:04.87 report
7977 root 25 0 9772 772 476 R 2.5 0.0 19:55.76 apollo-polling-
8174 snitch 25 0 8708 708 588 R 2.5 0.0 19:52.57 sh
8307 eo 25 0 26008 3000 1480 R 2.5 0.0 19:49.94 perl
8583 root 25 0 25268 4296 1404 R 2.5 0.0 19:05.10 apolloHostComma
9832 eo 18 0 0 0 0 Z 2.5 0.0 18:08.24 java <defunct>
9856 eo 18 0 3454m 12m 7572 D 2.5 0.0 18:08.24 java
9882 eo 18 0 0 0 0 Z 2.5 0.0 18:24.09 java <defunct>
666 root 25 0 174m 17m 5876 R 2.5 0.1 12:47.36 yum
1343 root 25 0 74820 1240 592 R 2.5 0.0 277:03.67 crond
1571 eo 18 0 3649m 563m 26m D 2.5 1.6 20:27.40 java
1601 eo 18 0 0 0 0 Z 2.5 0.0 21:15.44 java <defunct>
2858 root 25 0 24872 3944 1404 R 2.5 0.0 20:30.74 apolloHostComma
2881 root 25 0 53016 15m 1852 R 2.5 0.0 19:25.97 apolloHostComma
3166 root 25 0 29396 4340 1452 R 2.5 0.0 264:38.79 RotateLogFiles.
4392 root 25 0 29988 6980 1520 R 2.5 0.0 20:59.13 apolloHostComma
4608 root 25 0 55224 15m 1804 R 2.5 0.0 20:46.56 apolloHostComma
4624 root 25 0 24740 3808 1404 R 2.5 0.0 20:46.17 apolloHostComma
4637 root 25 0 25004 4036 1404 R 2.5 0.0 20:46.43 apolloHostComma
4681 root 25 0 28736 3608 1452 R 2.5 0.0 20:55.49 RotateLogFiles.
4760 eo 18 0 0 0 0 Z 2.5 0.0 20:04.55 java <defunct>
4979 root 25 0 74820 860 212 R 2.5 0.0 19:58.63 crond
5023 root 25 0 25484 2492 1472 R 2.5 0.0 19:41.18 auth-sync.pl
5460 eo 25 0 23288 2220 1272 R 2.5 0.0 19:37.19 cron-babysit
5551 eo 25 0 31916 6912 1608 R 2.5 0.0 19:36.55 cron-babysit
5560 root 25 0 22496 696 532 R 2.5 0.0 20:42.10 report
5564 root 25 0 8708 244 92 R 2.5 0.0 19:36.86 SnitchAgentCont
From the first several rows of the top output, it is not obvious how the CPU is being consumed.
Sometimes we did see kswapd0 in the top rows; this could be because we have no swap space.
It is impossible to print the command line of the java process with top, ps, or /proc//cmdline, because the console hangs if we do so.
My question is: how can we find out what is pegging the CPU in the kernel?
Solution :
The system has 17923 processes, of which 132 are in the Running state.
The rate at which the running processes are scheduled is high enough to yield steady load averages of almost 2300. That scheduling itself, and more generally managing the entire process list and the resources those processes use, is likely the bulk of your 99.7% sy value, which is far more CPU than is spent actually executing the running processes (the remaining 0.3% in us and ni combined).
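To watch the us/ni/sy split without relying on top, the same percentages can be derived from two samples of the aggregate "cpu" line in /proc/stat. The following is only a minimal sketch: the function names (parse_cpu_line, cpu_shares, read_cpu_line) are made up for illustration, and it assumes the usual column order user, nice, system, idle, iowait, irq, softirq, steal.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
# Kernels that expose fewer columns are handled by zip(), which stops
# at the shortest input; any extra columns (guest time) are ignored.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn 'cpu  u n s i ...' into a dict of cumulative ticks per field."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:])))


def cpu_shares(before, after):
    """Percentage of elapsed CPU time per field between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {name: b[name] - a[name] for name in a if name in b}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {name: 100.0 * d / total for name, d in delta.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as f:
        return f.readline()
```

Calling read_cpu_line() twice a few seconds apart and passing both lines to cpu_shares() should give numbers comparable to the Cpu(s) row of top. On a host in the state shown above, nearly all of the delta would be expected to land in "system".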
I also see several zombies around. They might indicate some misbehaving programs, but they could also indicate that the system is so busy that it can't even find time to clean up defunct processes (which would also fall in the sy category, BTW).
You need to clean up a large portion of those processes if you want to get any decent level of performance out of this machine.
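To decide which processes to clean up first, it may help to count them by state and command name. The sketch below reads /proc/<pid>/stat rather than the command line, since the question reports that reading the command line hangs. Whether stat reads stay responsive on a host in this condition is an assumption I have not verified. The helper names (parse_stat, tally, read_all_stats) are hypothetical.

```python
import os
from collections import Counter


def parse_stat(text):
    """Extract (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST ')' rather than on spaces.
    """
    open_idx = text.index("(")
    close_idx = text.rindex(")")
    pid = int(text[:open_idx])
    comm = text[open_idx + 1:close_idx]
    state = text[close_idx + 1:].split()[0]
    return pid, comm, state


def tally(stat_texts):
    """Count processes per state and per (state, command) pair."""
    by_state, by_command = Counter(), Counter()
    for text in stat_texts:
        try:
            _, comm, state = parse_stat(text)
        except (ValueError, IndexError):
            continue  # skip malformed or empty entries
        by_state[state] += 1
        by_command[(state, comm)] += 1
    return by_state, by_command


def read_all_stats(proc="/proc"):
    """Yield the stat contents of every live process (Linux only)."""
    for entry in os.listdir(proc):
        if entry.isdigit():
            try:
                with open(os.path.join(proc, entry, "stat")) as f:
                    yield f.read()
            except OSError:
                continue  # process exited between listdir() and open()
```

Running tally(read_all_stats()) and printing by_command.most_common(20) would list the most numerous (state, command) pairs. In the top output above, the many R-state apolloHostComma entries and the Z-state java entries are the kind of groups this is meant to surface.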