Investigation of regular high load on unused machines every 7 hours




#define LOAD_FREQ (5*HZ+1) /* 5 sec intervals */

So, contrary to the comment there, the load is not measured every 5 s. With HZ=1000 in CentOS 6, LOAD_FREQ is 5001 ticks, so the load is measured every 5.001 seconds!

Now, let’s do the math. If each measurement slips by one ms relative to a five-second grid, how long does it take for the slip to add up to another full five s? In other words, what is the interference period? We need to add one ms 5000 times. If we add it once every five s, it takes us 25000 s, which converted to hours is six hours, 56 minutes and 40 seconds. So we were slightly wrong: the spikes did not occur exactly every seven hours.
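The arithmetic above can be checked with a few lines of Python. This is only a sketch: the constant names mirror the kernel define, while `NOMINAL_TICKS` and `interference_period` are names made up for this illustration. It also shows what changes if the real 5.001 s interval is used instead of the rounded 5 s.

```python
from datetime import timedelta

HZ = 1000                     # kernel ticks per second (the CentOS 6 value quoted above)
NOMINAL_TICKS = 5 * HZ        # what the "5 sec intervals" comment suggests
LOAD_FREQ = 5 * HZ + 1        # what the define actually evaluates to


def interference_period(interval_ticks: int) -> timedelta:
    """Time until the per-sample slip adds up to one full nominal interval."""
    slip_ticks = LOAD_FREQ - NOMINAL_TICKS        # 1 tick = 1 ms at HZ=1000
    samples_needed = NOMINAL_TICKS // slip_ticks  # 5000 samples
    return timedelta(seconds=samples_needed * interval_ticks / HZ)


approx = interference_period(NOMINAL_TICKS)  # estimate using 5 s per sample
exact = interference_period(LOAD_FREQ)       # using the real 5.001 s per sample

print(approx)  # 6:56:40
print(exact)   # 6:56:45
```

The five-second difference between the two results does not change the conclusion: the period is a bit under seven hours either way.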

The regular high load was really caused by launching a bunch of monitoring scripts every minute, combined with the fact that the moment of load measurement slowly drifts relative to that schedule, with a period of almost 7 hours.
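To make the drift visible, here is a rough simulation. It assumes the scripts start exactly on the minute boundary and stay runnable for a hypothetical 500 ms (`WINDOW_MS` is an invented number, not a measured one), and that the first load sample lands on a minute boundary. It then records when a new run of samples starts catching the burst.

```python
SAMPLE_MS = 5001    # LOAD_FREQ at HZ=1000, expressed in milliseconds
MINUTE_MS = 60_000  # the monitoring scripts are launched once per minute
WINDOW_MS = 500     # hypothetical: how long the burst of scripts stays runnable


def episode_starts(total_samples, window_ms=WINDOW_MS):
    """Return the start time (in seconds) of each run of samples that hit the burst."""
    starts = []
    last_hit_ms = None
    for k in range(total_samples):
        t_ms = k * SAMPLE_MS
        # The sample sees the burst if it lands shortly after a minute boundary.
        if t_ms % MINUTE_MS < window_ms:
            # Treat it as a new episode if the previous hit was over two minutes ago.
            if last_hit_ms is None or t_ms - last_hit_ms > 2 * MINUTE_MS:
                starts.append(t_ms / 1000)
            last_hit_ms = t_ms
    return starts


starts = episode_starts(total_samples=20_000)
gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
print(starts)
print(gaps)
```

Under these assumptions the episodes begin roughly 25000 s apart, which matches the period computed earlier. A longer or shorter window changes how long each episode lasts, not how often it recurs.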

Well, yeah. It turned out that it was really happening everywhere. But in our defense, on a much, much smaller scale. Why? First, most of our machines are much more powerful, so the spike in load is not that big and it does not cross the alert threshold (which is based on the load_per_core value). Second, most of the machines actually do something, so you won’t notice a small spike in load occurring every ~seven hours in the plot, as the load curve is not stable anyway. And third, the majority of the hosts only have a few collectd exec plugins configured, so the number of processes executed at one moment is significantly smaller.




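The Linux kernel defines an array of three double words, avenrun. In each double word, the low 11 bits hold the fractional part of the load and the high 21 bits hold the integer part. Each time the number of CPU ticks that have elapsed exceeds the number of ticks the CPU provides in 5 seconds, the kernel recomputes the three load values. The loads are initialized to 0.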
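The 11-bit fractional layout is easier to see in code. The sketch below is a Python illustration rather than kernel code: `FSHIFT` and `FIXED_1` follow the kernel's naming, while the helper functions are hypothetical names chosen for this example.

```python
FSHIFT = 11            # number of fractional bits (the low 11 bits)
FIXED_1 = 1 << FSHIFT  # 1.0 in fixed point, i.e. 2048


def to_fixed(value: float) -> int:
    """Encode a load value as a fixed-point integer."""
    return round(value * FIXED_1)


def load_int(x: int) -> int:
    """Integer part: everything above the low 11 bits."""
    return x >> FSHIFT


def load_frac(x: int) -> int:
    """Fractional part, scaled to two decimal digits."""
    return ((x & (FIXED_1 - 1)) * 100) >> FSHIFT


def format_load(x: int) -> str:
    """Render a fixed-point load the way it is usually displayed."""
    return f"{load_int(x)}.{load_frac(x):02d}"


print(format_load(to_fixed(1.5)))   # 1.50
print(format_load(to_fixed(0.25)))  # 0.25
```

With 11 fractional bits the smallest representable step is 1/2048, which is finer than the two decimal places normally shown.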

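Suppose the average loads over the last 1, 5 and 15 minutes are load1, load5 and load15. When the next calculation point arrives, the kernel computes the new loads with the following formulas, where n is the number of active processes at that moment:
load1 = load1 * exp(-5 / 60) + n * (1 - exp(-5 / 60))
load5 = load5 * exp(-5 / 300) + n * (1 - exp(-5 / 300))
load15 = load15 * exp(-5 / 900) + n * (1 - exp(-5 / 900))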
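The three formulas are the same exponential moving average with different time windows. Here is a floating-point sketch of one update step. The function names are invented for this example, and the fixed-point factors are simply the decay terms scaled by 2048 and rounded, which is an assumption about how such constants would be derived.

```python
import math

SAMPLE_S = 5     # nominal seconds between two load calculations
FIXED_1 = 2048   # 1.0 in the kernel's 11-bit fixed-point format


def decay(window_s: int) -> float:
    """Weight kept from the previous load value for a given averaging window."""
    return math.exp(-SAMPLE_S / window_s)


def update(load: float, n: int, window_s: int) -> float:
    """One step of the moving average: old load decays, n active processes are mixed in."""
    e = decay(window_s)
    return load * e + n * (1 - e)


def fixed_point_factor(window_s: int) -> int:
    """The decay factor expressed as an 11-bit fixed-point integer."""
    return round(FIXED_1 * decay(window_s))


# Start from an idle machine and keep exactly one process runnable for one minute.
load1 = 0.0
for _ in range(12):  # 12 samples of 5 s each
    load1 = update(load1, n=1, window_s=60)

print(round(load1, 3))
print([fixed_point_factor(w) for w in (60, 300, 900)])
```

After one minute of a single runnable process the 1-minute load reaches 1 - exp(-1), about 0.63, not 1.0. The fixed-point factors come out as 1884, 2014 and 2037, which correspond to the kernel's EXP_1, EXP_5 and EXP_15 constants.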

Interested readers can also dig deeper into the Linux load average algorithm.