high-load-every-7-hours

The strange abnormally high load that shows up every 7 hours

Investigation of regular high load on unused machines every 7 hours

The original article:

https://blog.avast.com/investigation-of-regular-high-load-on-unused-machines-every-7-hours

Here are its conclusions:

#define LOAD_FREQ (5*HZ+1) /* 5 sec intervals */

So, contrary to the comment there, the load is not measured every 5s (HZ=1000 in CentOS6) but every 5.001 seconds!

Now, let’s do the math. If we add one ms every five s, how long does it take to add another full five s? In other words, what is the interference period? We need to add one ms 5000x. If we add it every five s, it takes us 25000s. Which, converted to hours, is: six hours, 56 minutes and 40 seconds. So we were slightly wrong, the spikes did not occur every seven hours.
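The arithmetic in the paragraph above can be checked with a short Python sketch. The function names are made up for illustration. `HZ = 1000` is the CentOS 6 value quoted above. The quoted text rounds each sample to 5 s and gets 25000 s; counting the real 5.001 s per sample gives 25005 s, which is only five seconds longer.

```python
HZ = 1000                # timer ticks per second (CentOS 6 value quoted above)
LOAD_FREQ = 5 * HZ + 1   # ticks between two load samples, per the kernel macro


def drift_period_seconds(hz=HZ, load_freq=LOAD_FREQ, nominal_s=5):
    """Seconds until the sampling instant has slipped by one full nominal interval."""
    drift_ticks = load_freq - nominal_s * hz          # extra ticks per sample (1 tick = 1 ms)
    samples_needed = (nominal_s * hz) // drift_ticks  # 5000 samples to slip by 5 s
    return samples_needed * load_freq / hz            # each sample really takes 5.001 s


def hms(seconds):
    """Split a duration in seconds into (hours, minutes, seconds)."""
    s = int(round(seconds))
    return s // 3600, (s % 3600) // 60, s % 60


print(hms(5000 * 5))              # the article's approximation: (6, 56, 40)
print(drift_period_seconds())     # 25005.0
print(hms(drift_period_seconds()))  # (6, 56, 45)
```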

The regular high load was really caused by launching a bunch of monitoring scripts every minute, combined with the fact that the load measurement instant slowly and slightly drifts, lining up with the scripts' start time with a period of almost 7 hours.
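To make the interference visible, here is a small simulation sketch. It assumes the scripts start exactly on each minute boundary and stay runnable for 100 ms; the 100 ms figure (`BURST_MS`) is a hypothetical value chosen for illustration, not something taken from the article. It records every load sample that lands inside such a burst and then looks at how far apart the groups of hits are.

```python
SAMPLE_MS = 5 * 1000 + 1   # one load sample every 5.001 s (LOAD_FREQ at HZ=1000)
MINUTE_MS = 60_000         # assumption: scripts start exactly on each minute
BURST_MS = 100             # hypothetical: scripts stay runnable for 100 ms


def burst_hits(duration_s):
    """Times (in seconds) of load samples that land inside a script burst."""
    hits = []
    t_ms = 0
    while t_ms < duration_s * 1000:
        if t_ms % MINUTE_MS < BURST_MS:
            hits.append(t_ms / 1000)
        t_ms += SAMPLE_MS
    return hits


def cluster_starts(hits, gap_s=600):
    """First hit of each group of hits separated by more than gap_s seconds."""
    starts = []
    previous = None
    for t in hits:
        if previous is None or t - previous > gap_s:
            starts.append(t)
        previous = t
    return starts


starts = cluster_starts(burst_hits(60_000))
gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
print(starts)
print(gaps)
```

Under these assumptions the groups of hits come out roughly 25,000 s apart, which matches the "almost 7 hours" in the text. Between groups the samples fall in the quiet part of the minute and never see the scripts.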

Well, yeah. It turned out that it was really happening everywhere. But to our defense, on a much, much smaller scale. Why? First, most of our machines are much more powerful, so the spike in load is not that big and it does not trigger the threshold for alerts (it is based on load_per_core value). Second, most of the machines actually do something, so you won’t notice a small spike in load occurring every ~seven hours in the plot, as the load curve is not stable anyway. And third, the majority of the hosts only have a few collectd exec plugins configured, so the number of processes executed at one moment is significantly smaller.

Here is also an article from Alibaba Cloud:

https://yq.aliyun.com/articles/484253

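As the function in that article shows, the kernel computes load with a smoothed moving-average algorithm. The Linux system load is the average length of the run queue. Note that "runnable processes" means processes on the run queue, not just those that are currently executing; that is, processes whose state is TASK_RUNNING or TASK_UNINTERRUPTIBLE.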

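The Linux kernel defines a three-element array of double words, avenrun. The low 11 bits of each double word hold the fractional part of the load and the high 21 bits hold the integer part. Each time the timer ticks that fit in a 5-second interval have elapsed, the kernel recomputes the three load values. The loads are initialized to 0.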
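The fixed-point layout can be sketched in a few lines of Python. `FSHIFT` and `FIXED_1` are the names the kernel uses for the 11-bit shift and for the value representing 1.0. The helper functions are illustrative names, and the sketch leaves out the small rounding offset the kernel adds before printing.

```python
FSHIFT = 11              # number of fractional bits
FIXED_1 = 1 << FSHIFT    # 2048 represents a load of 1.0


def to_fixed(load):
    """Encode a float load as an 11-bit-fraction fixed-point integer."""
    return int(round(load * FIXED_1))


def load_int(x):
    """Integer part: the bits above the 11 fractional bits."""
    return x >> FSHIFT


def load_frac(x):
    """Fractional part scaled to two decimal digits (truncated, not rounded)."""
    return ((x & (FIXED_1 - 1)) * 100) >> FSHIFT


def format_load(x):
    """Render a fixed-point load the way it is usually displayed, e.g. 1.50."""
    return f"{load_int(x)}.{load_frac(x):02d}"


print(to_fixed(1.5))               # 3072
print(format_load(to_fixed(1.5)))  # 1.50
```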

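Suppose the average loads over the last 1, 5, and 15 minutes are load1, load5, and load15. When the next calculation time arrives, the kernel computes the loads with the following formulas:
load1 = load1 * exp(-5 / 60) + n * (1 - exp(-5 / 60))
load5 = load5 * exp(-5 / 300) + n * (1 - exp(-5 / 300))
load15 = load15 * exp(-5 / 900) + n * (1 - exp(-5 / 900))
Here exp(x) is e raised to the power x, and n is the current length of the run queue.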
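A minimal floating-point sketch of this update rule follows, assuming an ideal 5-second sampling interval. The kernel itself works in fixed point with precomputed decay constants; `update_loads` and `simulate` are illustrative names, not kernel functions.

```python
import math

SAMPLE_S = 5                # nominal seconds between two load calculations
WINDOWS_S = (60, 300, 900)  # the 1-, 5- and 15-minute windows


def update_loads(loads, n):
    """Apply one exponential-smoothing step to (load1, load5, load15)."""
    return tuple(
        load * math.exp(-SAMPLE_S / window) + n * (1 - math.exp(-SAMPLE_S / window))
        for load, window in zip(loads, WINDOWS_S)
    )


def simulate(n, steps, loads=(0.0, 0.0, 0.0)):
    """Feed a constant run-queue length n for the given number of samples."""
    for _ in range(steps):
        loads = update_loads(loads, n)
    return loads


# One minute (12 samples) with exactly one runnable task, starting from idle.
print(simulate(n=1, steps=12))

# The decay factors scaled by 2048, i.e. as 11-bit fixed-point values.
print([round(2048 * math.exp(-SAMPLE_S / w)) for w in WINDOWS_S])  # [1884, 2014, 2037]
```

After one minute of constant load 1, load1 has only reached 1 - 1/e, about 0.63. That lag is why a brief burst of processes shows up as a small bump in the load rather than a jump to the full queue length.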

If you are interested, you can dig deeper into how Linux's load average algorithm works:

https://www.teamquest.com/import/pdfs/whitepaper/ldavg2.pdf