Understanding the top command

Example Output

load averages:  0.16,  0.11,  0.09                puffy.echinopsys.de 01:55:44
55 processes: 54 idle, 1 on processor                       up 722 days, 22:38
CPU0 states:  0.0% user,  0.0% nice,  1.0% system,  1.0% interrupt, 98.0% idle
CPU1 states:  0.2% user,  0.0% nice,  2.4% system,  0.0% interrupt, 97.4% idle
Memory: Real: 167M/861M act/tot Free: 2104M Cache: 593M Swap: 0K/3312M

  PID USERNAME PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
 7055 www        2    0 6020K 7220K onproc/1  kqread    1:27 97.24% nginx
88770 _mysql     2    0  318M   62M idle      poll     25:41  0.00% mysqld
96529 _ntp       2  -20  888K 2480K sleep/1   poll      8:31  0.00% ntpd
99818 root       2    0  105M 3612K sleep/0   kqread    4:14  0.00% php-fpm-7.0
21945 root       2    0 9392K 3200K sleep/1   kqread    4:12  0.00% php-fpm-5.6
68396 _pflogd    4    0  716K  552K sleep/1   bpf       3:30  0.00% pflogd

So let's look into these values:

CPU load

The three numbers represent averages over progressively longer periods of time (one, five and fifteen minute averages).

These three numbers are shown with the uptime(1) and top(1) commands.
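
The same three values can also be read from a script. The following is a minimal sketch in Python, assuming a Unix-like system where the standard library's os.getloadavg() is available; the helper name describe_load is made up for this example.

```python
import os


def describe_load(averages):
    """Format a (1 min, 5 min, 15 min) load average tuple for display."""
    one, five, fifteen = averages
    return f"1 min: {one:.2f}, 5 min: {five:.2f}, 15 min: {fifteen:.2f}"


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages
    # and raises OSError if they cannot be obtained.
    print(describe_load(os.getloadavg()))
```

Fed with the values from the example output, describe_load((0.16, 0.11, 0.09)) yields the same numbers top(1) prints in its first line.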

TL;DR

Long Story

Imagine a single-core CPU as a single-lane highway that can carry 100 cars at once. With 100 cars on it, the highway is full.

Imagine that 100 cars are driving down the highway. This means the highway is exactly at capacity. That's a load of 1 (=100%).

Imagine that another car wants to enter the highway. That means 101 cars want to drive on a highway that can only handle 100 cars, so 1 car has to wait in a queue. Now the highway is overloaded by 1 car (=1%). This is a load of 1.01 (=101%).

If the highway shows a load of 0.00, there's no traffic on the highway at all. In fact, any load between 0.00 and 1.00 means there's no queue, and an arriving car can drive right on.

So we can say the "load of this highway" is

In analogy to that the "CPU load" is

As we have seen in the example, 1.00 means the highway is exactly at capacity. All is still good, but if traffic gets a little heavier, things are going to slow down.

So, your CPU load should ideally stay below 1.00. You are still okay if you get some temporary spikes above 1.00 ... but when you're consistently above 1.00, you need to worry.

But this means the ideal load is not 1.00. The problem with a load of 1.00 is that you have no certainty that your system runs well in the near future.

These are best practices when checking CPU load:

CPU Load    Rule of Thumb
0.70        Need to look into it
> 0.70      It's time to investigate before things get worse
5.0         You could be in serious trouble

But wait, why does it only say "could" at a load of 5.00?

Because on a multi-processor system, the load is relative to the number of processor cores available. The "100% utilization" mark is 1.00 on a single-core system, 2.00 on a dual-core, 4.00 on a quad-core, etc.

Thinking of the highway example above, a load of 1.00 means that the highway is at 100% capacity. But if the highway had two lanes, then 1.00 would mean 50% capacity.

The same applies to CPUs: a load of 1.00 is 100% CPU utilization on a single-core box. On a dual-core box, a load of 2.00 is 100% CPU utilization.
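
The scaling by core count can be written down in a few lines. This is a minimal sketch in Python, assuming a Unix-like system; load_per_core is a name invented for this example, and os.cpu_count() reports logical CPUs, which may differ from physical cores.

```python
import os


def load_per_core(load, cores):
    """Scale a load average by the core count (1.00 = fully utilized)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


if __name__ == "__main__":
    # os.cpu_count() may return None, so fall back to a single core.
    cores = os.cpu_count() or 1
    one_minute = os.getloadavg()[0]
    per_core = load_per_core(one_minute, cores)
    print(f"{one_minute:.2f} on {cores} core(s) = {per_core:.0%} utilization")
```

With this, a load of 1.00 on a dual-core box comes out as 0.5 (50%), and a load of 2.00 as 1.0 (100%), matching the highway picture.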

CPU states

The CPU(s) row shows CPU state percentages based on the interval since the last refresh. Each value has a label.
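
To illustrate what "based on the interval since the last refresh" means, here is a rough sketch in Python. It assumes the kernel keeps a cumulative tick counter per CPU state and that the percentages are derived from the difference between two snapshots. The function name and the counter values are invented for this example and are not taken from a real system.

```python
def state_percentages(previous, current):
    """Return each state's share of the ticks spent between two snapshots.

    Both arguments map a state name to a cumulative tick count.
    """
    deltas = {state: current[state] - previous[state] for state in current}
    total = sum(deltas.values())
    if total <= 0:
        # No ticks elapsed between the snapshots, nothing to report.
        return {state: 0.0 for state in deltas}
    return {state: 100.0 * ticks / total for state, ticks in deltas.items()}


# Hypothetical counters at two consecutive refreshes.
before = {"user": 1000, "nice": 0, "system": 500, "interrupt": 100, "idle": 8400}
after = {"user": 1002, "nice": 0, "system": 524, "interrupt": 100, "idle": 9374}

if __name__ == "__main__":
    for state, percent in state_percentages(before, after).items():
        print(f"{percent:5.1f}% {state}")
```

The made-up counters were chosen so that the result reproduces the CPU1 line of the example output (0.2% user, 2.4% system, 97.4% idle).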

The order of the labels can differ and also the title (a word or two letters) can be different in your version of top(1).

What is the Meaning of the 3 CPU states?

How to interpret the Values?

Use these Values to examine System Health

On a busy server you can expect the amount of time the CPU spends in idle to be small. However, if a system rarely has any idle time then it is either overloaded, or something is wrong.

If a system suddenly jumps from having spare CPU cycles to running flat out, then the first thing to check is the amount of time the CPU spends running user space processes. If this is high, then it probably means that a process has gone crazy and is eating up all the CPU time, like nginx(8) in the example output at the top of this page.

Sometimes high kernel usage is acceptable. For example, a program that does lots of console I/O can cause the kernel usage to spike. However, if it remains high for long periods of time, it could be an indication that something is wrong. A possible cause of such spikes is a problem with a driver or kernel module.

High waiting on I/O means that there are some intensive I/O tasks running on the system that don't use up much CPU time. If this number is high for anything other than short bursts then it means that

Another indication of a broken peripheral could be high interrupt processing. Some hardware device is causing lots of hardware interrupts or a process is issuing lots of software interrupts.

On virtual machines a value for stolen time shows how long the virtual CPU has spent waiting for the hypervisor to service another virtual CPU running on a different virtual machine. A large stolen time means that the host system running the hypervisor is too busy.

Memory Usage

This row includes information about physical and virtual memory allocation. In the example output it shows real memory (active/total), free memory, cache, and swap (used/total).

Physical memory is your RAM, physical pieces of hardware that provide Random Access Memory.

Swap is virtual memory which can be a file or a partition on your hard drive that is essentially used as extra RAM. It is not a separate RAM chip though, it resides on your hard drive.
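
If you want to work with these numbers in a script, the Memory row can be split into its fields. The following Python sketch assumes the row looks exactly like the one in the example output and that the K, M and G suffixes are 1024-based. The field names (real_active, swap_used, and so on) are invented for this example.

```python
import re

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

MEMORY_LINE = re.compile(
    r"Memory: Real: (\S+)/(\S+) act/tot Free: (\S+) Cache: (\S+) Swap: (\S+)/(\S+)"
)


def to_bytes(text):
    """Convert a size such as '167M' or '0K' to a number of bytes."""
    match = re.fullmatch(r"(\d+)([KMG])", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def parse_memory_line(line):
    """Split the Memory row into named byte counts."""
    match = MEMORY_LINE.search(line)
    if match is None:
        raise ValueError("not a recognized Memory line")
    names = ("real_active", "real_total", "free", "cache", "swap_used", "swap_total")
    return {name: to_bytes(value) for name, value in zip(names, match.groups())}


if __name__ == "__main__":
    sample = "Memory: Real: 167M/861M act/tot Free: 2104M Cache: 593M Swap: 0K/3312M"
    memory = parse_memory_line(sample)
    print(f"swap in use: {memory['swap_used']} of {memory['swap_total']} bytes")
```

In the example output no swap is in use (0K of 3312M), which usually suggests that the machine has enough RAM for its current workload.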

Process List

The last section provides information about the currently running processes. It consists of the following columns:

Column    Content
PID       Process ID: a unique number used to identify the process
USERNAME  The username of whoever launched the process
PRI       Priority: the priority of the process. Processes with higher priority will be favored by the kernel
NICE      Nice value: the value used to adjust the process's priority (-20..+20)
SIZE      The total memory size of the process, including text, data, and stack segments
RES       Resident memory size: the non-swapped physical memory the process is using
STATE     The current state of the process and the CPU number (only on multiprocessor systems)
WAIT      If the process is asleep, the name of the wait channel
TIME      The CPU time the process has spent in system and user mode
CPU       The CPU usage (default sort field)
COMMAND   The name of the process (in angle brackets if the process is swapped out)
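
As a small illustration of how the columns line up, the sketch below splits one row of the process list into the fields described above. It is written in Python and assumes every row has exactly these eleven whitespace-separated columns, as in the example output. The function name parse_process_row is made up for this example.

```python
COLUMNS = ("PID", "USERNAME", "PRI", "NICE", "SIZE", "RES",
           "STATE", "WAIT", "TIME", "CPU", "COMMAND")


def parse_process_row(row):
    """Turn one line of the process list into a dict keyed by column name."""
    # Split on whitespace, but keep everything after the tenth gap as COMMAND.
    fields = row.strip().split(None, len(COLUMNS) - 1)
    if len(fields) != len(COLUMNS):
        raise ValueError("unexpected number of columns")
    record = dict(zip(COLUMNS, fields))
    for name in ("PID", "PRI", "NICE"):
        record[name] = int(record[name])
    # The CPU column carries a trailing percent sign.
    record["CPU"] = float(record["CPU"].rstrip("%"))
    return record


if __name__ == "__main__":
    row = " 7055 www        2    0 6020K 7220K onproc/1  kqread    1:27 97.24% nginx"
    process = parse_process_row(row)
    print(process["COMMAND"], process["PID"], process["CPU"])
```

Sorting a list of such records by the CPU key in descending order gives the same ordering as the default sort field of top(1).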

What's next?

For further information, see the excellent top(1) manpage.