Linux Monitoring: Dedicated Server Administration Tips
When something goes wrong with an application on your dedicated server, the sys admin’s or developer’s first instinct is to see what could be causing the server to run slow. You run netstat, top, free, ps, vmstat, and iostat, but to draw a conclusion you need to know exactly what you are looking at.
Dedicated server performance problems can be network, CPU, memory, or storage related. Here we look at CPU and memory metrics to figure out what is going on in your machine. We use top to measure and display all of your Linux processes. top is a command-line tool that helps you sift through your processor activity: you see processes in real time and can sort them by different fields.
Linux processes: from the top down
The top screen is divided into the cumulative view at the top and the metrics by process shown below that.
You can supply different command line options to top as described by the man pages to show different metrics. Here is the default view:
At the top, the screen shows the number of tasks (processes). If you start top with the -H option, or press H while it is running, it lists threads instead.
A multithreaded program, such as Google Chrome, lets one program run more than one task at a time. This way, for example, Chrome can download one page while you are looking at another tab and still respond to events like clicking the scroll bar. One process can spawn many threads. Threads can have performance problems of their own, for example when they become deadlocked because each is waiting for another to release something it needs.
In the tasks line, we have 286 total: 1 running, 281 sleeping (keep reading to find out why this is not always accurate), 3 stopped* and 1 zombie*.
*stopped—a process that has been suspended. For example, pressing Ctrl-Z while running vi stops it.
*zombie—this is a “child” process that has exited but whose “parent” has not yet collected its exit status. It sounds ghastly, but that’s how the architecture reads.
We will move on to the next line for now. Keep reading for more.
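The same tally can be reproduced outside of top. The following is a minimal Python sketch, assuming a Linux /proc filesystem; the function names parse_stat and count_states are ours and not part of any tool. It reads the one-letter state from each /proc/<pid>/stat file (R running, S sleeping, D uninterruptible sleep, T stopped, Z zombie) and counts them. How top folds those letters into its four summary buckets may differ by version, so the sketch reports the raw letters.

```python
import os
from collections import Counter


def parse_stat(stat_text):
    """Return (pid, command, state, ppid) from the text of /proc/<pid>/stat."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    rest = stat_text[right + 1:].split()
    return int(stat_text[:left]), stat_text[left + 1:right], rest[0], int(rest[1])


def count_states(proc_root="/proc"):
    """Tally process states (R, S, D, T, Z, ...) for every numeric /proc entry."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                state = parse_stat(f.read())[2]
        except OSError:
            continue  # the process exited between listdir() and open()
        counts[state] += 1
    return counts
```

Calling count_states() on a Linux machine returns a Counter keyed by state letter. The totals will drift slightly from what top shows, because processes start and exit between samples.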
CPU metrics explained
Start with the third line in our image. After %Cpu(s) in the summary area, we see ‘us’. That means user CPU time, or time spent running user programs. In our first example, the machine is spending 2.7% of its time on this.
Are you familiar with the other values?
sy—system cpu time, or time spent with the kernel doing low-level functions such as scheduling tasks and responding to interrupts (see below) as opposed to running application program instructions (such as reading through an array of objects or doing math).
ni—time spent running niced user processes, meaning ones that have been given a lower priority with nice. Such a process is not a resource hog or in a hurry.
id—time the CPU spends idle.
wa—time the CPU spends waiting for I/O, such as a disk read, to complete.
hi—time spent servicing hardware interrupts, which happen when a device sends information that needs an immediate response.
si—time spent servicing software interrupts, which are the same idea except that they are raised by software rather than by a device.
st—steal time, which matters on virtual machines: time the hypervisor (host) gave to other guests while this machine was waiting to run.
KiB Mem, KiB Swap—these lines show memory that is in use and memory that is free, either in RAM (random access memory, the physical memory chips) or in swap space, which is virtual memory written out to disk. Swap space on solid-state storage makes paging in and out (swapping) faster than on a spinning disk.
To illustrate further, the basic principle of a cache is to keep a copy of data from storage in memory. Retrieving data from the cache has far lower latency than going back to storage, since no disk controller or spinning disk is involved. Data records held in cache memory can be recalled quickly. In top, the memory that the kernel devotes to caching disk reads and writes shows up in the memory summary as buffers and cache.
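To see where the memory figures come from, here is a small Python sketch that derives similar numbers from /proc/meminfo. It is an illustration under assumptions: the function names are ours, and the formula (buff/cache = Buffers + Cached + SReclaimable, used = total - free - buff/cache) is one common definition. Versions of top and free differ on how they compute "used", so expect small discrepancies.

```python
def parse_meminfo(text):
    """Map each key in /proc/meminfo text to its integer value (KiB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def summarize(info):
    """Derive used / free / buff-cache / swap figures, all in KiB."""
    # Assumption: buff/cache = Buffers + Cached + SReclaimable, and
    # used = total - free - buff/cache. Tool versions differ on this.
    buff_cache = info["Buffers"] + info["Cached"] + info.get("SReclaimable", 0)
    return {
        "total": info["MemTotal"],
        "free": info["MemFree"],
        "buff_cache": buff_cache,
        "used": info["MemTotal"] - info["MemFree"] - buff_cache,
        "swap_used": info["SwapTotal"] - info["SwapFree"],
    }


def read_summary(path="/proc/meminfo"):
    """Read the live file and summarize it (Linux only)."""
    with open(path) as f:
        return summarize(parse_meminfo(f.read()))
```

A large buff_cache value next to a small free value is usually not a problem: the kernel gives cache memory back when applications need it.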
The Zombie Processes
The screenshot below shows a zombie process. To dig further, you can run pstree (part of the psmisc package on most distributions) with the -p option to print the process tree with PIDs and find the zombie’s parent.
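To list zombies together with the parents responsible for them, one option is to filter the output of ps. The sketch below assumes the procps version of ps, which accepts -eo pid,ppid,stat,comm; the function names find_zombies and list_zombies are hypothetical.

```python
import subprocess


def find_zombies(ps_output):
    """Return (pid, ppid, command) for each zombie in `ps -eo pid,ppid,stat,comm` output."""
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        # A STAT value beginning with "Z" marks a zombie (defunct) process.
        if len(fields) == 4 and fields[2].startswith("Z"):
            zombies.append((int(fields[0]), int(fields[1]), fields[3]))
    return zombies


def list_zombies():
    """Run ps and report zombies; needs the ps command from procps."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_zombies(result.stdout)
```

The parent PID is the useful part. A zombie has already exited, so it cannot be killed; it disappears once its parent collects its exit status or the parent itself exits.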
CPU Performance Per Core
Most servers have more than one CPU, or a CPU with more than one core, so the %CPU that top reports for a single multithreaded process can be greater than 100%. A core is a processing unit within the CPU that the operating system treats as a CPU in its own right. If you press 1 while running top, it shows %Cpu for each core separately.
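Per-core figures like these can be derived from the cpu0, cpu1, ... lines of /proc/stat, which hold running tick counters. The Python sketch below takes two snapshots and computes how busy each core was in between. It is a sketch under assumptions: the function names are ours, and counting iowait as idle time is our choice of definition, not necessarily the one top uses.

```python
import time


def parse_cpu_lines(stat_text):
    """Map 'cpu', 'cpu0', 'cpu1', ... to their tick counters from /proc/stat text."""
    cpus = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            cpus[fields[0]] = [int(x) for x in fields[1:]]
    return cpus


def busy_percent(before, after):
    """Percentage of time a CPU was not idle between two samples."""
    # Fields: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already counted inside user/nice, so only the first 8 are summed.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)
    return 100.0 * (total - idle) / total


def per_core_usage(first_text, second_text):
    """Busy percentage per core, skipping the aggregate 'cpu' line."""
    first, second = parse_cpu_lines(first_text), parse_cpu_lines(second_text)
    return {
        name: busy_percent(first[name], second[name])
        for name in second
        if name in first and name != "cpu"
    }


def sample(interval=1.0, path="/proc/stat"):
    """Take two live snapshots `interval` seconds apart (Linux only)."""
    with open(path) as f:
        first = f.read()
    time.sleep(interval)
    with open(path) as f:
        second = f.read()
    return per_core_usage(first, second)
```

Two snapshots are needed because the counters only ever increase; a single read tells you totals since boot, not current load.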
The process level of top
The lower part of the top screen shows metrics by process.
You can use the cursor keys to move up and down (i.e., scroll through the processes) and left to right (i.e., if it will not all fit on the screen).
The default values shown here are:
PR—scheduling priority. RT means realtime.
NI—nice value. A negative value means the task has a higher priority.
VIRT—virtual memory size, meaning code size, shared libraries, and data created in memory, like objects. In other words, the program, the subroutines it uses, and the memory the program consumes.
RES—physical memory used. This is reflected in the %mem calculation.
SHR—shared or space that could be shared with another process. That does not mean that it is shared now.
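S—process status. You might notice that all of the tasks here are shown as sleeping, which raises a question for the careful reader: if they are sleeping, how can they be using CPU? The status is a snapshot taken at the instant top samples the process, while %CPU is averaged over the refresh interval, so a task that ran briefly between samples can still show as sleeping. The man page for top also notes that tasks shown as running are better thought of as ready to run, and that how many you see depends on the machine and on top’s delay interval. When we ran this on an Intel Xeon, it showed running processes; on an Intel Core i7 it did not.
%CPU—the task’s share of CPU time since the last screen update.
%MEM—percentage of physical memory the task is using.
TIME—total CPU time the task has used since it started, not the wall-clock time it has been running.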
COMMAND—this is the crux of the issue as you want to know what program is doing what.
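Several of these columns can be read for a single process from /proc/<pid>/status. The Python sketch below is an approximation rather than a copy of what top does: the function names are ours, VIRT and RES are taken from the VmSize and VmRSS fields, and SHR is approximated as RssFile + RssShmem, which is an assumption and needs a reasonably recent kernel. The nice value comes from os.getpriority.

```python
import os


def parse_status_memory(status_text):
    """Pull the memory fields (KiB) out of /proc/<pid>/status text."""
    values = {}
    for line in status_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in ("VmSize", "VmRSS", "RssFile", "RssShmem"):
            values[key] = int(rest.split()[0])
    return values


def top_like_columns(status_text):
    """Approximate top's VIRT, RES and SHR columns, in KiB."""
    mem = parse_status_memory(status_text)
    return {
        "VIRT": mem.get("VmSize", 0),
        "RES": mem.get("VmRSS", 0),
        # Assumption: SHR is approximated by file-backed plus shared-memory pages.
        "SHR": mem.get("RssFile", 0) + mem.get("RssShmem", 0),
    }


def describe(pid):
    """Read the live status file for `pid` and add its nice value (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        columns = top_like_columns(f.read())
    columns["NI"] = os.getpriority(os.PRIO_PROCESS, pid)
    return columns
```

For example, describe(os.getpid()) reports the columns for the Python interpreter itself. Kernel threads have no Vm fields, so they come back as zeros.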
Sending Email Alerts for Linux Servers
The good news is that you can send email alerts to yourself when metrics exceed certain thresholds. The bad news is that Gmail and other email providers will probably block them as spam, since your Linux server is not a recognized SMTP server. So you could try relaying through Google’s smtp.gmail.com server instead. Either way, you would need to install mailutils or some other mailer. Alternatively, mail the alerts to your company email address and ask the email admin to whitelist your server if a spam rule is blocking you.
To monitor your server and send out alerts, you can run top in batch mode like this and then grep out whatever text you want:
top -b -n 1| grep %Cpu
%Cpu(s): 6,3 us, 2,0 sy, 0,1 ni, 90,3 id, 1,2 wa, 0,0 hi, 0,0 si, 0,0 st
Then write a shell script, or more easily a Python script, to parse the line using a regular expression to divide it into tokens. Then check each token against some threshold.
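As a rough illustration of that approach, here is a minimal Python sketch. It assumes the newer top summary format shown above (a number, a space, then the field name) and accepts either "." or "," as the decimal mark, since the sample output uses commas. The threshold values in LIMITS and the function names are hypothetical; pick limits that suit your own workload.

```python
import re
import subprocess

# Hypothetical limits, in percent; tune them for your own workload.
LIMITS = {"us": 80.0, "sy": 30.0, "wa": 10.0, "st": 5.0}

# One number (with "." or "," as the decimal mark) followed by a field name.
FIELD = re.compile(r"(\d+(?:[.,]\d+)?)\s+(us|sy|ni|id|wa|hi|si|st)\b")


def parse_cpu_line(line):
    """Turn a '%Cpu(s): ...' summary line into {'us': 6.3, 'sy': 2.0, ...}."""
    return {name: float(num.replace(",", ".")) for num, name in FIELD.findall(line)}


def over_limit(values, limits=LIMITS):
    """Return (field, value, limit) for every field above its limit."""
    return [
        (name, values[name], limit)
        for name, limit in limits.items()
        if values.get(name, 0.0) > limit
    ]


def current_cpu_line():
    """Run top once in batch mode and return its %Cpu summary line, or ''."""
    out = subprocess.run(
        ["top", "-b", "-n", "1"], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        if line.startswith("%Cpu"):
            return line
    return ""
```

Calling over_limit(parse_cpu_line(current_cpu_line())) from a scheduled job gives a list that is empty when everything is within limits; a non-empty list is what you would format into the body of the alert and hand to your mailer.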
So there you have an overview of some of the performance metrics associated with CPU, process, and memory monitoring. The next time you jump on an “all hands on deck” call because the system is down, you can speak with authority about what you are looking at.