Linux OS Monitoring

Install

cd ${OE_AGENT_HOME}/checks_enabled
ln -s ../checks_available/{check_cpustats.py,check_disks.py,check_load_average.py,check_memory.py,check_network_bytes.py} ./

Configure

For most of installations, our defaults works well, but you can edit conf/system.ini if you need fine tuning. For each of system check you van enable or disable static alerts and set appropriate thresholds.

[Load Average]
static_enabled: True
high: 95
severe : 100

[Disk Stats]
static_enabled: True
high : 90
severe : 95

[Memory Stats]
static_enabled: True
high : 90
severe : 95

[CPU Stats]
static_enabled: True
percore_stats: False
high: 90
severe: 95

[Network Stats]
localhost = False
rated = True

Restart

${OE_AGENT_HOME}/oddeye.sh restart
CPU

System CPU statistics are collected via check_cpustats.py module: It reads /proc/stat file for CPU related information. The Check does not require any additional dependencies, and it should provide following metrics:

Provides

Name Description Type Unit
cpu_idle percent of free CPU resources gauge Percent
cpu_iowait CPU percent spent on waiting for I/O operation gauge Percent
cpu_irq CPU percent spent on handling hardware interrupts gauge Percent
cpu_load Total CPU/Core load percent gauge Percent
cpu_nice CPU percent of processing user level processes with positive nice gauge Percent
cpu_softirq CPU percent spent on handling soft interrupts gauge Percent
cpu_system CPU percent of processing system level processes gauge Percent
cpu_user CPU percent of processing user level processes gauge Percent
Memory

Collected via check_memory.py module, it takes memory related information from /proc/meminfo file, and provides following metrics:

Provides

Name Description Type Unit
mem_active Memory that is being used by a particular process gauge Bytes
mem_available Not active / free memory gauge Bytes
mem_buffers The total amount of memory used for critical system buffers gauge Bytes
mem_cached Amount of cached data. Free’d if there is not enough free memory in the system. gauge Bytes
mem_inactive Memory that was allocated to a process that is no longer running gauge Bytes
mem_swapcached Amount of swapped memory gauge Bytes
mem_total Total amount of memory gauge Bytes
mem_used_percent Aggregated from metrics above, total memory usage percentage gauge Percent
Disk

Collected via check_disks.py module, this check uses several resources, to provide statistics about disk IO and Space usage. Read/write statistics are taken from /sys/block/{DISK_NAME}/stat files, also we use Linux df command, to get information about space usage. IO statistics are taken from /proc/diskstats. For each disk we take following metrics:

Provides

Name Description Type Unit
drive_bytes_available Amounts of unused bytes gauge Bytes
drive_bytes_used Amounts of used bytes gauge Bytes
drive_io_percent_used Percent of used IO resources per second of disk drive gauge Percent
drive_percent_used Percentage of disk space usage gauge Percent
drive_reads Read operations per second performed on disk drive rate Bytes
drive_writes Write operations per second performed on disk drive rate Bytes
Network

Collected via check_network_bytes.py module, It collects metrics about all installed interfaces by reading /sys/class/net/{NIC}/statistics/rx_bytes file.

Provides

Name Description Type Unit
bytes_rx Amounts of received bytes rate Bytes
bytes_tx Amounts of sent bytes rate Bytes
IP Conntrack

Enable this module only if you use connection tracking. This check makes sense, if you have router like system, or if by some reason nf_conntrack kernel module is loaded. It reads /proc/sys/net/ipv4/netfilter/ip_conntrack_max|ip_conntrack_count files and provides :

Provides

Name Description Type Unit
conntrack_max Maximum configured Conntack value gauge None
conntrack_cur Current IP Conntack value gauge None
Load Average

This is one of the most important metrics in Linux (Maybe even the most). System load average shows ammount of processes, waiting in queue for CPU. It is calculater by 1,5 and 15 minute averages.In most of live systems it must have a value, which is less than total amount of CPUs detected by system. For example of your Server is equiped with 2x Quad cores CPUs, with enabled hyper trading Linux will see 16 CPUs, so Load Average should be below 16. Our check provides:

Provides

Name Description Type Unit
sys_load_1 System load average for last minute gauge None
sys_load_5 System load average for last 5 minutes gauge None
sys_load_15 System load average for last 15 minutes gauge None

As regular checks and also triggers special WARNING, when value of sys_load_1 is more that 90% of amount of CPU cores and ERROR, when its equal or more that 100%. This behavior can be changed by editing check_load_average.py and changing :

warn_level = 90
crit_level = 100

to desired values. However these values are quite reasonable so, use it without modifications, if you are not for 100% sure that you need to change it

TCP Connections

This check provides status of TCP connections to systems. It parses /proc/net/tcp and provides following metrics.

Provides

Name Description Type Unit
tcp_close TCP connections with CLOSED state gauge None
tcp_close_wait The remote side has been shut down and is now waiting for the socket to close gauge None
tcp_closing TCP connections in closing state gauge None
tcp_established The socket has a connection established gauge None
tcp_fin_wait1 The socket is closed, and the connection is now shutting down. gauge None
tcp_last_ack TCP connections in last ack state gauge None
tcp_listen TCP listening sockets gauge None
tcp_max_states TCP connections in max state gauge None
tcp_new_syn_recv TCP connections in new syn recv state gauge None
tcp_syn_recv TCP connections in syn recvstate gauge None
tcp_syn_sent TCP connections in syn recv state gauge None
tcp_time_wait TCP connections in time wait state gauge None

On some very heavy loaded systems, this check may become expensive, please make sure its suits your system resources before enabling it on systems with 20k+ TCP ESTABLISHED connections.

BTRFS check

BTRFS check can be very useful in combinations with regular Drive IO checks on systems which uses BTRFS file system. Its monitors BTRFS volumes and checks for volume errors. It also contains special check which will send manually defined ERROR and WARNING messages if values of checked parameters are above Zero. Manual Error handling can be enabled or disable by setting up enable_special variable at the top of script. Its accepts True or False parameters, defaults is True.

Provides

btrfs_dev_{NAME}_corruption_errs
btrfs_dev_{NAME}_flush_io_errs
btrfs_dev_{NAME}_generation_errs
btrfs_dev_{NAME}_read_io_errs
btrfs_dev_{NAME}_write_io_errs