System Checks - Cloud scale intelligent monitoring & analytics platform

Linux OS Monitoring

Install

cd ${PUYPUY_HOME}/checks_enabled
ln -s ../checks_available/{check_cpustats.py,check_disks.py,check_load_average.py,check_memory.py,check_network_bytes.py} ./

Configure

The defaults work well for most installations, but you can edit conf/system.ini if you need fine tuning. For each system check you can enable or disable static alerts and set appropriate thresholds.

[Load Average]
static_enabled: True
high: 95
severe : 100

[Disk Stats]
static_enabled: True
high : 90
severe : 95

[Memory Stats]
static_enabled: True
high : 90
severe : 95

[CPU Stats]
static_enabled: True
percore_stats: False
high: 90
severe: 95

[Network Stats]
localhost = False
rated = True
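
As an illustration of how such a file can be read, here is a minimal sketch that uses Python's standard configparser module. The load_thresholds helper and the inline SAMPLE text are hypothetical and exist only for this example; the actual checks may load conf/system.ini differently.

```python
import configparser

# Inline copy of two sections from the example above (illustration only).
SAMPLE = """
[Load Average]
static_enabled: True
high: 95
severe : 100

[Network Stats]
localhost = False
rated = True
"""


def load_thresholds(text, section):
    """Return (enabled, high, severe) for one section of system.ini-style text."""
    parser = configparser.ConfigParser()  # accepts both ':' and '=' delimiters
    parser.read_string(text)
    enabled = parser.getboolean(section, "static_enabled")
    high = parser.getfloat(section, "high")
    severe = parser.getfloat(section, "severe")
    return enabled, high, severe


print(load_thresholds(SAMPLE, "Load Average"))
```

Both the "key: value" and "key = value" spellings used in the file are accepted by configparser, so the mixed style shown above parses without changes.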

Restart

${PUYPUY_HOME}/puypuy.sh restart

CPU

System CPU statistics are collected by the check_cpustats.py module, which reads CPU-related information from the /proc/stat file. The check does not require any additional dependencies and provides the following metrics:

Provides

Name	Description	Type	Unit
cpu_idle	Percent of free CPU resources	gauge	Percent
cpu_iowait	CPU percent spent waiting for I/O operations	gauge	Percent
cpu_irq	CPU percent spent on handling hardware interrupts	gauge	Percent
cpu_load	Total CPU/Core load percent	gauge	Percent
cpu_nice	CPU percent of processing user level processes with positive nice	gauge	Percent
cpu_softirq	CPU percent spent on handling soft interrupts	gauge	Percent
cpu_system	CPU percent of processing system level processes	gauge	Percent
cpu_user	CPU percent of processing user level processes	gauge	Percent
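
The counters in /proc/stat are cumulative, so percentages have to be derived from the difference between two readings. The sketch below shows one way to do that. The function names are hypothetical, and defining cpu_load as 100 minus cpu_idle is an assumption, not necessarily how check_cpustats.py computes it.

```python
# Order of the first seven counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of counters."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))


def cpu_percentages(before, after):
    """Turn two counter snapshots into per-state percentages plus cpu_load."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {}  # no time elapsed between the two snapshots
    result = {"cpu_" + name: 100.0 * delta / total for name, delta in deltas.items()}
    # Assumption: total load is everything that is not idle time.
    result["cpu_load"] = 100.0 - result["cpu_idle"]
    return result


# Two made-up snapshots; a real check would read /proc/stat twice.
first = parse_cpu_line("cpu  1000 50 300 8000 100 10 40 0 0 0")
second = parse_cpu_line("cpu  1100 50 350 8800 120 10 70 0 0 0")
stats = cpu_percentages(first, second)
print(stats)  # cpu_idle is 80.0 and cpu_load is 20.0 for these samples
```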

Memory

Memory statistics are collected by the check_memory.py module, which reads memory-related information from the /proc/meminfo file and provides the following metrics:

Provides

Name	Description	Type	Unit
mem_active	Memory that is being used by a particular process	gauge	Bytes
mem_available	Not active / free memory	gauge	Bytes
mem_buffers	The total amount of memory used for critical system buffers	gauge	Bytes
mem_cached	Amount of cached data, which is freed if the system runs short of free memory	gauge	Bytes
mem_inactive	Memory that was allocated to a process that is no longer running	gauge	Bytes
mem_swapcached	Amount of swapped memory	gauge	Bytes
mem_total	Total amount of memory	gauge	Bytes
mem_used_percent	Aggregated from metrics above, total memory usage percentage	gauge	Percent
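
To illustrate the kind of parsing involved, the sketch below converts /proc/meminfo text into byte values. The helper names are hypothetical, and the used-percent formula (total minus available, divided by total) is an assumption; the module may aggregate the metrics differently.

```python
def parse_meminfo(text):
    """Map /proc/meminfo keys to values in bytes (the file reports kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        number = int(fields[0])
        # Rows with a "kB" suffix are sizes; rows without one are plain counts.
        values[key.strip()] = number * 1024 if fields[1:] == ["kB"] else number
    return values


def used_percent(meminfo):
    """Assumed formula: share of total memory that is not available."""
    total = meminfo["MemTotal"]
    return 100.0 * (total - meminfo["MemAvailable"]) / total


# Made-up sample in the /proc/meminfo layout.
SAMPLE = """MemTotal:        8000000 kB
MemFree:         1000000 kB
MemAvailable:    2000000 kB
Buffers:          200000 kB
Cached:          1500000 kB
HugePages_Total:       0
"""

info = parse_meminfo(SAMPLE)
print(info["MemTotal"], used_percent(info))
```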

Disk

Disk statistics are collected by the check_disks.py module, which uses several sources to report disk IO and space usage. Read/write statistics are taken from the /sys/block/{DISK_NAME}/stat files, space usage comes from the Linux df command, and IO statistics are taken from /proc/diskstats. For each disk the check provides the following metrics:

Provides

Name	Description	Type	Unit
drive_bytes_available	Amount of unused bytes	gauge	Bytes
drive_bytes_used	Amount of used bytes	gauge	Bytes
drive_io_percent_used	Percent of used IO resources per second of disk drive	gauge	Percent
drive_percent_used	Percentage of disk space usage	gauge	Percent
drive_reads	Read operations per second performed on disk drive	rate	Bytes
drive_writes	Write operations per second performed on disk drive	rate	Bytes
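
The following sketch shows how per-second rates and a busy percentage could be derived from two readings of a /sys/block/{DISK_NAME}/stat line. It relies on the kernel's documented field order for that file. The function names are hypothetical, and treating the io_ticks field as the basis for drive_io_percent_used is an assumption about the check.

```python
def parse_block_stat(text):
    """Pick the counters of interest out of one /sys/block/<disk>/stat line."""
    fields = [int(value) for value in text.split()]
    return {
        "reads": fields[0],        # read I/Os completed
        "writes": fields[4],       # write I/Os completed
        "io_ticks_ms": fields[9],  # milliseconds the device had I/O in flight
    }


def io_rates(before, after, interval_seconds):
    """Convert two cumulative snapshots into per-second values."""
    reads = (after["reads"] - before["reads"]) / interval_seconds
    writes = (after["writes"] - before["writes"]) / interval_seconds
    busy_ms = after["io_ticks_ms"] - before["io_ticks_ms"]
    busy = 100.0 * busy_ms / (interval_seconds * 1000.0)
    return {
        "drive_reads": reads,
        "drive_writes": writes,
        "drive_io_percent_used": min(busy, 100.0),
    }


# Made-up snapshots taken one second apart.
before = parse_block_stat("100 0 800 50 200 0 1600 80 0 500 130")
after = parse_block_stat("150 0 1200 70 260 0 2080 100 0 1000 170")
rates = io_rates(before, after, 1.0)
print(rates)
```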

Network

Network statistics are collected by the check_network_bytes.py module, which gathers metrics for all installed interfaces by reading the /sys/class/net/{NIC}/statistics/rx_bytes file.

Provides

Name	Description	Type	Unit
bytes_rx	Amount of received bytes	rate	Bytes
bytes_tx	Amount of sent bytes	rate	Bytes
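
Because the kernel exposes these values as cumulative counters, a rate metric needs two readings. The sketch below is one possible way to compute it. The read_counter and byte_rate helpers are hypothetical, and reading a tx_bytes file next to rx_bytes is an assumption based on the standard sysfs layout.

```python
import os


def read_counter(nic, counter, root="/sys/class/net"):
    """Read one cumulative counter, e.g. rx_bytes (or tx_bytes, by assumption)."""
    path = os.path.join(root, nic, "statistics", counter)
    with open(path) as handle:
        return int(handle.read().strip())


def byte_rate(previous, current, interval_seconds):
    """Convert two readings of a cumulative byte counter into bytes per second."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    delta = current - previous
    if delta < 0:
        # Counter went backwards (reset or interface re-created): skip the sample.
        return None
    return delta / interval_seconds


print(byte_rate(1000, 6000, 5))  # 1000.0 bytes per second
```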

IP Conntrack

Enable this module only if you use connection tracking. This check makes sense if you run a router-like system, or if the nf_conntrack kernel module is loaded for some other reason. It reads the ip_conntrack_max and ip_conntrack_count files under /proc/sys/net/ipv4/netfilter/ and provides:

Provides

Name	Description	Type	Unit
conntrack_max	Maximum configured Conntrack value	gauge	None
conntrack_cur	Current IP Conntrack value	gauge	None
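
A minimal sketch of reading the two files is shown below. The read_conntrack helper and its base_dir parameter are hypothetical; returning None when the files are missing is one possible way to handle hosts without connection tracking, not necessarily what the module does.

```python
import os
import tempfile


def read_conntrack(base_dir="/proc/sys/net/ipv4/netfilter"):
    """Return conntrack_max and conntrack_cur, or None if the files are absent."""
    values = {}
    for metric, filename in (("conntrack_max", "ip_conntrack_max"),
                             ("conntrack_cur", "ip_conntrack_count")):
        path = os.path.join(base_dir, filename)
        try:
            with open(path) as handle:
                values[metric] = int(handle.read().strip())
        except OSError:
            return None  # connection tracking is not available on this host
    return values


# Demonstration against a temporary directory standing in for /proc.
with tempfile.TemporaryDirectory() as fake_proc:
    for filename, content in (("ip_conntrack_max", "65536\n"),
                              ("ip_conntrack_count", "1234\n")):
        with open(os.path.join(fake_proc, filename), "w") as handle:
            handle.write(content)
    sample = read_conntrack(fake_proc)

print(sample)
```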

Load Average

This is one of the most important metrics in Linux, perhaps the most important. The system load average shows the number of processes waiting in the queue for the CPU. It is calculated as 1, 5 and 15 minute averages. On most live systems it should stay below the total number of CPUs detected by the system. For example, if your server is equipped with two quad-core CPUs with Hyper-Threading enabled, Linux will see 16 CPUs, so the load average should be below 16. Our check provides:

Provides

Name	Description	Type	Unit
sys_load_1	System load average for last minute	gauge	None
sys_load_5	System load average for last 5 minutes	gauge	None
sys_load_15	System load average for last 15 minutes	gauge	None

Besides reporting these as regular metrics, the check also triggers a special WARNING when the value of sys_load_1 is more than 90% of the number of CPU cores, and an ERROR when it is equal to or more than 100%. This behavior can be changed by editing check_load_average.py and changing:

warn_level = 90
crit_level = 100

to the desired values. However, these values are quite reasonable, so use them without modification unless you are 100% sure that you need to change them.
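
The threshold rule described above can be sketched as follows. Only the warn_level and crit_level names come from the check itself; the load_status and current_status functions are hypothetical and written for illustration.

```python
import os

warn_level = 90
crit_level = 100


def load_status(load_1, cpu_count, warn=warn_level, crit=crit_level):
    """Classify the 1-minute load average relative to the number of CPUs."""
    percent = 100.0 * load_1 / cpu_count
    if percent >= crit:
        return "ERROR"
    if percent > warn:
        return "WARNING"
    return "OK"


def current_status():
    """Apply the rule to the live system (Unix only, not called here)."""
    return load_status(os.getloadavg()[0], os.cpu_count() or 1)


# The 16-CPU server from the example above.
print(load_status(8.0, 16))   # 50% of capacity: OK
print(load_status(15.0, 16))  # 93.75% of capacity: WARNING
print(load_status(16.0, 16))  # 100% of capacity: ERROR
```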

TCP Connections

This check reports the state of the system's TCP connections. It parses /proc/net/tcp and provides the following metrics:

Provides

Name	Description	Type	Unit
tcp_close	TCP connections with CLOSED state	gauge	None
tcp_close_wait	The remote side has been shut down and is now waiting for the socket to close	gauge	None
tcp_closing	TCP connections in closing state	gauge	None
tcp_established	The socket has a connection established	gauge	None
tcp_fin_wait1	The socket is closed, and the connection is now shutting down.	gauge	None
tcp_last_ack	TCP connections in last ack state	gauge	None
tcp_listen	TCP listening sockets	gauge	None
tcp_max_states	TCP connections in max state	gauge	None
tcp_new_syn_recv	TCP connections in new syn recv state	gauge	None
tcp_syn_recv	TCP connections in syn recv state	gauge	None
tcp_syn_sent	TCP connections in syn sent state	gauge	None
tcp_time_wait	TCP connections in time wait state	gauge	None

On very heavily loaded systems this check may become expensive. Please make sure it suits your system resources before enabling it on systems with 20k+ TCP ESTABLISHED connections.
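
For illustration, the sketch below counts connection states from /proc/net/tcp text. The state column holds a hexadecimal code that is mapped here using the kernel's TCP state numbering. The count_tcp_states helper and the sample rows are hypothetical, and the real check may name or group states differently. Since every connection is one line of the file, the cost of such parsing grows with the number of connections.

```python
from collections import Counter

# Kernel TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "established", "02": "syn_sent", "03": "syn_recv",
    "04": "fin_wait1", "05": "fin_wait2", "06": "time_wait",
    "07": "close", "08": "close_wait", "09": "last_ack",
    "0A": "listen", "0B": "closing", "0C": "new_syn_recv",
}


def count_tcp_states(text):
    """Count connections per state; the first line of the file is a header."""
    counts = Counter()
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        state = TCP_STATES.get(fields[3].upper())
        if state:
            counts["tcp_" + state] += 1
    return counts


# Shortened, made-up rows in the /proc/net/tcp layout.
SAMPLE = "\n".join([
    "  sl  local_address rem_address   st tx_queue rx_queue",
    "   0: 0100007F:0CEA 00000000:0000 0A 00000000:00000000",
    "   1: 0100007F:1F90 0100007F:C350 01 00000000:00000000",
    "   2: 0100007F:C350 0100007F:1F90 01 00000000:00000000",
    "   3: 0100007F:C352 0100007F:1F90 06 00000000:00000000",
])

counts = count_tcp_states(SAMPLE)
print(dict(counts))
```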

BTRFS check

The BTRFS check can be very useful in combination with the regular drive IO checks on systems that use the BTRFS file system. It monitors BTRFS volumes and checks for volume errors. It also contains a special check that sends manually defined ERROR and WARNING messages if the values of the checked parameters are above zero. This manual error handling can be enabled or disabled by setting the enable_special variable at the top of the script. It accepts True or False; the default is True.

Provides

btrfs_dev_{NAME}_corruption_errs
btrfs_dev_{NAME}_flush_io_errs
btrfs_dev_{NAME}_generation_errs
btrfs_dev_{NAME}_read_io_errs
btrfs_dev_{NAME}_write_io_errs
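
The text above does not say how the counters are obtained. As a sketch only, the example below assumes they come from text in the format printed by the btrfs device stats command, and that {NAME} is the device's base name. The parse_device_stats and failing helpers are hypothetical.

```python
import re

# Matches lines such as "[/dev/sda1].write_io_errs    0".
LINE = re.compile(r"^\[(?P<device>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text):
    """Build btrfs_dev_{NAME}_{counter} metrics from device stats output."""
    metrics = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        # Assumption: {NAME} is the last path component of the device.
        name = match.group("device").rsplit("/", 1)[-1]
        key = "btrfs_dev_%s_%s" % (name, match.group("counter"))
        metrics[key] = int(match.group("value"))
    return metrics


def failing(metrics):
    """Names of counters above zero, i.e. the ones that would raise an alert."""
    return sorted(name for name, value in metrics.items() if value > 0)


# Made-up sample output for one device.
SAMPLE = """[/dev/sda1].write_io_errs    0
[/dev/sda1].read_io_errs     2
[/dev/sda1].flush_io_errs    0
[/dev/sda1].corruption_errs  0
[/dev/sda1].generation_errs  0
"""

metrics = parse_device_stats(SAMPLE)
print(failing(metrics))
```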