Hadoop Monitoring with Puypuy

Puypuy uses Hadoop's native JMX-to-JSON HTTP interface to gather performance metrics from both the NameNode and DataNodes in an HDFS cluster.
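For reference, each /jmx endpoint returns a JSON document with a top-level "beans" array, one entry per exposed MBean. A minimal sketch of reading one NameNode value from it (assuming the NameNode web interface is reachable on 127.0.0.1:50070; the bean and attribute names follow the standard HDFS JMX layout and may vary by Hadoop version):

import json
import urllib.request

# Assumption: NameNode JMX servlet on 127.0.0.1:50070 -- adjust to your cluster.
JMX_URL = "http://127.0.0.1:50070/jmx"

with urllib.request.urlopen(JMX_URL, timeout=5) as response:
    beans = json.load(response)["beans"]

# Each bean is a dict: a "name" key plus one key per exposed attribute.
fsns = next(b for b in beans
            if b["name"] == "Hadoop:service=NameNode,name=FSNamesystem")
print("CapacityUsed:", fsns["CapacityUsed"])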

All Hadoop-related checks share a single configuration file: hadoop.ini.

Hadoop Overview


🟦 HDFS NameNode


πŸ”§ Installation

cd ${PUYPUY_HOME}/checks_enabled
ln -s ../checks_available/check_hadoop_namenode.py ./

In production environments, the NameNode typically listens on a non-loopback interface. Ensure you use the correct external IP address of the NameNode:

βš™οΈ Configuration

[Hadoop-NameNode]
jmx = http://${NAMENODE_IP}:50070/jmx
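For example, if the NameNode web interface is reachable at 10.0.0.5 (a placeholder address used purely for illustration), the stanza in hadoop.ini would read:

[Hadoop-NameNode]
jmx = http://10.0.0.5:50070/jmx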

πŸ”„ Restart Agent

${PUYPUY_HOME}/puypuy.sh restart

πŸ“Š Metrics Provided

| Name | Description | Type | Unit |
|---|---|---|---|
| namenode_addblockops | AddBlock operations per second | rate | OPS |
| namenode_blockcapacity | HDFS block capacity | gauge | None |
| namenode_blockstotal | HDFS total blocks | gauge | None |
| namenode_capacityremaining | HDFS remaining free space | gauge | Bytes |
| namenode_capacitytotal | HDFS total capacity | gauge | Bytes |
| namenode_capacityused | HDFS used space | gauge | Bytes |
| namenode_corruptblocks | HDFS corrupt blocks | gauge | None |
| namenode_createfileops | NameNode create-file operations | rate | OPS |
| namenode_deletefileops | NameNode delete-file operations | rate | OPS |
| namenode_fileinfoops | NameNode file information requests | rate | OPS |
| namenode_filesdeleted | NameNode deleted files | rate | OPS |
| namenode_filesrenamed | NameNode rename-file operations | rate | OPS |
| namenode_filestotal | Number of files in HDFS | gauge | None |
| namenode_getblocklocations | NameNode GetBlockLocations operations | rate | OPS |
| namenode_getlistingops | NameNode GetListing operations | rate | OPS |
| namenode_heap_committed | Java heap memory committed | gauge | Bytes |
| namenode_heap_init | Java heap memory init | gauge | Bytes |
| namenode_heap_max | Java heap memory max | gauge | Bytes |
| namenode_heap_used | Java heap memory used | gauge | Bytes |
| namenode_lastgc_duration | Duration of last garbage collection | gauge | Milliseconds |
| namenode_missingblocks | NameNode missing blocks | gauge | None |
| namenode_nondfsusedspace | Non-HDFS disk space usage | gauge | Bytes |
| namenode_nonheap_committed | Java non-heap memory committed | gauge | Bytes |
| namenode_nonheap_init | Java non-heap memory init | gauge | Bytes |
| namenode_nonheap_max | Java non-heap memory max | gauge | Bytes |
| namenode_nonheap_used | Java non-heap memory used | gauge | Bytes |
| namenode_numdeaddatanodes | Number of dead DataNodes in cluster | gauge | None |
| namenode_numdecomdeaddatanodes | Number of decommissioned dead DataNodes in cluster | gauge | None |
| namenode_numdecomlivedatanodes | Number of decommissioned live DataNodes in cluster | gauge | None |
| namenode_numdecommissioningdatanodes | Number of decommissioning DataNodes in cluster | gauge | None |
| namenode_numlivedatanodes | Number of live DataNodes in cluster | gauge | None |
| namenode_numstaledatanodes | Number of stale DataNodes in cluster | gauge | None |
| namenode_numstalestorages | Number of stale storages in cluster | gauge | None |
| namenode_openfiledescriptorcount | NameNode process open file descriptors count | gauge | None |
| namenode_pendingreplicationblocks | Blocks pending replication in HDFS | gauge | None |
| namenode_percentremaining | HDFS storage space remaining | gauge | Percent |
| namenode_receivedbytes | NameNode received bytes | gauge | Bytes |
| namenode_scheduledreplicationblocks | Blocks scheduled for replication | gauge | None |
| namenode_sentbytes | NameNode sent bytes | gauge | Bytes |
| namenode_transactionsnumops | NameNode transactions count | gauge | None |
| namenode_underreplicatedblocks | HDFS under-replicated blocks count | gauge | None |
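Metrics typed "rate" above are per-second values. A common way to derive them from Hadoop's cumulative JMX counters is the counter-delta approach sketched below; this is illustrative only, and how Puypuy computes rates internally may differ:

# Convert two samples of a monotonically increasing counter into an ops/second rate.
def counter_rate(prev_value, prev_time, curr_value, curr_time):
    elapsed = curr_time - prev_time
    if elapsed <= 0 or curr_value < prev_value:  # counter reset, e.g. after a NameNode restart
        return None
    return (curr_value - prev_value) / elapsed

# Example: CreateFileOps grows from 1200 to 1500 over a 60-second interval -> 5.0 OPS.
print(counter_rate(1200, 0, 1500, 60))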

🟩 HDFS DataNode

Most DataNode installations bind to 0.0.0.0:50075, so no custom configuration is typically required.

πŸ”§ Installation

cd ${PUYPUY_HOME}/checks_enabled
ln -s ../checks_available/check_hadoop_datanode.py ./

βš™οΈ Configuration

As noted above, the default 0.0.0.0:50075 binding means the stanza below usually works unchanged. If your DataNode binds to a specific address, replace 127.0.0.1 with that address:

[Hadoop-Datanode]
jmx = http://127.0.0.1:50075/jmx

πŸ”„ Restart Agent

${PUYPUY_HOME}/puypuy.sh restart

πŸ“Š Metrics Provided

| Name | Description | Type | Unit |
|---|---|---|---|
| datanode_bytesread | DataNode read bytes per second | rate | Bytes |
| datanode_byteswritten | DataNode write bytes per second | rate | Bytes |
| datanode_capacity | Disk space on current DataNode | gauge | Bytes |
| datanode_dfsused | Current DataNode's used disk space | gauge | Bytes |
| datanode_du_percent | Current DataNode's disk usage in percent | gauge | Percent |
| datanode_heap_committed | DataNode JVM heap committed | gauge | Bytes |
| datanode_heap_init | DataNode JVM heap init | gauge | Bytes |
| datanode_heap_max | DataNode JVM heap max | gauge | Bytes |
| datanode_heap_used | DataNode JVM heap used | gauge | Bytes |
| datanode_lastgc_duration | Duration of last garbage collection | gauge | Milliseconds |
| datanode_nonheap_committed | DataNode JVM non-heap committed | gauge | Bytes |
| datanode_nonheap_init | DataNode JVM non-heap init | gauge | Bytes |
| datanode_nonheap_max | DataNode JVM non-heap max | gauge | Bytes |
| datanode_nonheap_used | DataNode JVM non-heap used | gauge | Bytes |
| datanode_openfiles | DataNode daemon's open file descriptors count | gauge | None |
| datanode_space_remaining | Disk space remaining on current DataNode | gauge | Bytes |
| datanode_totalreadtime | Read operations time on current DataNode | rate | Milliseconds |
| datanode_totalwritetime | Write operations time on current DataNode | rate | Milliseconds |
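The capacity-related DataNode metrics are related in the obvious way. A small worked example with illustrative numbers (the check may account for reserved or non-DFS space differently):

# Illustrative relationship between the DataNode capacity metrics (example numbers only).
capacity   = 4 * 1024**4                   # datanode_capacity: 4 TiB available to HDFS
dfs_used   = 1 * 1024**4                   # datanode_dfsused: 1 TiB of block data
remaining  = capacity - dfs_used           # ~ datanode_space_remaining (ignoring non-DFS usage)
du_percent = 100.0 * dfs_used / capacity   # ~ datanode_du_percent -> 25.0
print(remaining, du_percent)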

πŸ’‘ Best Practices

  • πŸ” Security: Always restrict ՝/jmx՝ to internal networks or use firewalls.
  • πŸ“ˆ Dashboards: Visualize metrics in Grafana or similar for better observability.
  • 🚨 Troubleshooting: Watch for high GC times, under-replication, or dead DataNodes as early warnings of cluster instability.
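As a starting point for the troubleshooting advice above, a sketch of simple threshold checks built on the NameNode metrics from this page; the thresholds are illustrative examples, not Puypuy defaults:

# Illustrative alert conditions on the NameNode metrics above; tune thresholds to your cluster.
def cluster_warnings(metrics):
    warnings = []
    if metrics.get("namenode_numdeaddatanodes", 0) > 0:
        warnings.append("dead DataNodes present")
    if metrics.get("namenode_underreplicatedblocks", 0) > 0:
        warnings.append("under-replicated blocks")
    if metrics.get("namenode_missingblocks", 0) > 0:
        warnings.append("missing blocks")
    if metrics.get("namenode_lastgc_duration", 0) > 1000:  # GC pauses longer than 1 second
        warnings.append("long NameNode GC pauses")
    return warnings

print(cluster_warnings({"namenode_numdeaddatanodes": 1, "namenode_lastgc_duration": 2500}))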

πŸ“ˆ Example Grafana Dashboard

(Screenshot: example Grafana dashboard built from these metrics)