Hadoop Monitoring with Puypuy
Puypuy uses Hadoop's native JMX-to-JSON HTTP interface to gather performance metrics from both the NameNode and DataNodes in an HDFS cluster.
All Hadoop-related checks share a single configuration file: hadoop.ini
.
π¦ HDFS NameNode
π§ Installation
cd ${PUYPUY_HOME}/checks_enabled
ln -s ../checks_available/check_hadoop_namenode.py ./
In production environments, the NameNode typically listens on a non-loopback interface. Ensure you use the correct external IP address of the NameNode:
βοΈ Configuration
[Hadoop-NameNode]
jmx = http://${NAMENODE_IP}:50070/jmx
π Restart Agent
${PUYPUY_HOME}/puypuy.sh restart
π Metrics Provided
Name | Description | Type | Unit |
---|---|---|---|
namenode_addblockops | Adblock operations per second | rate | OPS |
namenode_blockcapacity | HDFS Block Capacity | gauge | None |
namenode_blockstotal | HDFS Total blocks | gauge | None |
namenode_capacityremaining | HDFS remaining free space | gauge | Bytes |
namenode_capacitytotal | HDFS Total capacity | gauge | Bytes |
namenode_capacityused | HDFS Used space | gauge | Bytes |
namenode_corruptblocks | HDFS corrupt blocks | gauge | None |
namenode_createfileops | NameNode Create file operations on nameNode | rate | OPS |
namenode_deletefileops | NameNode Delete file operations on nameNode | rate | OPS |
namenode_fileinfoops | NameNode File information requests | rate | OPS |
namenode_filesdeleted | NameNode deleted files | rate | OPS |
namenode_filesrenamed | NameNode rename file operations | rate | OPS |
namenode_filestotal | Amount of files in HDFS | gauge | None |
namenode_getblocklocations | NameNode Get Block Location operations | rate | OPS |
namenode_getlistingops | NameNode Get LIsting operations | rate | OPS |
namenode_heap_committed | Java Heap memory committed | gauge | Bytes |
namenode_heap_init | Java Heap memory init | gauge | Bytes |
namenode_heap_max | Java Heap memory max | gauge | Bytes |
namenode_heap_used | Java Heap memory used | gauge | Bytes |
namenode_lastgc_duration | Last garbage collections duration | gauge | Milliseconds |
namenode_missingblocks | NameNode missing blocks | gauge | None |
namenode_nondfsusedspace | Non HDFS disk space usage | gauge | Bytes |
namenode_nonheap_committed | Java Non Heap memory committed | gauge | Bytes |
namenode_nonheap_init | Java Non Heap memory init | gauge | Bytes |
namenode_nonheap_max | Java Non Heap memory max | gauge | Bytes |
namenode_nonheap_used | Java Non Heap memory used | gauge | Bytes |
namenode_numdeaddatanodes | Number of dead DataNodes in cluster | gauge | None |
namenode_numdecomdeaddatanodes | Number of decommissioned dead DataNodes in cluster | gauge | None |
namenode_numdecomlivedatanodes | Number of decommissioned live DataNodes in cluster | gauge | None |
namenode_numdecommissioningdatanodes | Number of decommissioning DataNodes in cluster | gauge | None |
namenode_numlivedatanodes | Number of live DataNodes in cluster | gauge | None |
namenode_numstaledatanodes | Number of stale DataNodes in cluster | gauge | None |
namenode_numstalestorages | Number of staled Storages in cluster | gauge | None |
namenode_openfiledescriptorcount | NaeNode process open files descriptors count | gauge | None |
namenode_pendingreplicationblocks | Amount of pending for replication blocks in HDFS | gauge | None |
namenode_percentremaining | HDFS Storage space remaining | gauge | Percent |
namenode_receivedbytes | Namenode received bytes | gauge | Bytes |
namenode_scheduledreplicationblocks | Amount of blocks scheduled for replication | gauge | None |
namenode_sentbytes | NameNode sent bytes | gauge | Bytes |
namenode_transactionsnumops | NameNode transactions count | gauge | None |
namenode_underreplicatedblocks | HDFS under replicated blocks count | gauge | None |
π© HDFS DataNode
Most DataNode installations bind to 0.0.0.0:50075, so no custom configuration is typically required.
π§ Installation
cd ${PUYPUY_HOME}/checks_enabled
ln -s ../checks_available/check_hadoop_datanode.py ./
βοΈ Configuration
Usually HDFS DataNode binds on 0.0.0.0:50075, so no extra configuration is needed.
If you have specific case, please make sure to change 127.0.0.1 to IP address matching you DataNode bind address.
[Hadoop-Datanode]
jmx: http://127.0.0.1:50075/jmx
π Restart Agent
${PUYPUY_HOME}/puypuy.sh restart
Provides
Name | Description | Type | Unit |
---|---|---|---|
datanode_bytesread | DataNode read bytes per second | rate | Bytes |
datanode_byteswritten | DataNode write bytes per second | rate | Bytes |
datanode_capacity | Disk space on current DataNode | gauge | Bytes |
datanode_dfsused | Current DataNodeβs used disk space | gauge | Bytes |
datanode_du_percent | Current DataNodes disk usage in percents | gauge | Percent |
datanode_heap_committed | DataNode JVM heap committed | gauge | Bytes |
datanode_heap_init | DataNode JVM heap init | gauge | Bytes |
datanode_heap_max | DataNode JVM Heap max | gauge | Bytes |
datanode_heap_used | DataNode JVM heap used | gauge | Bytes |
datanode_lastgc_duration | Duration of last garbage collection | gauge | Milliseconds |
datanode_nonheap_committed | DataNode JVM non heap committed | gauge | Bytes |
datanode_nonheap_init | DataNode JVM non Heap init | gauge | Bytes |
datanode_nonheap_max | DataNode JVM non heap max | gauge | Bytes |
datanode_nonheap_used | Datanode JVM non Heap used | gauge | Bytes |
datanode_openfiles | Datanode daemonβs open files descriptors count | gauge | None |
datanode_space_remaining | Disk space remaining on current DataNode | gauge | Bytes |
datanode_totalreadtime | Read operations time on current DataNode | rate | Milliseconds |
datanode_totalwritetime | Write operations time on current DataNode | rate | Milliseconds |
π‘ Best Practices
- π Security: Always restrict Υ/jmxΥ to internal networks or use firewalls.
- π Dashboards: Visualize metrics in Grafana or similar for better observability.
- π¨ Troubleshooting: Watch for high GC times, under-replication, or dead DataNodes as early warnings of cluster instability.
π Example Grafana Dashboard