One of the most important components in our network of servers is our Apache Zookeeper server farm. Although Zookeeper tends to just work on its own without much handholding, it's still important for us to monitor its health and the overall activity on the servers as we grow our infrastructure.
On all of our servers we use Collectd as a monitoring system. It interacts closely with our cloud management provider (RightScale), and it also lets us duplicate metrics to systems like Graphite or Librato. One of our favorite features of Collectd is how simple it is to implement custom monitoring plugins for almost any new piece of technology we use.
For Zookeeper, we wrote a simple plugin that lets us graph a number of different data points from our individual Zookeeper nodes. The plugin runs in a simple loop, gathering live statistics from Zookeeper's 'four letter words' management commands and writing them out in the Collectd text protocol:
PUTVAL "localhost/zookeeper/gauge-connections" interval=5 N:52
PUTVAL "localhost/zookeeper/gauge-outstanding-requests" interval=5 N:0
PUTVAL "localhost/zookeeper/gauge-nodes" interval=5 N:516
PUTVAL "localhost/zookeeper/gauge-latency-min" interval=5 N:0
PUTVAL "localhost/zookeeper/gauge-latency-avg" interval=5 N:0
PUTVAL "localhost/zookeeper/gauge-latency-max" interval=5 N:798
PUTVAL "localhost/zookeeper/if_packets-traffic" interval=5 1368031744:104287487:104273708
PUTVAL "localhost/zookeeper/gauge-local-watches-total" interval=5 N:96
PUTVAL "localhost/zookeeper/gauge-local-watches-unique-paths" interval=5 N:22
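To make the loop described above more concrete, here is a minimal Python sketch of the same idea. It is not the code of our plugin: it assumes the "mntr" four letter word (which must be available and enabled on the Zookeeper server), the default client port 2181, and a hypothetical mapping from mntr keys to the gauge names shown above. The function names and the GAUGES table are illustrative only.

```python
import socket
import time

# Hypothetical mapping from "mntr" keys to the gauge names used in the
# sample output above; the real plugin may use other commands and names.
GAUGES = {
    "zk_num_alive_connections": "connections",
    "zk_outstanding_requests": "outstanding-requests",
    "zk_znode_count": "nodes",
    "zk_min_latency": "latency-min",
    "zk_avg_latency": "latency-avg",
    "zk_max_latency": "latency-max",
    "zk_watch_count": "local-watches-total",
}


def four_letter_word(cmd, host="localhost", port=2181, timeout=5.0):
    # Send a four letter command and read until the server closes the socket.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    # "mntr" replies with one tab-separated key/value pair per line.
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue
        try:
            stats[key.strip()] = float(value)
        except ValueError:
            continue  # skip non-numeric values such as the version string
    return stats


def format_putval(stats, hostname="localhost", interval=5):
    # Render the known gauges as Collectd text protocol PUTVAL lines.
    lines = []
    for key, name in GAUGES.items():
        if key not in stats:
            continue
        value = stats[key]
        if value.is_integer():
            value = int(value)
        lines.append(
            'PUTVAL "%s/zookeeper/gauge-%s" interval=%d N:%s'
            % (hostname, name, interval, value)
        )
    return lines


def run_forever(interval=5):
    # Poll Zookeeper and print one batch of PUTVAL lines per interval.
    while True:
        for line in format_putval(parse_mntr(four_letter_word("mntr")), interval=interval):
            print(line, flush=True)
        time.sleep(interval)
```

Calling run_forever() from a script started by Collectd would emit a fresh batch of lines every five seconds. Keeping the parsing and formatting separate from the socket code makes both easy to check against a captured mntr reply without a live server.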
With the monitor in place, we're able to keep track of most of the important runtime stats for the Zookeeper service very easily. It's important to keep in mind that these are internal application stats, not overall system or Java monitoring stats. Those can be monitored separately using other Collectd plugins.
The code for this plugin is available in our GitHub repository.
Sr. Systems Architect