Nextdoor Service Registry

A few months back we released a small but integral piece of our infrastructure called zk_watcher. This tool, in combination with Apache ZooKeeper, allows servers to dynamically and centrally register the services they provide at boot time. Client applications are then able to request these server lists dynamically from ZooKeeper upon startup.

Since then we’ve only increased our usage of ZooKeeper throughout our software and infrastructure stack for dynamic configuration in the cloud. We register our Memcached, PostgreSQL, ElasticSearch, RabbitMQ, and even backend Syslog servers all with the simple zk_watcher tool.

Now we’d like to introduce the foundational code that zk_watcher is based on, which our Django application uses for service discovery: The Nextdoor Service Registry module (nd_service_registry).

Introducing nd_service_registry

The Nextdoor Service Registry Python module is the core piece of code that we use for interacting with ZooKeeper from inside of zk_watcher and Django. The goal of this package is to provide simple, common methods for registering and querying for services in ZooKeeper while handling as many potential failure scenarios as possible.

Why are we creating another module when there are already client modules like Kazoo, zc.zk, and Pookeeper out there? All of those modules are relatively low-level and put the onus of failure handling, existing node conflicts, ACLs, etc., on you as the developer. Additionally, each of them stores data in ZooKeeper in slightly different ways, making the behavior from one to the next a bit unpredictable.

Major Feature Discussion

Simplicity… it Sells!
One of the core features of the nd_service_registry module is that it’s simple to use. You merely need to have ZooKeeper running somewhere, and within just a couple of lines of code you can create service registrations that are ephemeral in ZooKeeper yet are kept in place for as long as your process is running.

>>> import nd_service_registry
>>> nd = nd_service_registry.KazooServiceRegistry()
>>> nd.set_node('/services/ssh/server1:22', data={'foo': 'bar'})

This example code creates all of the objects necessary to register the path /services/ssh/server1:22, monitor the backend ZooKeeper connection, handle any connection failure that might happen, and re-register in the event of these failures. You don’t have to do anything else to maintain the registration.

Retrieving a list of servers from ZooKeeper is just as simple:

>>> import nd_service_registry
>>> nd = nd_service_registry.KazooServiceRegistry()
>>> nd.get('/services/ssh')

{'children': {u'server1:22': {u'foo': u'bar',
                              u'created': u'2012-12-15 00:45:03',
                              u'pid': 10733}},
 'data': None,
 'path': '/services/ssh',
 'stat': ZnodeStat(czxid=6, mzxid=6, ctime=1355532303688, mtime=1355532303688,
                   version=0, cversion=1, aversion=0, ephemeralOwner=0,
                   dataLength=0, numChildren=1, pzxid=7)}
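If your application wants plain host and port pairs rather than the full dictionary, the result can be reduced in a few lines. The sketch below is only an illustration: parse_servers is a hypothetical helper (not part of nd_service_registry), and it assumes child names follow the host:port convention used in the example above.

```python
def parse_servers(result):
    """Reduce an nd.get()-style result dict to sorted (host, port) tuples.

    Hypothetical helper: assumes each child is named 'host:port'.
    Children that do not match that pattern are skipped.
    """
    servers = []
    for name in result.get('children') or {}:
        host, sep, port = name.rpartition(':')
        if not sep or not host or not port.isdigit():
            continue
        servers.append((host, int(port)))
    return sorted(servers)


# Sample shaped like the output shown above ('stat' omitted for brevity).
sample = {
    'path': '/services/ssh',
    'data': None,
    'children': {
        'server1:22': {'foo': 'bar', 'created': '2012-12-15 00:45:03', 'pid': 10733},
    },
}
print(parse_servers(sample))
```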

The best part of this code is that you can tell the ServiceRegistry object to call one of your methods any time this list changes, providing a simple way for your application to always stay up to date. Here’s another simple example.

>>> import nd_service_registry
>>> nd = nd_service_registry.KazooServiceRegistry()
>>> def print_list(nodes):
...     print("Found (%d) new servers..." % len(nodes['children']))
...
>>> nd.get('/services/ssh', callback=print_list)

Found (1) new servers...
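In a real application the callback usually does more than print. One possible pattern is a small holder object whose update method is passed as the callback, so the rest of the app can always read the latest list. This is a sketch under assumptions: ServerList is a hypothetical class, and the lock is there only in case callbacks arrive on a different thread than the one reading the list.

```python
import threading


class ServerList(object):
    """Holds the most recent server names for one registry path (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._servers = []

    def update(self, nodes):
        # Meant to be passed as callback=... ; 'nodes' has the same shape
        # as the dictionary returned by nd.get().
        names = sorted(nodes.get('children') or {})
        with self._lock:
            self._servers = names

    def snapshot(self):
        # Return a copy so callers cannot mutate the shared list.
        with self._lock:
            return list(self._servers)


ssh_servers = ServerList()
# With a live registry: nd.get('/services/ssh', callback=ssh_servers.update)
# Here the callback is invoked by hand with sample data instead.
ssh_servers.update({'children': {'server2:22': {}, 'server1:22': {}}})
print(ssh_servers.snapshot())
```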

Reliable Failure Handling
The other feature of this module is that it handles most connection failures gracefully and in a sane way. The two main scenarios that the module handles are initial startup connection failure and connection failures after the app has been running for a while.

In the event that the ZooKeeper farm is not available during the initial startup phase of your app, the ServiceRegistry object can supply results for the nd.get() calls from local cache files that it keeps up to date with your server lists. This means that as long as your app has worked at one point, it will use the “last known” list of servers for each path you request. In the meantime, the object will continue to try to connect to ZooKeeper in the background. As soon as the connection is restored, it will grab the latest server lists from ZooKeeper and update all of the objects as necessary, including triggering any of your registered callback functions.
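The “last known” idea can be illustrated with a few lines of standard-library Python. This is not the module's actual implementation, just a sketch of the pattern: get_with_cache, the JSON file format, and the fetch functions are all assumptions made for the example.

```python
import json
import os
import tempfile


def load_cache(cache_file):
    """Read the path -> result mapping from disk, or {} if none exists yet."""
    if not os.path.exists(cache_file):
        return {}
    with open(cache_file) as handle:
        return json.load(handle)


def get_with_cache(fetch, path, cache_file):
    """Return fetch(path); on failure, fall back to the last cached result."""
    cache = load_cache(cache_file)
    try:
        result = fetch(path)
    except Exception:
        if path in cache:
            return cache[path]  # serve the "last known" data
        raise  # nothing cached for this path, so surface the failure
    cache[path] = result
    with open(cache_file, 'w') as handle:
        json.dump(cache, handle)
    return result


def live_fetch(path):
    # Stand-in for a successful ZooKeeper lookup.
    return {'path': path, 'children': {'server1:22': {'foo': 'bar'}}}


def failing_fetch(path):
    # Stand-in for a lookup while ZooKeeper is unreachable.
    raise IOError('ZooKeeper unreachable')


cache_file = os.path.join(tempfile.mkdtemp(), 'registry_cache.json')
first = get_with_cache(live_fetch, '/services/ssh', cache_file)
second = get_with_cache(failing_fetch, '/services/ssh', cache_file)
print(second == first)
```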

Once your ServiceRegistry object is initialized and connected to ZooKeeper, it will also pay attention to any connection failures throughout its lifetime. If the connection fails, it will continue to retry the connection in the background until the connection can be recovered. Once the new connection is made, any path registrations will be re-registered immediately. Again, no work is required on your part for this to happen. It’s automagic.
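To make the re-registration idea concrete, here is a simplified sketch of the bookkeeping involved. It is an illustration only, not the module's code: Registrar and FakeClient are hypothetical, and FakeClient merely imitates the way ephemeral nodes disappear when a ZooKeeper session ends.

```python
class FakeClient(object):
    """Stand-in for a ZooKeeper session (hypothetical, for illustration)."""

    def __init__(self):
        self.nodes = {}

    def create(self, path, data):
        self.nodes[path] = data

    def expire_session(self):
        # Ephemeral nodes vanish when the session that created them ends.
        self.nodes.clear()


class Registrar(object):
    """Remembers every registration so it can be replayed after a reconnect."""

    def __init__(self, client):
        self._client = client
        self._registrations = {}

    def set_node(self, path, data):
        self._registrations[path] = data
        self._client.create(path, data)

    def on_reconnect(self):
        # Re-create every remembered node on the new session.
        for path, data in sorted(self._registrations.items()):
            self._client.create(path, data)


client = FakeClient()
registrar = Registrar(client)
registrar.set_node('/services/ssh/server1:22', {'foo': 'bar'})

client.expire_session()   # connection lost: the registration is gone
registrar.on_reconnect()  # connection restored: the registration is replayed
print(client.nodes)
```

Keeping the list of registrations outside the session is what allows the caller to stay unaware of the outage.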

Final Thoughts…

This Nextdoor Service Registry module is just the beginning of what’s possible in dynamic configuration management. The code is currently in use by almost every server we have running in the cloud, and has been working very well.  Future improvements to the code will include CLI tools, backup and restoration scripts, and a Global Lock handling system.

We’d like to thank the Kazoo developers for building a great, robust ZooKeeper module in pure Python. Special thanks go out to Hanno Schlichting (hannosch) and Ben Bangert (bbangert) for helping us with some design tweaks and Kazoo issues.

The code is available on GitHub.  We welcome your feedback and contributions!

Matt Wise
Sr. Systems Architect