Sitevars¹ at Nextdoor — How engineers quickly push configuration changes in production

Luiz Scheidegger
Nextdoor Engineering
Apr 28, 2020 · 7 min read


Nextdoor engineers are always looking for ways to move faster. Making changes quickly and safely allows us to deliver more value to our members in neighborhoods around the globe, as well as to minimize the impact of bugs and disruptions we may come across. In this post, we’re going to share our experience building Sitevars, an internal system at Nextdoor which empowers engineers to push configuration payloads to all our servers across multiple regions in seconds.

Before Sitevars, engineers at Nextdoor had to go through a full write-commit-deploy cycle to update configuration values. This meant that the iteration cycle rarely took less than an hour — too long for fast product development.

The problem

As we iterate on our product, we come across many areas where features can benefit from having an easily-editable configuration payload with domain-specific parameters. Some examples of this include:

  • ML parameters — we use machine learning models extensively at Nextdoor to help moderate content and improve our members’ experience. As we iterate on these models, it’s very helpful to be able to quickly experiment with different threshold values for classifiers, as well as weights for model features, etc.
  • Global kill switches — we use circuit-breaker style switches to protect new features in production. If anything goes wrong with a new feature, we can quickly turn it off before it causes problems for our members.
  • Logging configuration — we use Sitevars to quickly update dynamic logging levels for different parts of our codebase without performing a new deploy to production, or even making code changes.

These examples all share a common need for a system that can propagate configuration changes — typically small JSON payloads — to our server fleet quickly and safely. When an engineer updates a configuration, that change should propagate to our servers across the world within seconds. We rely heavily on monitoring and automated alerts to notify engineers when a Sitevar change breaks something, so it can be quickly reverted to a healthy value.

In addition, the Sitevars service needs to be robust — a transient failure shouldn’t prevent a request from being served successfully. Finally, Sitevars payloads must be cheap to read. Some of our requests access dozens of Sitevars, so reads should take no more than a few microseconds.

Sitevars: fast-propagating, versioned JSON

Sitevars consist of two main parts: a Go service which manages the backing store for Sitevars payloads, and a set of client APIs in our Django application which handle service failure and local caching for improved performance. In addition, we built an internal tool that gives engineers a simple UI to create, search, and edit any Sitevars payload.

The Sitevars Service

The service component of Sitevars is a Go application which provides an API for callers to create, update, and fetch Sitevars payloads. Each payload contains a small piece of JSON (we currently limit this to 16KB in size), as well as common metadata fields — last author, update time, version, etc. Sitevars objects are stored in a globally-replicated DynamoDB table. The schema for this table is very simple — its hash key consists of the unique name of a configuration, and its range key is a version number:

The DynamoDB schema for Sitevars consists of a namespaced hash key, a version range key, and a JSON payload for the Sitevars value.

When a Sitevars object is updated, we insert a new row into the table, with the latest value of the payload and a new version number. This ensures that updates are non-destructive, and that reverting to a known-good configuration in case of problems is trivial.
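The versioned-row semantics can be sketched with an in-memory stand-in (the real backing store is a globally replicated DynamoDB table, and all names below are illustrative):

```python
# Minimal sketch of append-only, versioned configuration storage.
# Updates never overwrite: each write adds a new version row, so any
# previous version remains available as a revert target.
class SitevarStore:
    def __init__(self):
        # name -> list of payloads; index i holds version i + 1
        self._rows = {}

    def update(self, name, payload):
        """Append a new version instead of mutating the old one."""
        versions = self._rows.setdefault(name, [])
        versions.append(payload)
        return len(versions)  # the new version number

    def fetch(self, name, version=None):
        """Fetch the latest version by default, or a specific one to revert."""
        versions = self._rows[name]
        if version is None:
            version = len(versions)
        return versions[version - 1]

store = SitevarStore()
store.update("feeds.ranking", {"weight": 0.5})
store.update("feeds.ranking", {"weight": 0.7})
assert store.fetch("feeds.ranking") == {"weight": 0.7}             # latest
assert store.fetch("feeds.ranking", version=1) == {"weight": 0.5}  # revert target
```

In DynamoDB terms, `update` corresponds to a `PutItem` with the next version as the range key, and `fetch` of the latest version to a `Query` on the hash key sorted by version descending with a limit of one.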

Single-row fetches from DynamoDB typically take a few milliseconds to complete. While that cost isn’t too high for a single Sitevar, many of our endpoints fetch dozens of configurations, so minimizing this latency is critical. To accomplish this, the Sitevars service keeps an in-memory cache with the latest version of each Sitevar. As the working set is relatively small (each Sitevar is limited to 16KB, and we have a few hundred configs to date), the service can easily hold all Sitevars in memory. Because of this cache, the majority of fetches never make a roundtrip to DynamoDB. Another advantage of a small working set is that it allows us to trivially refresh the entire cache at a set interval. At the moment, this is done every 60 seconds.
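The full-cache refresh strategy looks roughly like the following sketch (the real service is written in Go and refreshes every 60 seconds; the interface here is hypothetical):

```python
import time

# Sketch of a cache that holds the entire Sitevars working set in memory
# and re-fetches all of it once a fixed interval has elapsed. Because the
# working set is small, refreshing everything at once is cheap.
class SitevarCache:
    def __init__(self, fetch_all, refresh_interval=60.0):
        self._fetch_all = fetch_all          # callable returning {name: payload}
        self._interval = refresh_interval
        self._cache = fetch_all()
        self._last_refresh = time.monotonic()

    def get(self, name):
        # Refresh the whole working set when the interval has elapsed;
        # every other read is served straight from memory.
        if time.monotonic() - self._last_refresh >= self._interval:
            self._cache = self._fetch_all()
            self._last_refresh = time.monotonic()
        return self._cache.get(name)

    def invalidate(self, name, payload):
        # Hook for the pub-sub path: replace one entry immediately
        # instead of waiting for the next full refresh.
        self._cache[name] = payload

backing = {"search.flag": {"enabled": True}}
cache = SitevarCache(lambda: dict(backing), refresh_interval=60.0)
assert cache.get("search.flag") == {"enabled": True}
```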

Finally, we use a Redis pub-sub channel to notify the service when a Sitevar is updated, allowing it to more quickly invalidate a single entry in the cache. This serves a key iteration requirement — it allows engineers to observe their change in the product almost immediately.

The Sitevars service manages an in-memory cache with a short TTL, such that almost no Sitevar fetches require a roundtrip to DynamoDB. The service also publishes Sitevars updates to a Redis pub-sub channel to quickly propagate value changes.
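The invalidation path can be illustrated with an in-process stand-in for the pub-sub channel (the real system uses Redis pub-sub; the channel API and names below are made up for the sketch):

```python
# In-process stand-in for the Redis pub-sub invalidation path: the
# service subscribes once, and each published update replaces a single
# cache entry immediately rather than waiting for the periodic refresh.
class UpdateChannel:
    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)

cache = {"feeds.ranking": {"weight": 0.5}}
channel = UpdateChannel()

def invalidate(msg):
    # Update the single affected entry as soon as the change lands.
    cache[msg["name"]] = msg["payload"]

channel.subscribe(invalidate)
channel.publish({"name": "feeds.ranking", "payload": {"weight": 0.7}})
assert cache["feeds.ranking"] == {"weight": 0.7}
```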

Fast access to a Sitevars payload in the service is only half of the equation to ensure Sitevars fetches are efficient. Communication between our Django containers and the Sitevars service must also be as fast as possible. We address this in two ways: we deploy the Sitevars container as a sidecar to our Django application, and we use gRPC as a transport mechanism. Deploying the container as a sidecar ensures that calls between Django and Sitevars never leave a single host, and using gRPC (instead of, e.g., JSON over HTTP) reduces the p50 latency for requests from about 3–5ms to about 800µs. Since we built the gRPC server using grpc-gateway, that change was trivial to implement. We were quite surprised to find such a big performance improvement!

Deploying the Sitevars container as a sidecar (in the same host) along with the Django container ensures that the RPC between the two never leaves a single host.

Client-side APIs

The second main component of Sitevars is the set of APIs for developers in our Django codebase. These APIs provide a convenient, type-safe way to access Sitevars in our web application. One of their key features is that they require developers to provide, in code, a default value for their Sitevars. This serves two purposes: it allows engineers to write and commit code ahead of creating a specific Sitevar, and it provides a last-resort fallback in case the Sitevars service goes down for any reason.
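The shape of that API is roughly the following (class and method names are hypothetical, not the actual Nextdoor API):

```python
# Sketch of the client-side API: callers declare a Sitevar together with
# an in-code default, and the getter falls back to that default if the
# service is unreachable.
class Sitevar:
    def __init__(self, name, default, client):
        self.name = name
        self.default = default
        self._client = client

    def get(self):
        try:
            return self._client.fetch(self.name)
        except Exception:
            # Last-resort fallback: the code can ship before the Sitevar
            # exists, and keeps working if the service is down.
            return self.default

class DownClient:
    """Simulates a Sitevars service outage."""
    def fetch(self, name):
        raise ConnectionError("sitevars service unreachable")

flag = Sitevar("search.new_ranker_enabled", default=False, client=DownClient())
assert flag.get() is False
```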

However, using the last-resort code fallback for a Sitevar can be risky, especially as it becomes stale for configurations that have been heavily edited over time. Even if the API returns the default successfully, there’s a large chance the calling code is no longer factored to handle the default value without crashing. To mitigate this scenario, we maintain hourly S3 snapshots of all Sitevars, and bake them into our Django container. This allows us to ensure that, in the event the Sitevars service goes down, our web application can still use Sitevars that are no more than about one hour old.
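Putting the layers together gives a fallback chain like this sketch (the snapshot path and function names are illustrative): try the service, then the hourly snapshot baked into the container, then the in-code default as a true last resort.

```python
import json

# Sketch of the layered fallback for reading a Sitevar.
def fetch_sitevar(name, default, client, snapshot_path="/etc/sitevars/snapshot.json"):
    # 1. Happy path: ask the Sitevars service.
    try:
        return client.fetch(name)
    except Exception:
        pass
    # 2. Service is down: fall back to the baked-in hourly snapshot,
    #    which is at most about one hour stale.
    try:
        with open(snapshot_path) as f:
            snapshot = json.load(f)
        if name in snapshot:
            return snapshot[name]
    except OSError:
        pass
    # 3. True last resort: the in-code default.
    return default

class DownClient:
    def fetch(self, name):
        raise ConnectionError("unreachable")

# With no snapshot baked in at this path, we land on the in-code default.
assert fetch_sitevar("feature.flag", default=False, client=DownClient(),
                     snapshot_path="/nonexistent/snapshot.json") is False
```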

When discussing the Sitevars service above, we described a caching and transport strategy that brought the cost of fetching a configuration down to just under a millisecond. However, we have one more trick up our sleeve to make this number even smaller: we maintain a request-scoped cache of fetched Sitevars in our web application. This means that no Sitevar payload is fetched into Django more than once per request; any subsequent fetch of the same configuration is only a Python dictionary access away, at the cost of a few microseconds. This is especially useful for configurations that are fetched frequently, such as those that drive core pieces of our web infrastructure. With all of these strategies combined, the latency of fetching Sitevars follows a bimodal distribution: about half of all configuration fetches take less than 100µs (when they hit the per-request cache), while the other half take between 500µs and 800µs (when they require an RPC to the Sitevars service).
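The request-scoped cache amounts to memoizing fetches for the lifetime of one request, roughly like this sketch (structure and names hypothetical):

```python
# Sketch of a request-scoped Sitevars cache: the first fetch within a
# request pays the RPC; every repeat is just a dictionary lookup.
class RequestScopedSitevars:
    def __init__(self, client):
        self._client = client
        self._cache = {}  # lives only as long as one request

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._client.fetch(name)  # RPC, sub-millisecond
        return self._cache[name]  # repeat hits: a few microseconds

class CountingClient:
    """Records how many RPCs were actually made."""
    def __init__(self, values):
        self.values = values
        self.calls = 0

    def fetch(self, name):
        self.calls += 1
        return self.values[name]

svc = CountingClient({"logging.level": "INFO"})
per_request = RequestScopedSitevars(svc)
per_request.get("logging.level")
per_request.get("logging.level")
assert svc.calls == 1  # the second read never left the process
```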

Lessons Learned

As we developed and shipped Sitevars, a few important lessons stood out. First, simplicity is a key factor in the success of an internal tool. Even though Sitevars is a service aimed at engineers with a good understanding of our tech stack, its focus is still on a simple, predictable API and an easy-to-use editing tool. This has encouraged engineers across Nextdoor to use Sitevars in many more places than we had originally envisioned. Second, DynamoDB global replication automatically ensures that Sitevars values are consistent across all AWS regions in which we operate. Finally, we were genuinely surprised by the latency difference between HTTP+JSON and gRPC: we expected performance benefits from that change, but were delighted by how significant the gains were.

Going forward, we plan to expand Sitevars functionality beyond our Django application, to support it across our other services as well.

Conclusion

Sitevars are deployed at scale at Nextdoor, where they help power and customize many different features. With the caching and transport architecture we shared here, we brought down the cost of fetching a single configuration from 5–7ms to less than 100µs. Sitevars serve close to 100k QPS during peak times, without impacting site performance or stability. If working on practical, performant infrastructure like this gets you excited, we’re hiring — come join us!

[1] The name “Sitevars”, as well as the inspiration for the benefits of this system, come from previous work in the industry. If you’re curious to learn more, check out this great paper.
