3 Hard Lessons from Scaling Continuous Deployment to a Monolith with 70+ Engineers

Like many successful tech startups (Instagram, Slack, New Relic, and many more), Nextdoor builds a large portion of its product as a monolithic application. While we’ve also created a significant number of microservices — especially in places where high throughput is needed — a large amount of our product and feature work still takes place in our Django monolith.

Our developers work on unrelated components of the monolith at the same time

This means that the majority of our 70-member (and growing) engineering team is regularly working on a large monolithic Python codebase. We recently completed a very significant project to move from ~4 releases of the monolith per week to Continuous Deployment. See our related post How Nextdoor Made a 10x Improvement in Release Times with Docker and Amazon ECS.

With Continuous Deployment at Nextdoor, changes are released to production as soon as they are ready and as quickly as our build and deployment infrastructure can manage. Engineers land directly to master and we avoid long-lived branches. Furthermore, the workflow around deployment and rollback of the monolith is fully automated. There isn’t even a button that an engineer needs to push! Typically, we have dozens of releases per day in a fully-automated system where any engineer can deploy or rollback production.

High-level continuous deployment workflow

Making this operate smoothly has required lots of thought and polish on our workflow and the tooling around it. There isn’t much public information on how these details are managed at scale nor what kinds of tools you need to do this well. Since releasing code is at the core of our business, we invested in building a great tool to manage this complexity, named Conductor. We hope to open source Conductor in the near future. In this blog post we aim to share the biggest lessons we learned the hard way, and give a sneak peek into our tooling and workflow.

1) Continuous Deployment is much harder with lots of engineers and code

It’s one thing to practice Continuous Deployment (henceforth abbreviated to CD) with a microservice owned by a team consisting of a handful of engineers. A small team will introduce changes at a slower rate than a large team. When you have 70 engineers working on the same large application, the rate of change is much higher. Changes are being made in completely different areas of the application by different teams at the same time.

Microservices typically have fewer changes per release vs monolith

Furthermore, in a microservice the surface area for defects is much smaller. A defect in a microservice will generally be limited in its potential blast radius compared to a defect in a monolith, where one small bug can take down all functionality of the site. This is one of the downsides to a monolithic application — slowness or errors in one part can adversely affect the entire system, rather than just its local area. There is little isolation. Therefore, managing the high rate of change and blast radius is a critical ingredient in a successful CD workflow.

2) Use trains to manage many changes

With CD of a microservice, you might be able to have each release consist of a single change — i.e. do one release for each individual change. However, at the scale of the Nextdoor monolith, this isn’t usually possible. Given the test, build, and deploy overhead of our monolith (about 25 minutes total), if we released a single change at a time, changes would get backed up and it could take hours between engineers landing their changes on master and those changes going out to production. We aim to get changes live as quickly as possible once they’re landed on master. It becomes difficult and inconvenient for engineers when the delta is too large or there is too much uncertainty about when a landed change will be live in production.
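To make the batching argument concrete, here is a back-of-the-envelope calculation. Only the ~25-minute pipeline time comes from the paragraph above; the daily change volume and train size below are numbers we’ve invented purely for illustration.

```python
# Back-of-the-envelope illustration. Only the ~25-minute pipeline time comes from
# the text above; the change volume and train size are assumed for illustration.
PIPELINE_MINUTES = 25
changes_per_day = 100   # hypothetical number of commits landing on master per day

# One release per change: total pipeline time needed per day.
serial_minutes = changes_per_day * PIPELINE_MINUTES
print(f"One change per release: {serial_minutes / 60:.1f} hours of pipeline time")

# Batching changes into small trains brings this down to "dozens of releases per day".
train_size = 4
batched_minutes = (changes_per_day / train_size) * PIPELINE_MINUTES
print(f"Trains of {train_size}: {batched_minutes / 60:.1f} hours of pipeline time")
# In practice the next train starts while the previous one is still deploying,
# so wall-clock time is lower still.
```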

An example of a train, including number of commits, deployment time, and a “rollback” button.

This is where we introduce the concept of a “train”. A train is the smallest unit (or batch) of releasable changes, and we use the following rules to decide what goes on a train (a minimal sketch of the creation logic follows this list).

Train creation logic
  • There can be only one train “running on the tracks” at a time.
  • When a change lands on master, if there is already an existing train, it is queued for the next train. To use a train-based metaphor, the changes are “waiting on the platform” until a new train arrives. Note that before a change can be landed on master, a large suite of automated tests must pass and it must be accepted by a human reviewer.
  • If there is no existing train, a new train will be created containing all changes up to HEAD of master. Thus, all of the queued changes or those “waiting on the platform” are put on this new train.
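Here is a minimal sketch of the creation rules above, in Python. The class and method names are ours, invented for illustration; this is not Conductor’s actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Train:
    """A batch of releasable commits, running from the oldest queued change up to
    the HEAD of master at creation time."""
    head_sha: str
    commit_shas: List[str]

class TrainScheduler:
    """Illustrative sketch of the train-creation rules; not Conductor's real code."""

    def __init__(self) -> None:
        self.active_train: Optional[Train] = None
        self.queued_shas: List[str] = []   # changes "waiting on the platform"

    def on_change_landed(self, sha: str) -> None:
        # Only one train may be "running on the tracks" at a time.
        if self.active_train is not None:
            self.queued_shas.append(sha)   # queued for the next train
        else:
            self._create_train(head_sha=sha)

    def _create_train(self, head_sha: str) -> None:
        # A new train picks up every queued change, up to HEAD of master.
        self.active_train = Train(head_sha=head_sha,
                                  commit_shas=self.queued_shas + [head_sha])
        self.queued_shas = []
```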
Train phases (left to right)
  • Trains go through three phases. First, Delivery. This phase includes build and deployment to our Staging environment.
  • Then, Verification. This includes automated verifications, such as unit test runs and smoke tests, along with human verification. Human verification is tracked by polling the state of tickets — we use JIRA. Tickets are automatically created for each engineer with non-dark-launched changes on the train once it is delivered to Staging.
  • Finally, Deploy. Once all of the verifications are complete — which means all the automated tests have passed and any manual verification tickets have been closed by the engineers — the train will auto-deploy to production. (The gating logic is sketched after this list.)
Train in verification phase. Human verification completed, waiting on automated tests.
  • Once a train starts to deploy to production, a new train is implicitly created as soon as there are any queued changes. To maximize throughput, we don’t wait for deployment to complete before starting on the next train.
  • If there is a problem with a train, it can be manually extended via a button in the UI. Any engineer can press a button to include a fix or revert commit for something urgent. Train extension simply pulls in all queued commits — up to HEAD of master. This will slow down the current train, and therefore everyone else on it, so it should be used sparingly. Ideally, automated test coverage is good enough that you don’t find problems at this stage and there is no need to extend for a revert or fix.
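Below is a minimal sketch of the phase progression and verification gating described above. The `ci` and `issue_tracker` clients, and the attribute names on the train, are hypothetical stand-ins; Conductor’s real logic also handles rollback, train extension, and the implicit creation of the next train.

```python
from enum import Enum

class Phase(Enum):
    DELIVERY = "delivery"          # build + deploy to Staging
    VERIFICATION = "verification"  # automated tests + human verification tickets
    DEPLOY = "deploy"              # auto-deploy to production

def advance_train(train, ci, issue_tracker):
    """Sketch of the gating described above. `ci` and `issue_tracker` are
    hypothetical clients for the CI system and JIRA, invented for illustration."""
    if train.phase == Phase.DELIVERY and ci.staging_deploy_succeeded(train):
        train.phase = Phase.VERIFICATION

    elif train.phase == Phase.VERIFICATION:
        tests_green = ci.all_automated_tests_passed(train)
        # One JIRA ticket per engineer with non-dark-launched changes on the train;
        # the train is blocked until every ticket is closed.
        tickets_closed = all(
            issue_tracker.is_closed(ticket) for ticket in train.verification_tickets
        )
        if tests_green and tickets_closed:
            train.phase = Phase.DEPLOY   # auto-deploy to production, no button
```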

One of the advantages of the way we have implemented trains is that releases are typically quite small. The smaller your releases the better — since the likelihood of a problem increases with every additional change. Furthermore, when something does go wrong, it is much easier to find the offending change from within a small batch size than a large one.

Engineers are encouraged to put their changes behind a feature flag, and we provide a framework called Feature Config which powers this. If a change is behind a feature config, we let engineers bypass the manual verification process. The rationale here is that the feature can be manually verified at any time, potentially with a small number of users initially, and if it has a bug, it can be switched off instantly.
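Feature Config itself isn’t public, so the snippet below is only a stand-in with the same basic shape: a named flag that is dark by default, gating a new code path that can be flipped on (or off) at runtime.

```python
# Illustrative only: Feature Config's real API isn't public, so this is a stand-in
# flag store with the same basic shape (a named flag gating a new code path).
ENABLED_FLAGS = {"new_feed_ranking": False}   # dark by default; toggled at runtime

def is_enabled(flag_name: str) -> bool:
    return ENABLED_FLAGS.get(flag_name, False)

def render_feed(user: str) -> str:
    if is_enabled("new_feed_ranking"):
        return f"new ranking for {user}"      # new code path, safe to ship dark
    return f"legacy ranking for {user}"       # existing behavior until the flag flips
```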

3) Kill release teams — democratize deployment workflows instead

If you’re an engineer who has just finished a change, you have a strong incentive to see it released as soon as possible. Similarly, if you introduce a defect, you have a very strong desire to fix it and get that fix live in production as quickly as possible. You’ll be much happier and more productive if you can drive this workflow yourself.

If the workflow is instead driven and controlled by a dedicated team (e.g. a “release team”) the incentives aren’t necessarily the same. For example, if a bad data migration is in a release, it’s best to give the people who had changes in that release visibility so that they can quickly debug it and fix it.

Align Incentives

One of the philosophies at Nextdoor is to align incentives at the right levels to boost productivity as much as possible. Toward this goal, we pick a random individual who has a change on the train to be the “train engineer”. The train engineer is responsible for frontline triage of issues; they act as a release engineer, but for a very small release. For example, if an automated test fails during the verification phase of their train, they will be sent a Slack message telling them to triage and resolve the issue. This is typically resolved by reading the test failure output, triangulating a likely candidate change on the train, and then coordinating with that change’s author to get a fix or a revert onto the train to get it moving again.
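A sketch of how that selection and notification might look, assuming a Slack incoming webhook. The webhook URL is a placeholder and the field names on the train object are invented for illustration; this is not Conductor’s actual notification code.

```python
import random
import requests  # pip install requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder, not a real hook

def notify_train_engineer(train) -> None:
    """Pick a random author with a change on the train and ping them on Slack
    when verification fails. `train.authors` and `train.id` are invented names."""
    train_engineer = random.choice(train.authors)
    message = (
        f"<@{train_engineer}> you're the train engineer for train {train.id}. "
        "An automated test failed during verification. Please triage, and pull in "
        "a fix or revert if needed."
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
```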

In practice, the train engineer system works well. If a particular train is delayed or has issues, there is clear individual responsibility for tracking down verifiers, performing a rollback or extending the train with a fix. Since the train engineer by definition has a change on the train themselves, they also have a strong incentive to resolve problems quickly. By selecting the train engineer at random, release process knowledge and load is gradually spread throughout the organization.

Get Humans Out Of The Way

Another observation was that human verification can be extremely slow. For example, perhaps an engineer has a change on the train and then they go into an interview for an hour or a series of meetings — meanwhile everyone else is held up. To tackle this problem, we have created a culture where it’s agreed that excessively slow verifications aren’t acceptable behavior. It’s considered as serious a breach of etiquette as breaking tests in master. Other people on the train will go and track down the person with some urgency, or possibly find someone else to verify it — or in extreme cases revert the offending commit entirely.

While this culture helps to a significant degree, humans will make mistakes. The best way to avoid having a human forget to do their verification is to make it unnecessary for them to verify in the first place. To aid in this, we have introduced additional tools and workflows that let you bypass verifying your change on Staging entirely, and therefore avoid the risk of holding everybody up (a sketch of how these bypasses might gate verification follows the list):

  • Pre-land verification environments, called Preview Environments. These are unique, per-code-review, fully-isolated Staging-like environments which engineers can use to verify their changes before they land them on master. If they use this capability, they bypass Staging and its manual verification requirements entirely. Thus there is no human verification requirement for their change during the release process.
  • Feature Config. If your change is behind a feature config, you don’t need to verify manually on Staging since your change has no impact on Production by default. It can be rolled out to a limited audience and quickly turned off if it has bugs or issues.
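Here is the sketch referenced above: a small predicate capturing when a change can skip manual Staging verification. The attribute names are invented for illustration; the rules are the two bypasses just listed.

```python
def needs_manual_staging_verification(change) -> bool:
    """Sketch of the bypass rules described above; attribute names are invented.
    A verification ticket is only created when neither bypass applies."""
    if change.behind_feature_config:
        return False   # dark by default; can be verified and toggled in production
    if change.verified_in_preview_environment:
        return False   # already verified pre-land in an isolated environment
    return True        # otherwise, a JIRA ticket gates the train on this change
```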

Conclusion

At Nextdoor, we have been continuously deploying dozens of releases per day of our large monolith for about 10 months. We have 70+ engineers working on this codebase. The benefits of Continuous Deployment to the speed of our product development organization have been significant.

However, the unique challenges posed by both a monolithic application and a rapid pace of change have required us to invest heavily in workflow optimization and tooling to achieve this. It took a team of 3 engineers about 9 months to work out the intricacies of the workflow and developer user experience, and then to build the associated tooling. At the core is the Conductor microservice, which glues together Jenkins, Slack, GitHub and JIRA to present a coherent user interface and enforce rules. Since every engineer at Nextdoor interacts with Conductor to ship their code, we wanted to make it as pleasurable as possible to use.

We plan to open source Conductor — which was designed to be modular, with pluggable support for third-party services — in the near future.
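Conductor’s plugin interfaces aren’t published yet, so the following is only a guess at what “pluggable support for third-party services” might look like in practice.

```python
from abc import ABC, abstractmethod

class ServiceIntegration(ABC):
    """Hypothetical shape of a pluggable third-party integration; not Conductor's
    actual interface."""

    @abstractmethod
    def on_train_created(self, train) -> None: ...

    @abstractmethod
    def on_train_deployed(self, train) -> None: ...

class SlackIntegration(ServiceIntegration):
    def on_train_created(self, train) -> None:
        pass  # e.g. announce the new train in a deploys channel

    def on_train_deployed(self, train) -> None:
        pass  # e.g. post the production deploy summary
```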

I’d like to give a huge shout out to the engineers on the Dev Tools Team who have contributed to this project and this blog post: Steve Mostovoy, Rob Mackenzie, Mikhail Simin and Alex Karweit.

Find this sort of stuff cool? The Nextdoor engineering team is always looking for motivated and talented engineers.
