Nextdoor Engineering - Medium

Let AI Entertain You: Increasing User Engagement with Generative AI and Rejection Sampling

Jaewon Yang — Mon, 16 Oct 2023 17:03:52 GMT

Generative AI (Gen AI) has demonstrated proficiency in content generation but does not consistently guarantee user engagement, mainly for two reasons. First, Gen AI generates content without considering user engagement feedback. While the content may be informative and well-written, it does not always translate to increased user engagement such as clicks. Second, Gen AI-produced content often remains generic and may not always provide the specific information that users seek.

Nextdoor is the neighborhood network where neighbors, businesses, and public agencies connect with each other. Nextdoor is building innovative solutions to enhance user engagement with AI-Generated Content (AIGC). This post outlines our approach to improving user engagement through user feedback, focusing specifically on notification email subject lines. Our solution employs rejection sampling [1], a technique used in reinforcement learning, to boost engagement metrics. We believe our work presents a general framework for driving user engagement with AIGC, particularly when off-the-shelf Generative AI falls short in producing engaging content. To the best of our knowledge, this marks an early milestone in the industry’s successful use of AIGC to enhance user engagement.

Introduction

At Nextdoor, one of the ways we drive user growth and engagement on the platform is through emails. One of these emails is the New and Trending notification, in which we send a single post that we think the user might be interested in and want to engage with. For each email, we need to choose a subject line. Historically, we simply picked the first few words of the post as the subject line. However, in certain posts these initial words are greetings or introductory remarks that provide little useful information to the user. In the provided image example below, we observe a simple greeting, “Hello!”

Figure 1. New and Trending email where we show a single post. Prior to the Gen AI system we built, we used the first words of the post as the subject line (Life and Mother Nature always find a way!)

In this work, we aim to use Generative AI technologies to improve the subject line. With Generative AI, we aim to generate informative and interesting subject lines that will lead to more email opens, clicks and eventually more sessions.

Writing a good subject line with Generative AI is challenging because the subject line needs to satisfy several criteria. First and foremost, the subject line needs to be engaging so that users want to open the email. To see whether the ChatGPT API can write engaging subject lines, we generated subject lines with it in a small-traffic A/B test, and found that users are less likely to click on emails with subject lines made by the ChatGPT API (e.g. Table 1). As we show later, we tried to improve the prompts (prompt engineering), but the results were still inferior to the user-generated subjects. This finding implies that Generative AI models are not trained to write content that is particularly engaging to our users, and that we need to guide them to increase user engagement.

Table 1. Subject line made by ChatGPT API and its CTR. ChatGPT API’s subject line is more informative but reads like a marketing phrase, and it produced only 56% as many clicks as the user-generated subject line.

The second challenge is that the subject line needs to be authentic. If it reads like a marketing phrase, the email will look like spam. The example in Table 1, “Support backyard chickens in Papillion, NE!”, shows this issue.

Third, the subject line should not contain hallucinations (a response that is nonsensical or not accurate). And it is well known that Generative AI is vulnerable to hallucinations [2]. For example, given a very short post saying “Sun bathing ☀️”, ChatGPT API in Table 1 generated the subject line “Soak Up the Sun: Tips for Relaxing Sun Bathing Sessions”, which had nothing to do with the post content.

We developed a novel Generative AI method to overcome the three challenges faced by the ChatGPT API mentioned above. We made three contributions:

Prompt engineering to generate authentic subject lines with no hallucination: Given a post, ChatGPT API creates a subject line by extracting the most interesting phrases of the post without any rewriting. By extracting the user’s original writing, we are able to prevent marketing phrases and hallucinations.
Rejection sampling with a reward model: To find the most interesting subject line, we develop a reward model whose job is to predict whether users would prefer a given subject line over other subject lines. After ChatGPT API writes a subject line, we evaluate it with the reward model and accept it only if its reward model score is higher than the user-written subject line’s score. This technique is called rejection sampling and was recently introduced to reinforcement learning for Large Language Model training [1].
Cost optimization and model accuracy maintenance: We added engineering components to minimize the serving cost and stabilize the model performance. By using caching, we reduced our cost to 1/600 of the brute-force approach. Through daily performance monitoring, we can detect when the reward model fails to predict which subject is more engaging due to external factors such as user preference drift, and address it by retraining.

We believe that this framework is generally applicable when off-the-shelf Generative AI fails to improve user engagement. We also analyzed the importance of each component in our design. Even with the aforementioned prompt engineering, ChatGPT API did not necessarily produce more engaging content. This highlights the necessity of the rejection sampling component: in such cases, we can develop another AI model as a reward model and use the Generative AI’s output only if the reward model approves [1].

Proposed Method

For every post, we employ the following system to create a subject line. It’s important to mention that we generate a single subject line for each post, without personalization. This decision was made to minimize computational cost. Exploring cost-effective methods for implementing personalized subject lines is interesting future work.

Model Overview

Figure 2 illustrates our approach. We develop two different AI models.

Subject line generator: This model generates a subject line given a post content.
Reward model (Evaluator): Given a subject line and the post content, this model predicts whether the given subject line would be a better subject line than the user-generated one.

Figure 2. Overview of our approach.

Given a post, the Subject line generator produces the subjects shown in the green boxes in Figure 2. The reward model compares the OpenAI API subject line (green) with the user-generated subject line (red), and selects the more engaging one. For the top post, the OpenAI API subject line contains more relevant information and is selected. For the bottom post, which was about a health alert, the reward model selects the user-generated subject. While the OpenAI API subject line shows the main content of the alert, the reward model picks the user-generated subject because it shows the importance of the post and thus is more engaging.
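The selection step in Figure 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Nextdoor's production code: `generate_subject` and `reward_says_better` are hypothetical stand-ins for the OpenAI API generator and the fine-tuned reward model, and the toy implementations exist only to make the example runnable.

```python
from typing import Callable


def choose_subject(
    post: str,
    user_subject: str,
    generate_subject: Callable[[str], str],
    reward_says_better: Callable[[str, str], bool],
) -> str:
    """Return the generated subject only if the reward model accepts it."""
    candidate = generate_subject(post)
    # Rejection step: keep the candidate only when the reward model predicts
    # it will beat the user-generated subject; otherwise reject it.
    if reward_says_better(candidate, post):
        return candidate
    return user_subject


# Toy stand-ins (assumptions for illustration only).
def toy_generator(post: str) -> str:
    # Pretend the model skips the greeting and extracts the next five words.
    return " ".join(post.split()[1:6])


def toy_reward(subject: str, post: str) -> bool:
    # Pretend the reward model only approves subjects that mention chickens.
    return "chickens" in subject
```

The important property is that the user-generated subject is always the default: a generated subject has to earn its place, so a weak generator output cannot make things worse than the baseline as long as the reward model is reasonably accurate.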

Developing Subject Line Generator

We use OpenAI API without fine-tuning. In the prompt, we require that OpenAI API extract the most interesting part of the post without making any changes. Extracting user content in this way provides two benefits. First, it removes hallucinations. Second, it keeps the subject line authentic, as OpenAI API does not rewrite the original content. To test the prompt engineering, we A/B tested generator outputs without reward models. We found that asking OpenAI API to extract in the prompt yields a 3% relative improvement in Sessions compared to asking OpenAI API to rewrite the subject line from scratch (see the Results section for details).

Developing Reward Model

We fine-tune a model through OpenAI API to develop a reward model. This is the main innovation we add on top of the generator.

Training data collection: The challenge is to collect training data on which subject line was more engaging. Manual annotation is not feasible because there are no rules for deciding which subject line is more engaging. In fact, subject lines that we expected to be more engaging than the user-generated ones turned out to be less engaging (Table 2).

Table 2. Emails with a user-generated subject (left) generated 3x as many clicks as the emails with OpenAI API-generated subjects on the right.

To tackle this issue, we collect training data via experimentation. For each post, we obtain subject lines in two ways: one is the user-generated subject line, and the other comes from the OpenAI API generator described above. We then serve each subject line to a randomly selected 2–3% of users (~20k). The goal is to learn from click data which subject line was more engaging.
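Turning the click data from the two arms into a training label could look like the sketch below. It assumes engagement is compared by click-through rate (clicks divided by sends) and that ties go to the user-generated subject; the post does not state the exact rule or whether a significance test is applied, and the `ArmStats` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ArmStats:
    sends: int   # emails sent with this subject line
    clicks: int  # clicks those emails received

    @property
    def ctr(self) -> float:
        """Click-through rate; 0.0 when nothing was sent."""
        return self.clicks / self.sends if self.sends else 0.0


def label_example(user_arm: ArmStats, ai_arm: ArmStats) -> str:
    """Return "Yes" if the generated subject out-clicked the user's, else "No"."""
    # Assumption: a tie counts as "No", i.e. the user-generated subject wins.
    return "Yes" if ai_arm.ctr > user_arm.ctr else "No"
```

For a case like Table 2, where the user-generated subject drew three times as many clicks, `label_example(ArmStats(20000, 300), ArmStats(20000, 100))` yields "No".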

Model training: We fine-tuned a model through OpenAI API with the labels we collected. We used ~50k examples; in 40% of them the OpenAI API subject was the winner, and in the rest the user subject was the winner. Given a subject line and post content, our model is fine-tuned to predict whether the subject line would generate more engagement (clicks) than the user-generated subject line, and to output “Yes” or “No”.

Training details: We used the smallest OpenAI API model, “ada”, for fine-tuning. We found that larger models did not improve the predictive performance despite their higher cost. We added a logit bias of 100 for “Yes” and “No”. These biases boost the probability that the model outputs “Yes” or “No”. We tried varying the number of epochs and selected the model trained for 4 epochs, although we did not see much difference in offline performance after 2–3 epochs.
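A labeled comparison can be serialized as one JSON line per example. The sketch below uses the prompt/completion layout that OpenAI's legacy fine-tuning endpoint accepted for base models such as ada; the prompt wording, the `###` separator, and the function name are assumptions, since the post does not publish its training prompt.

```python
import json


def to_training_record(post: str, subject: str, label: str) -> str:
    """Serialize one labeled comparison as a prompt/completion JSON line."""
    if label not in ("Yes", "No"):
        raise ValueError("label must be 'Yes' or 'No'")
    # Hypothetical prompt template; the real one is not given in the post.
    prompt = (
        f"Post: {post}\n"
        f"Subject: {subject}\n"
        "Is this subject more engaging than the user-written one?\n\n###\n\n"
    )
    # A leading space on the completion follows the common convention for
    # completion-style fine-tuning data.
    return json.dumps({"prompt": prompt, "completion": " " + label})
```

Writing roughly 50k such lines to a `.jsonl` file would give a training set of the size described above.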
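To see why a logit bias of 100 effectively restricts the output to “Yes” or “No”, consider what the bias does to the next-token distribution. The sketch below uses made-up logits over a four-token vocabulary and keys the bias by token string for readability; the real API keys `logit_bias` by token ID and operates over the full vocabulary.

```python
import math
from typing import Dict


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a token -> logit mapping."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: val / total for tok, val in exps.items()}


def apply_logit_bias(logits: Dict[str, float], bias: Dict[str, float]) -> Dict[str, float]:
    """Add a per-token bias to the raw logits before sampling."""
    return {tok: val + bias.get(tok, 0.0) for tok, val in logits.items()}


# Hypothetical next-token logits (assumption for illustration only).
raw = {"Yes": 1.0, "No": 1.5, "Maybe": 3.0, "The": 2.5}
biased = apply_logit_bias(raw, {"Yes": 100.0, "No": 100.0})
probs = softmax(biased)
```

Without the bias, "Maybe" would be the most likely token here. With it, "Yes" and "No" together absorb essentially all of the probability mass, while their relative odds stay the same as in the unbiased logits because both received the same shift.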

Engineering details: We added the following components for optimization and safeguarding.

Caching: For each post, we cache the outputs of our model. By processing each post only once, we reduced the cost to 1/600. In other words, each post is sent 600 times on average, and we process it only once instead of 600 times. Caching also reduces our OpenAI API usage (the number of tokens and the number of requests).
Reward model performance maintenance: We monitor the reward model’s predictive performance daily, using the next day’s user clicks after the training phase as the ground truth to compare with the model’s output. The model’s predictive performance can change because our users’ preferences may change and the content on Nextdoor can shift in writing style or topics.
For monitoring purposes, we collect the engagement performance of different subject lines in the following way. We created a “control” user bucket where we always send emails with the user-generated subject, and an “always OpenAI API” bucket where we always send emails with the OpenAI API subject, regardless of the reward model’s output. From these two buckets, we know the ground truth on which subject line was more engaging, and can measure the reward model’s accuracy. If the accuracy drops by 10% or more, we retrain the reward model with new data.
Retries with fallback: Since OpenAI API may return an error due to rate limits or transient issues, we added retries with exponential backoff using Tenacity. If we still fail after a certain number of retries, we fall back to the user-generated subject.
Controlling the length of output: We found that the Subject line generator would write a subject line longer than our desired length (10 words). This happened even when we specified the 10-word limit in the instruction and added examples. We therefore post-process the generator output by keeping only its first 10 words. We A/B tested different word limits and found that 10 is the optimal value.
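Caching, retries with fallback, and length control fit together roughly as in the sketch below. It is an illustration under assumptions: the post uses Tenacity for retries, whereas this version uses a hand-rolled exponential backoff loop to stay dependency-free; the cache is an in-memory dict rather than a shared store; and `SubjectService` and its method names are hypothetical.

```python
import time
from typing import Callable, Dict

MAX_WORDS = 10  # word limit reported in the post


def truncate_words(text: str, limit: int = MAX_WORDS) -> str:
    """Keep only the first `limit` whitespace-separated words."""
    return " ".join(text.split()[:limit])


def call_with_backoff(fn: Callable[[], str], attempts: int = 3,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry `fn`, doubling the wait after each failure; re-raise the last error."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")


class SubjectService:
    """Compute each post's subject line once and reuse it for every recipient."""

    def __init__(self, generate: Callable[[str], str],
                 sleep: Callable[[float], None] = time.sleep):
        self._generate = generate
        self._sleep = sleep
        self._cache: Dict[str, str] = {}

    def subject_for(self, post_id: str, post_text: str, user_subject: str) -> str:
        if post_id in self._cache:
            return self._cache[post_id]
        try:
            raw = call_with_backoff(lambda: self._generate(post_text),
                                    sleep=self._sleep)
            subject = truncate_words(raw)
        except Exception:
            # All retries failed: fall back to the user-generated subject.
            subject = user_subject
        # Assumption: the fallback is cached too, so a failing post is not
        # retried for every recipient.
        self._cache[post_id] = subject
        return subject
```

With a post sent about 600 times, only the first call to `subject_for` reaches the generator; the remaining calls are dictionary lookups, which is where the 1/600 cost figure comes from.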

Results

We ran A/B tests with different versions of the subject line generator, and with and without the reward model. For the generator, we tested the following options:

Writing with OpenAI API: We ask OpenAI API to “write an engaging subject line for a given post”. This was the first version we tested without much prompt engineering.
Extracting with OpenAI API: We ask OpenAI API to extract the most interesting part and provide 5 examples. We also add requirements in a numbered list such as “Do not insert or remove any word.”, “Do not change capitalization”, “If the first 10 words are interesting, use them as a subject line”. We tried 4 different versions of prompts and picked the best version by A/B test metrics.
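An extraction prompt of this shape could be assembled as sketched below. Only the three quoted requirements come from the text above; the opening instruction, the "Post:/Subject:" layout, and the function name are assumptions, and the actual prompt used in the experiments is not published.

```python
from typing import List, Tuple

# Requirements quoted in the post; the real prompt may contain more.
REQUIREMENTS = [
    "Do not insert or remove any word.",
    "Do not change capitalization",
    "If the first 10 words are interesting, use them as a subject line",
]


def build_extraction_prompt(post: str, examples: List[Tuple[str, str]]) -> str:
    """Assemble instruction, numbered requirements, few-shot examples, and the post."""
    lines = [
        # Hypothetical instruction wording.
        "Extract the most interesting part of the post as an email subject line.",
        "Requirements:",
    ]
    lines += [f"{i}. {req}" for i, req in enumerate(REQUIREMENTS, start=1)]
    for ex_post, ex_subject in examples:
        lines += ["", f"Post: {ex_post}", f"Subject: {ex_subject}"]
    # Leave the final subject empty for the model to complete.
    lines += ["", f"Post: {post}", "Subject:"]
    return "\n".join(lines)
```

Keeping the requirements in a list makes it straightforward to A/B test prompt variants by changing one entry at a time.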

For the A/B test metrics, we primarily focus on Sessions. A session is an activity sequence made by the same user within a certain timeframe, and sessions quantify the number of unique user visits.

Table 3 shows the results on Session lift compared to the “control” bucket where we use user-generated subject lines. In addition to the session metrics, our final model (last row) increased Weekly Active Users by 0.4% and Ads revenue by 1%.

Table 3. Session lift compared to the user-generated subject lines from A/B tests. The final model (last row) achieved 1% lift in sessions.

Here is what we learned from A/B tests:

Prompt engineering improves the performance but has a ceiling. After a few iterations, the A/B test metrics showed only marginal improvements, failing to beat the control.
Finding the “optimal” prompt is an elusive task, as the space of potential prompts is boundless, making it difficult to explore. Moreover, there is no established algorithmic or systematic method for enhancing prompts. Instead, the task relies on human judgment and intuition to update the prompt.
Reward model was the key factor in improving sessions.
Predicting popular content is challenging, as is the reward model’s task of forecasting popular subject lines; the reward model currently achieves about 65% accuracy. Enhancing the reward model’s performance by leveraging real-time signals, such as the current engagement numbers for the subject, is interesting future work.
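The accuracy figure above, and the daily retraining check described under the engineering details, both reduce to comparing the reward model's predicted winner with the winner observed in the control and always-OpenAI-API buckets. A minimal sketch follows, assuming the 10% threshold is a relative drop from a baseline accuracy; the post does not say whether the drop is relative or absolute, and the function names are hypothetical.

```python
from typing import List, Tuple


def accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (predicted, observed) label pairs that agree."""
    if not pairs:
        raise ValueError("need at least one labeled post")
    return sum(pred == obs for pred, obs in pairs) / len(pairs)


def needs_retraining(baseline: float, today: float, max_drop: float = 0.10) -> bool:
    """Flag a retrain when accuracy falls by `max_drop` or more relative to baseline."""
    # Assumption: the drop is measured relative to the baseline accuracy.
    return (baseline - today) / baseline >= max_drop
```

For example, with a 65% baseline, a day at 58% accuracy is a relative drop of about 10.8% and would trigger retraining, while a day at 60% would not.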

Conclusions

We developed a novel Generative AI system to increase user engagement by combining the reward model and prompt engineering. Our systems have engineering components for cost saving and monitoring. A/B tests showed that our systems can deliver more engaging subject lines than the user-generated subject lines.

There are many avenues for future work. The first is to fine-tune the subject line generator. In this work, we used vanilla ChatGPT API as the generator. Instead, we can fine-tune OpenAI API with the most engaging titles that the reward model identifies. For each post, we generate multiple subject lines and use the reward model to pick the winner. Then we use the winning subject to fine-tune the subject line generator. This approach is called Reinforcement Learning by Rejection Sampling [1].

The second is to rescore the same post daily. Currently, we pick the best subject line with a reward model once and never rescore. However, as time goes on, we may be able to observe whether the OpenAI API subject line or the user-generated subject line is getting more engagement, which would let our reward model predict more accurately. The third is to add personalization without significantly escalating computational costs.

Acknowledgments

The post was written by Jaewon Yang and Qi He.

This work was led by the Generative AI team with cross-org collaboration between Notification team and ML teams. We would like to give a shout out to all the contributors:

Jingying Zeng, Waleed Malik, Xiao Yan, Hao-Ming Fu, Carolyn Tran, Sameer Suresh, Anna Goncharova, Richard Huang, Jaewon Yang, Qi He

Please reach out to us if you are interested to learn more — we are hiring!

References

[1] Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models, Arxiv preprint, 2023

[2] Ji et al. Survey of Hallucination in Natural Language Generation, ACM Computing Surveys, 2022

Let AI Entertain You: Increasing User Engagement with Generative AI and Rejection Sampling was originally published in Nextdoor Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

From Pre-trained to Fine-tuned: Nextdoor’s Path to Effective Embedding Applications

Karthik Jayasurya — Thu, 07 Sep 2023 11:31:32 GMT

Background

The majority of ML models at Nextdoor are typically driven by a large number of features that are primarily either continuous or discrete in nature. The personalized features usually stem from historical aggregations or real-time summarization of interaction features, typically captured through logged tracking events. However, representing content through deep understanding using information behind it (text/image) is crucial for modeling nuanced user signals and better personalizing complex user behavior across many of our products. In the rapidly evolving field of NLP, utilizing transformer models to perform representation learning effectively and efficiently has become increasingly important for user understanding and improving their product experience.

Towards that, we have built many entity embedding models spanning entities such as posts, comments, users, search queries & classifieds. We first leveraged deep understanding of content and used that to derive embeddings for meta entities like users based on the content they interacted with in the past. These powerful representations proved crucial for extracting meaningful features for some of the biggest ML ranking systems at Nextdoor, such as notifications scoring and feed ranking. By making them readily available and building to scale, we can reliably drive adoption of state-of-the-art models and put them in the hands of ML Engineers for rapidly building performant models across the company.

This blog primarily focuses on how we iterated on the development of embedding models, how they are featurized and served at large scale in various product applications, and some of the challenges encountered during this process. We summarize the evolution of the work across three sections. Section 1 focuses on leveraging state-of-the-art pre-trained models to rapidly evaluate the value of embedding models as feature extractors. Section 2 describes how to fine-tune embeddings using unlabeled data for certain products, whereas Section 3 demonstrates the use of labeled data to fine-tune embeddings for better task prediction. This work is driven by the Knowledge Graph Team at Nextdoor, a horizontal team that works in close collaboration with product ML teams as well as the ML Platform team, which owns the ML training and serving platform and the FeatureStore service powering ML models at Nextdoor.

1. Leveraging Pre-trained models

The first generation of embeddings is built from pre-trained language models using the Sentence-BERT paradigm (https://www.sbert.net/). SBERT is well known to produce better embedding representations than the original BERT models [1]. The main goal here is to rapidly experiment with embeddings as features and realize their value in the product as quickly as possible. The text of content entities, viz. Nextdoor posts & comments, is extracted from the post’s subject and body and from the comment text respectively, and is then fed into a multilingual text embedding model to derive the respective entity embeddings for all countries Nextdoor operates in. For a given user, the embeddings of the posts they have historically interacted with are aggregated with weights based on interaction type to form the user (interaction) embedding. For example, an active interaction such as post creation, comment, or click has a higher weight than a more passive interaction like an impression. These signals are aggregated across both online (feed) and offline (emails) product surfaces to represent the user embedding holistically, and are updated daily for all users on the platform.
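The weighted aggregation can be illustrated with a small sketch. It assumes the user embedding is a weighted mean of post embeddings; the weight values and interaction names are hypothetical, since the post only states that active interactions weigh more than passive ones.

```python
from typing import Dict, List, Tuple

# Hypothetical weights: active interactions count more than passive ones.
INTERACTION_WEIGHTS: Dict[str, float] = {
    "create": 3.0,
    "comment": 3.0,
    "click": 2.0,
    "impression": 0.5,
}


def user_embedding(interactions: List[Tuple[str, List[float]]]) -> List[float]:
    """Weighted mean of the embeddings of posts a user interacted with."""
    if not interactions:
        raise ValueError("user has no interactions")
    dim = len(interactions[0][1])
    total = [0.0] * dim
    weight_sum = 0.0
    for kind, vec in interactions:
        weight = INTERACTION_WEIGHTS[kind]
        weight_sum += weight
        for i, value in enumerate(vec):
            total[i] += weight * value
    # Normalize so that heavy users do not get larger vectors than light users.
    return [value / weight_sum for value in total]
```

A user who clicked a post with embedding `[1.0, 0.0]` and merely saw one with embedding `[0.0, 1.0]` ends up much closer to the clicked post than to the one they only saw.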

These features were found to be among the most important features for multiple ranking models and delivered significant performance lifts in key product OKR metrics across both notifications and feed when shipped in early 2022. The pre-trained models also served as a good proof-of-concept to build out reliable feature ingestion pipelines and monitoring systems identifying any potential feature drifts and disruptions. This helped form a robust playbook for deploying several next generation embedding features.

2. Fine-tuned embeddings from unlabeled data

The next generation of embeddings comes from training custom models that improve on the pre-trained versions through fine-tuning. The signals used to generate the earlier embeddings come from user interactions across notifications and home feed products, either directly or indirectly. In contrast, this section details a use case that makes use of unlabeled data to perform representation learning to improve the user search experience.

Our neighbors use Nextdoor search to find useful local information by expressing intent explicitly. We tried to capture both long- and short-term intent in order to serve users’ perennial needs (e.g. home maintenance) as well as ephemeral ones (e.g. lost & found). Search queries, while high-intent in nature, are inherently short and noisy. A searcher might try multiple variations of a query in succession in order to get their intent fulfilled as much as possible. Additionally, due to the nature of local search, relying on labeled feedback from search results may not fully capture user intent due to limited liquidity.

To fully capture user intent signals, we rely on a self-supervised training strategy to learn fine-tuned representations for any given query. Specifically, we first built an SBERT-backed query embedding model that learns to embed search queries in a lower-dimensional space. Then, we aggregate embeddings of user queries across different time windows (weekly/monthly/quarterly periods) to generate multiple user (intent) embeddings. The same model also extracts the intent of a post to generate the corresponding post embeddings. The resulting user, post and query embeddings are transformed and featurized as described in a later section to improve the performance of the ranking models.

The query embedding model was originally built to drive contextual query expansion in the Nextdoor search pipeline [2]. This sentence transformer model is trained on historical search queries in order to best learn query representations. We first collected search logs consisting of sequences of search queries within a session, across all searchers over a period of time. The queries are then pre-processed using traditional NLP methods like lemmatization, spell checking, deduplication etc. to form a clean corpus of tokens, which is composed of n-grams (n=1,2,3) and whole queries. To generate a training dataset, we created positive pairs from tokens occurring within the same user search session and negative pairs from tokens drawn randomly across sessions. Contrastive learning with cosine similarity loss is used to train the underlying model.
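The pair construction can be sketched as follows. This is an illustration under assumptions: the target similarity of 1.0 for positives and 0.0 for negatives, the number of negatives per session, and the `build_pairs` name are not specified in the post, and pre-processing such as lemmatization is assumed to have happened already.

```python
import itertools
import random
from typing import List, Tuple

Pair = Tuple[str, str, float]  # (text_a, text_b, target cosine similarity)


def build_pairs(sessions: List[List[str]], negatives_per_session: int = 1,
                seed: int = 0) -> List[Pair]:
    """Label within-session pairs 1.0 and random cross-session pairs 0.0."""
    rng = random.Random(seed)
    pairs: List[Pair] = []
    for idx, session in enumerate(sessions):
        # Positives: every pair of distinct tokens from the same session.
        for a, b in itertools.combinations(sorted(set(session)), 2):
            pairs.append((a, b, 1.0))
        # Negatives: pair a token from this session with one from another.
        others = [s for j, s in enumerate(sessions) if j != idx and s]
        if not others or not session:
            continue
        for _ in range(negatives_per_session):
            other = rng.choice(others)
            pairs.append((rng.choice(session), rng.choice(other), 0.0))
    return pairs
```

Each triple maps directly onto the (text pair, float label) input that a cosine-similarity loss expects. Random negatives can occasionally pair two related queries from different sessions; at scale this noise is usually tolerable, but it is a known weakness of the approach.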

For the query expansion use case, this model drove better contextual search results by identifying related candidates, improving recall. This not only helped improve key search metrics across content search and product search in For Sale & Free, but also reduced the rate of null queries significantly compared to prior word embedding models. We also leveraged HNSWlib [3], an approximate nearest neighbors library, to implement this deep learning based query expansion, which improved expansion latencies by more than 10x. For notification & feed use cases, intent features generated from transformations of post & user embeddings helped achieve significant positive impact on our top-line engagement metrics. Although these features can only be computed for searchers and have low coverage overall, this explicit signal is found to be very useful in improving the overall search experience.

3. Fine-tuned embeddings from labeled feedback

In the next evolution of embeddings, we additionally leverage user feedback to fine-tune models further. The pre-trained entity embeddings have served us well for over a year, but they are off-the-shelf models trained using public benchmark datasets. As such, their semantics are quite different in nature from the Nextdoor domain. Moreover, their high but fixed model dimensionality contributes to significant storage and serving costs, especially when user embeddings are updated for all Nextdoor neighbors daily. To address these, we built a two-tower framework to fine-tune embeddings with user feedback collected across Nextdoor surfaces while reducing dimensionality, customizing to our domain, and being cost effective.

The fine-tuned models are developed and trained in phases, incrementally adding complexity. In the first phase, the inputs to post and user towers are pre-trained embeddings, which are then transformed using multiple FC layers, reducing dimensionality at each step. The standard cross-entropy is used as a loss function to predict the task of notification clicks for a given user and post. To generate a training dataset, we sampled from random explore logs to reduce selection bias, the same process as that of the downstream ranking model. Once the model is fully trained, the last layer generates fine-tuned user and post representations.
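The forward pass of such a two-tower model can be written out in plain Python to make the structure concrete. This is a sketch under assumptions: the post does not say how the two tower outputs are combined, so a dot product followed by a sigmoid is assumed; biases are omitted; and the layer sizes are toy values. The production models are PyTorch models trained on GPUs.

```python
import math
from typing import List

Matrix = List[List[float]]


def linear(x: List[float], weights: Matrix) -> List[float]:
    """One fully connected layer without bias; each row yields one output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def tower(x: List[float], layers: List[Matrix]) -> List[float]:
    """Apply FC layers in sequence, shrinking the dimensionality step by step."""
    for depth, weights in enumerate(layers):
        x = linear(x, weights)
        if depth < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only
    return x  # the last layer's output is the fine-tuned embedding


def click_probability(user_vec: List[float], post_vec: List[float]) -> float:
    """Assumed scoring: sigmoid of the dot product of the two tower outputs."""
    logit = sum(u * p for u, p in zip(user_vec, post_vec))
    return 1.0 / (1.0 + math.exp(-logit))


def cross_entropy(prob: float, clicked: int) -> float:
    """Binary cross-entropy for a single (user, post) example."""
    return -math.log(prob) if clicked else -math.log(1.0 - prob)
```

After training, only the towers are needed at serving time: each one maps a pre-trained embedding to a smaller fine-tuned embedding that can be stored and reused.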

These PyTorch models are trained on millions of records using SageMaker GPU instances with varying hyperparameters, and the model with the best offline performance is chosen to generate & store fine-tuned embeddings into FeatureStore. The earlier described playbook is followed to build and monitor offline and online feature pipelines. Serving these cached features to downstream models has shown promising lifts in all engagement metrics (CTR/sessions/contributions/DAU/WAU) while keeping guardrail metrics that measure harmful/hurtful content distribution across the platform neutral.

In the next phase, we fed the post tower directly with text extracted from the post entity, allowing us to fine-tune the parameters of the SBERT model. The test AUC score is used as a benchmark to determine how many layers and transformer blocks to unfreeze when trying out different training schemes, along with optimization of the hyperparameters of typical DNN models. The best model also improved the user-post cosine similarity of the fine-tuned embeddings by up to 16% compared to the respective pre-trained versions, an additional evaluation criterion that measures the intrinsic quality improvement of the representations. It is also noteworthy that this quality improvement is achieved while reducing dimensionality by more than 10x!

In the most recent phase, we extended to a multi-task learning (MTL) setup that models both notification clicks and feed actions to jointly optimize the learning of fine-tuned embeddings. Again, these objectives mimic the downstream rankers exactly to make sure the learnt embeddings directly optimize downstream tasks. MTL models have the added advantage of learning a single model across multiple product surfaces, thereby reducing operational burden and maintenance costs, while leveraging knowledge transfer across shared tasks for better representations. The feed and notifications surfaces are highly related, as clicking on an email notification lands directly on the pinned view of the post in the newsfeed. Additionally, most actions on the home feed are used as features in the notification ranker, which ties the two tasks together even more closely.

Using embeddings in ML models

As most of our downstream production models are tree-based models, they don’t directly integrate with vector features like embeddings the way deep neural networks do. Therefore, we primarily use embedding models as feature extractors for downstream models. Specifically, we rely on transformations like cosine similarity & dot products across these entity embeddings in order to generate meaningful affinity features. While the transition to neural network systems is currently underway, these vector transformations provide a neat way to integrate embedding-based features into existing models and enable rapid experimentation for assessing the performance lift of new deep features.

We first create schema and declare feature groups corresponding to each embedding to host within our in-house FeatureStore. Then, content-based embedding features are ingested into our FeatureStore in near real-time using task worker jobs as the content gets created/updated. For users, daily scheduled jobs in Airflow compute embedding aggregations based on pre-specified lookback windows, weighted across various interaction types, and the results are batch ingested into FeatureStore. Once systems are set up to ingest all relevant embeddings with appropriate TTL, we then write logging code to compute and log derived features such as cosine similarity and dot product between user & post and between user & user. Specifically, in feed, these features represent affinities between post vs viewer and viewer vs author, and analogously between post vs recipient and sender vs recipient in the notifications world. Similarly, we also compute affinities across comment entities to inform activity-based ranking in the newsfeed. The data obtained from feature logging is used to train downstream ML ranking models, to avoid online-offline skew, and the model with the best offline performance lift from the new features is promoted to online A/B test evaluation and then ramped further towards the majority member experience.
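Reducing a pair of embeddings to scalar affinity features that a tree-based model can consume might look like the sketch below. The feature names and the choice to return 0.0 cosine similarity for a zero vector are assumptions made for illustration.

```python
import math
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; defined as 0.0 here if either vector is all zeros."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


def affinity_features(user: List[float], post: List[float]) -> Dict[str, float]:
    """Scalar features a tree-based ranker can consume directly."""
    # Hypothetical feature names.
    return {
        "user_post_dot": dot(user, post),
        "user_post_cosine": cosine(user, post),
    }
```

The same two functions apply to any entity pair (viewer vs author, sender vs recipient, and so on), which keeps the feature logging code uniform across surfaces.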

Challenges & Future

Multiple entity embeddings (user, post, comment, query, etc.) have been successfully integrated into various product surfaces at Nextdoor at large scale. In the past, models based on comment embeddings helped foster and cultivate kinder conversations to improve platform vitality metrics [4]. More recently, contextual topic embeddings have also been developed using BERTopic [5] to achieve coarser-level personalization of content for neighbors, while informing us about the prevalence of content categories and types across the platform. We are also experimenting with image embeddings using CLIP [6] to leverage the image/video information behind content.

In addition, as an extension to labeled fine-tuning, we plan to further improve representations along two dimensions. One is to concatenate representations with additional features such as image embeddings and existing interaction features to leverage multimodal and dense signals. The other is to extend the tasks to other surfaces such as ads, For Sale & Free (marketplace), etc. to make representations more holistic across products. Once downstream models are fully modernized to DNN-based methods, the embeddings can be integrated into the model directly without losing any information to the computation of transformations.

As we build more and more embeddings capturing different signals, we also need to be mindful of the additional costs incurred by new features. Some of the initial challenges of serving high-dimensional vectors during inference are mitigated by performing embedding transformations directly within FeatureStore rather than passing embeddings across microservices, which minimizes network bandwidth and scaling costs. This worked well with tree-based models; in the future, however, serving embeddings directly to DNN models can add cost. Caching and serving fine-tuned embeddings can help control dimensionality while incorporating domain-specific knowledge. This allowed us to rapidly experiment and quickly evaluate ROI at a smaller scale, justifying overall costs. From an infra standpoint, we found that optimizing the payload format of embedding features, as well as the sequencing of calls to efficiently read/write from FeatureStore, greatly reduces overall costs at full scale.

Acknowledgments

This work would not have been possible without close cooperation and collaboration with various ML product partners (Notifications/Feed/Search/Vitality), as well as significant support for the ML platform and the FeatureStore service from the ML Platform team. I would like to take this opportunity to give a huge shoutout to all the dedicated Nextdoor folks from these teams behind this endeavor.

Nextdoor is building the largest Local Knowledge Graph (LKG) in the world. The local knowledge graph inherent in our neighborhoods is Nextdoor’s unique proprietary data that can be used to enable personalized neighborhood and neighbor experiences. The Knowledge Graph team is focused on understanding neighbors and content by creating standardized neighbor/content data using state-of-the-art ML methods.

Third-party large language models (LLMs) such as GPT, and the dialogue applications built upon them such as ChatGPT, lack access to the specific local knowledge of Nextdoor. As a result, they are unable to offer location-based services to our users as we desire. It is crucial for us to develop in-house custom LLMs that leverage our unique local knowledge graphs. We are building our own LLMs on top of Nextdoor’s raw content and the structured knowledge graph to power multiple products.

Please reach out to us if you are interested to learn more — we are hiring!

References

[1] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
[2] https://engblog.nextdoor.com/modernizing-our-search-stack-6a56ab87db4e
[3] https://github.com/nmslib/hnswlib
[4] https://engblog.nextdoor.com/using-predictive-technology-to-foster-constructive-conversations-4af437942bd4
[5] https://maartengr.github.io/BERTopic/api/bertopic.html
[6] https://openai.com/research/clip

From Pre-trained to Fine-tuned: Nextdoor’s Path to Effective Embedding Applications was originally published in Nextdoor Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Securing Diversity in Cybersecurity

Kristen Beneduce — Tue, 02 May 2023 13:01:51 GMT

Panelists from Left to Right: Ronit Polak (Moderator), Kathy Wang*, Lea Kissner, Rupa Parameswaran, Olivia Rose, Jameeka Green Aaron. *Correction: Kathy Wang is the former, not current, CISO of Discord.

At Nextdoor we build technology that empowers resilient, safe, and kind neighborhoods all over the world. Securing a product that empowers global communities requires diverse and inclusive teams, reflective of the communities we support.

Yet hiring and retaining the diverse talent needed to achieve our purpose remains an industry challenge. The gap is particularly evident in the cybersecurity field where 25% of the workforce and 16% of CISOs identify as female. According to the WiCyS State of Inclusion report 2023, women cite lack of respect and limited opportunities for growth in cybersecurity as top challenges accompanying lack of representation. We must keep working on it.

That is why Nextdoor welcomed the chance to celebrate diversity, alongside RSAC 2023, in Nextdoor HQ’s backyard this week and to partner with our neighborhood Women in Cybersecurity (WiCyS) Silicon Valley chapter. We are committed to building a diverse and inclusive workplace, and we are proud to work with organizations like WiCyS, who share the same values.

Nextdoor’s CISO TC Niedzialkowski kicked off with a warm welcome. CEO Sarah Friar framed the discussion by sharing how she launched her career by building a network at her first RSA conference as an equity analyst for Security Software at Goldman Sachs. She emphasized that diverse teams bring a variety of perspectives and experiences to the table, which ultimately leads to better problem-solving and innovation.

Left to Right: Tanvi Kolte Tiwari (WiCyS Silicon Valley Events Chair) introducing the panel, attendees soaking in a fantastic intro by Sarah Friar (Nextdoor CEO), TC Niedzialkowski (Nextdoor CISO) cheering on the panel

Moderator Ronit Polak, WiCyS Silicon Valley President, and CISOs Kathy Wang, Lea Kissner, Rupa Parameswaran, Olivia Rose, and Jameeka Green Aaron, CISSP, wowed the audience, covering everything from combating today’s top cyber threats, including AI, to imposter syndrome with incredible authenticity and humor. Closing us out, Jameeka Green Aaron, CISSP, called on WiCyS members to see themselves in the panelists and to thrive because representation matters!

Attendees ranged from aspiring cybersecurity professionals to a few celebrity leaders and practitioners from across the cybersecurity industry, government, and academia.

Learn more about Nextdoor’s initiatives to foster a holistically inclusive platform, and visit our careers page to see openings at Nextdoor.

Attendees Enjoying the Panel and Nextdoor Space


Catching Anomalies Early in Mobile App Releases

Walt Leung — Wed, 11 Jan 2023 15:27:14 GMT

How Nextdoor catches mobile app release anomalies at 1% adoption

At Nextdoor, our mobile applications on iOS and Android serve content to tens of millions of weekly active users. At this scale, we run a weekly release process for both iOS and Android, shipping hundreds of changes across multiple teams and dozens of mobile engineers.

Our team uses several observability processes and rollout strategies to keep these deployments safe and scalable. We most notably use phased rollouts to minimize the impact of a potentially bad release. Phased rollouts allow us to gradually increase the adoption of users for a new app version. For example, we can have a new app version be released to only 1% of users on the 1st day, 2% of users on the 2nd day, and so on. That way, if a new release were accidentally shipped with an uncaught regression, having it at 1% rollout means it affects fewer users, reduces its severity level, and gives us more time to react.

However, for many of our critical business metrics where a failure can sometimes be silent, most out-of-the-box observability approaches don’t work with phased rollouts. This is largely due to two problems:

Observability typically happens at an aggregate level. For example, we look at app sessions or revenue on a daily basis, across all users for a platform.
The behavior of early adopters on an app version differs from the median behavior of all users. Most importantly, early adopters skew more active: almost by definition, a user has to be active to end up in the early rollout of a new app version.

At Nextdoor, Daily Users are more likely to adopt releases over Weekly Users, Weekly Users over Monthly Users, and so on.

For example, consider an app session regression on a hypothetical iOS version v1.234.5 released March 4. If we had unknowingly introduced a regression where we didn’t count an app session 5% of the time, at a 1% rollout, our aggregate impact would be expected to be roughly 0.05 x 0.01 = 0.0005, or 0.05% of all iOS app sessions, which is practically impossible to detect (read: noise) with aggregate-level observability. Even worse, early app adopters skew more active, which means that maybe we should expect 0.06% of all sessions impacted. Or maybe 0.07% of all sessions impacted. In short, it’s hard to tell exactly what our aggregate impact should be.
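The arithmetic above can be written as a short sketch. The `activity_skew` multiplier is a hypothetical stand-in for how much more active early adopters are than the average user; the post does not give an actual value, which is exactly why the aggregate impact is hard to pin down.

```python
def aggregate_impact(regression_rate, rollout_share, activity_skew=1.0):
    """Approximate share of all sessions lost when only adopters are affected.

    activity_skew is a hypothetical multiplier: values above 1 mean adopters
    generate more sessions per user than the average user.
    """
    return regression_rate * rollout_share * activity_skew


# A 5% session regression at 1% rollout, under three assumed skews.
for skew in (1.0, 1.2, 1.4):
    impact = aggregate_impact(0.05, 0.01, skew)
    print(f"skew={skew}: {impact:.2%} of all iOS app sessions")
```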

However, when iOS release v1.234.5 reaches full rollout in a week, a 5% app session regression would be business critical. We can detect the app sessions drop once it reaches full rollout by looking at week-over-week or month-over-month metrics, but by that point, several days would have passed.

Stacked graph. Top trendline shows our app sessions, which has a clear regression starting March 7 with a low point at March 14. Bottom trendline shows the release adoption over time due to phased rollouts.

How can we detect these issues on day 1, at 1% rollout?

A simple approach would be to normalize our business metrics to the total number of users on the release, and turn all metrics into relative metrics (e.g. on v1.234.5, app sessions per active user per app version). Unfortunately, as mentioned earlier, we can’t directly compare the app sessions from users who have adopted a release to those who haven’t as their underlying characteristics are too different.

What we’re trying to solve for these early adopters is: what is the difference between their actual app sessions after adoption and their hypothetical app sessions had they never adopted the release in the first place (an unobserved counterfactual)? In statistics, we can measure this through difference-in-differences analysis.

For iOS release v1.234.5, app sessions over time of users who adopted the new app release on March 4 (teal) vs app sessions over time of users who did not adopt (gray).

Difference-in-differences analysis is a simple causal inference method we can apply here to estimate this effect by accounting for the separate time varying effects of users that have and have not adopted a release:

For users who adopted the release, calculate the difference in their app sessions between the three days before and the three days after the release period. In this case, we observed a decline of 0.02 in app sessions.
Do the same for users who have not adopted the release. In this case, we observed an increase of 0.20 in app sessions.

Assuming trends would have otherwise remained constant (pre-trend assumption), we would have expected app sessions of release adopters to increase by 0.20, as we observed with non-adopters. Instead, they decreased by 0.02. We calculate the difference in differences to estimate a comparison against an unobserved counterfactual:

-0.02 - 0.20 = -0.22, a 0.22 decrease in app sessions due to iOS release v1.234.5
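As a minimal sketch, the same calculation can be expressed as a function of the four cohort averages. The before/after levels below are made-up values chosen only so that the changes match the -0.02 and +0.20 figures in the text.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change among adopters minus change among non-adopters."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical average app sessions per user for each cohort and period.
effect = diff_in_diff(
    treated_before=3.00, treated_after=2.98,   # adopters: change of -0.02
    control_before=2.00, control_after=2.20,   # non-adopters: change of +0.20
)
print(f"estimated effect of the release: {effect:+.2f}")
```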

In practice, we don’t just calculate this in aggregate. We first make sure that our two cohorts exhibit similar behavior pre-adoption (pre-trend assumption). This is a critical step in difference-in-differences analysis. With a sample size of hundreds of thousands of users, we can achieve high confidence in similar pre-trend behavior with a simple standard deviation bound over the few days preceding adoption. If this behavior holds, we then fit a linear regression model that estimates the average effect of a release for any particular metric:

y = β0 + β1* Time_Period + β2* Treated + β3*(Time_Period*Treated) + e
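In this model, the coefficient β3 on the interaction term is the difference-in-differences estimate. The sketch below fits the regression with ordinary least squares in plain Python on a handful of made-up observations; the data and the small `ols` helper are illustrative assumptions, not the production pipeline, which would also report standard errors for significance testing.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# (time_period, treated, sessions_per_user): hypothetical observations.
rows = [
    (0, 0, 2.00), (0, 0, 2.10), (1, 0, 2.20), (1, 0, 2.30),
    (0, 1, 3.00), (0, 1, 3.10), (1, 1, 2.98), (1, 1, 3.08),
]
# Design matrix columns: intercept, Time_Period, Treated, Time_Period*Treated.
X = [[1.0, t, d, t * d] for t, d, _ in rows]
y = [s for _, _, s in rows]

b0, b1, b2, b3 = ols(X, y)
print(f"time effect b1 = {b1:+.2f}, release effect b3 = {b3:+.2f}")
```

With a fully saturated two-by-two design like this one, β3 equals the simple difference of differences between the four cell means.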

In the case of v1.234.5, we can measure statistically significant negative effects across multiple app sessions metrics.

Average % lift of metrics we ran App Release Anomaly Detection on for v1.234.5

With this difference-in-differences approach, we are now able to flag the app sessions decline due to v1.234.5 on March 5th, 10 days earlier than we normally would have been able to using week over week figures. We also mitigate the need to factor in external variables such as seasonality or day of week. This not only helps in diagnosing the source of the decline to a specific app release, it also isolates the regression to less than 1% of iOS users.

App Release Anomaly Detection allowed us to discover and fix the release regression at 1% rollout, before our aggregate observability even showed a drop.

App Release Anomaly Detection is one of the many tools we’ve built at Nextdoor to give us observability into our releases while iterating quickly. It is one of the foundational elements that allows us to deploy major app releases on a weekly cadence and have confidence in our stability. Operationalized, App Release Anomaly Detection has helped us prevent nearly all severe critical client-side regressions and gives us peace of mind to release bigger changes at a more rapid cadence.

If this type of cross-functional work between platform engineering and data science at scale interests you, we’re hiring! Check out our Careers page for open opportunities across all our teams and functions.

Written by Walt Leung and Shane Butler, with support from Hai Guan, Charissa Rentier, Qi He, and Jonathan Perlow.


Typeahead Search at Nextdoor

Jerry Tian — Wed, 06 Jul 2022 19:49:06 GMT

Background

In a thriving community, people are connected to their friends and local businesses. Nextdoor is the hyperlocal platform that mirrors these offline relationships. Every day, through active discussions on the platform, new relationships are formed and existing ones strengthened.

For example, a Nextdoor user can create a post like “I really like @XYZ cafe. @John is a hard working business owner and we should all support him by buying a cup of delicious latte!” Here, the post is created by at-mentioning (via the @ symbol) nearby businesses and users. From this post, users in the neighborhood can contribute by at-mentioning others to be part of the comment threads. As a result, John’s cafe thrives and acts as a neighborhood hub where new friends are made.

Every month, millions of these mentions are created in various discussions (including lost dogs!). In addition to posts and comments, a user can type into the search box and see, among other things, nearby users and businesses. All these features are powered by the same autocomplete service — a set of APIs to ingest data and handle typeahead search of different entity types (businesses, users, keywords etc) on Nextdoor.

This post focuses on how we built a proximity-based typeahead service to power typeahead use cases at Nextdoor.

Proximity-Based Typeahead Search as a Service

Any good search experience can be boiled down to two core components:

1. Relevance. Given a search query, whether the user sees relevant results or not. As a hyperlocal social network, relevancy is heavily weighted by geo proximity.

2. Low latency. Google Search found that a 400 millisecond delay resulted in a -0.59% change in searches/user. What’s more, even after the delay was removed, these users still had -0.21% fewer searches, indicating that a slower user experience affects long term behavior.

For a good autocomplete experience, as users type, relevant results should show up instantaneously.

To meet the product requirements, we set out to build a service with the following design goals:

Low latency. There are hundreds of millions of entities on Nextdoor. The search latency at the service level should be less than 50ms.
Horizontally scalable to meet future scaling needs (we scale by adding more nodes).
Extensible. Typeahead search is a foundational API that enables other product features, so it should be easy to add other types of entities in the future.
High throughput for writes. We want to be able to index hundreds of millions of entities in a matter of hours.
Ease of operation and maintenance. When we index records, we should not impact production traffic.

Implementation

We landed on an in-memory-based solution that leverages geohash. At a high level, geohashing divides the earth into multiple zones based on latitude and longitude. It provides a good way to shard a large data set into buckets based on a zoomed-in level. Entities in the same bucket are in close proximity.

We used H3, Uber’s open source geospatial indexing library, for geohashing.
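To illustrate the bucketing idea without depending on H3, here is a minimal sketch that uses a plain latitude/longitude grid. This is a simplified stand-in, not H3's hexagonal cells, and the `cell_deg` resolution and function names are assumptions made for the example.

```python
import math


def grid_cell(lat, lng, cell_deg=0.1):
    """Map a coordinate to a square grid bucket (a stand-in for an H3 cell)."""
    return (math.floor(lat / cell_deg), math.floor(lng / cell_deg))


def neighboring_cells(cell):
    """The cell itself plus its 8 surrounding cells, for nearby lookups."""
    row, col = cell
    return [(row + dr, col + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


# Two nearby points land in the same bucket; a distant one does not.
a = grid_cell(47.61, -122.33)
b = grid_cell(47.65, -122.39)
c = grid_cell(37.77, -122.42)
print(a == b, a == c)
```

Searching the neighboring cells as well as the user's own cell avoids missing entities that sit just across a bucket boundary.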

For handling typeahead search, we decided to use Redis sorted sets. This gives a set of benefits:

In-memory storage gives us the best possible latency characteristics for handling typeahead search.
It is easy to maintain. We can rely on Redis persistence without having to handle durability ourselves.
By following the Command Query Responsibility Segregation (CQRS) pattern, we are able to index hundreds of millions of entities in a matter of hours with no impact on the serving of production traffic. Ingestion is handled by the Redis primary nodes in the cluster, and updates are then replicated to the read-only nodes, which handle the typeahead search. Replication lag is less than 10ms.
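To show why a sorted set suits prefix lookups, the sketch below imitates a lexicographically ordered set in plain Python with `bisect`. It is an in-process illustration only; the class name and the `name:id` member format are assumptions, and the post does not say which sorted-set commands the service uses.

```python
import bisect


class LexSortedSet:
    """Tiny in-memory imitation of a lexicographically ordered set."""

    def __init__(self):
        self._members = []

    def add(self, member):
        pos = bisect.bisect_left(self._members, member)
        if pos == len(self._members) or self._members[pos] != member:
            self._members.insert(pos, member)

    def prefix_range(self, prefix, limit=10):
        # All members starting with `prefix` sit in one contiguous slice.
        lo = bisect.bisect_left(self._members, prefix)
        hi = bisect.bisect_left(self._members, prefix + "\uffff")
        return self._members[lo:min(hi, lo + limit)]


businesses = LexSortedSet()
for member in ("starbucks:5", "subway:7", "state farm:9"):
    businesses.add(member)

print(businesses.prefix_range("sta"))
```

Because members are kept in order, a prefix query is two binary searches plus a slice, independent of how many non-matching entries the set holds.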

With these two core pieces in place, we built a set of APIs that work together to handle all aspects of typeahead search:

* indexing_api (ingestion)

* typeahead_api (search)

* ranking_api (ranking by entity types)

Ingestion Path

Here is an example of the ingestion flow for businesses (Starbucks with id: 5, latitude: 47, and longitude -122):

For users, the typeahead search API works for both first- or last-name prefixes. Here is an example of the ingestion flow for users (Steve Jobs with id 4, latitude 47, and longitude -122):

Query Path

With the above structure in place, typeahead search retrieves results with a simple look-up using entity type, geohash key, and prefix. We then hydrate and rank the results before returning them to the client.
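The ingestion and query paths can be sketched together as follows. The key layout (`entity_type:cell`), the `token:id` member format, and the `cell_47_-122` placeholder are hypothetical; they only illustrate a look-up by entity type, geohash key, and prefix, with users indexed under both first and last name.

```python
import bisect
from collections import defaultdict

# key "entity_type:cell" -> sorted list of "token:entity_id" members
index = defaultdict(list)


def ingest(entity_type, cell, entity_id, name):
    """Index every name token so first- and last-name prefixes both match."""
    key = f"{entity_type}:{cell}"
    for token in name.lower().split():
        member = f"{token}:{entity_id}"
        pos = bisect.bisect_left(index[key], member)
        if pos == len(index[key]) or index[key][pos] != member:
            index[key].insert(pos, member)


def typeahead(entity_type, cell, prefix, limit=10):
    """Return entity ids in one cell whose tokens start with the prefix."""
    members = index.get(f"{entity_type}:{cell}", [])
    prefix = prefix.lower()
    ids = []
    for member in members[bisect.bisect_left(members, prefix):]:
        if not member.startswith(prefix) or len(ids) == limit:
            break
        entity_id = int(member.rsplit(":", 1)[1])
        if entity_id not in ids:
            ids.append(entity_id)
    return ids


cell = "cell_47_-122"  # placeholder for the geohash of (47, -122)
ingest("business", cell, 5, "Starbucks")
ingest("user", cell, 4, "Steve Jobs")

print(typeahead("user", cell, "jo"), typeahead("business", cell, "star"))
```

In the real service the returned ids would then be hydrated and ranked before being sent to the client.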

What we have today

The service has been running since August 2021. Every month we are handling hundreds of millions of typeahead search requests, and millions of comments with at-mentions are created. The service level for P95 search latency is less than 30ms.

Future work: typeahead for 1+ degree connections

To handle typeahead search with 1+ degrees (friends of friends), we can:

Get a list of 1st degree connections.
For each user in the connections from step 1, get their connections.
Aggregate these connections and perform typeahead by prefix.
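The three steps above can be sketched in plain Python as follows. The in-memory `graph` and `names` dictionaries are hypothetical stand-ins for the connection store and user index.

```python
def second_degree_typeahead(graph, names, user_id, prefix, limit=10):
    """Prefix search over a user's friends and friends of friends."""
    # Step 1: first-degree connections.
    first_degree = graph.get(user_id, set())
    # Step 2: connections of each first-degree connection.
    candidates = set(first_degree)
    for friend in first_degree:
        candidates |= graph.get(friend, set())
    candidates.discard(user_id)
    # Step 3: aggregate and match by first- or last-name prefix.
    prefix = prefix.lower()
    matches = [
        uid for uid in candidates
        if any(tok.startswith(prefix) for tok in names[uid].lower().split())
    ]
    return sorted(matches, key=lambda uid: names[uid])[:limit]


graph = {1: {2, 3}, 2: {1, 4}, 3: {1, 5}, 4: {2}, 5: {3}}
names = {1: "Ann Lee", 2: "Bob Stone", 3: "Cara Diaz",
         4: "Steve Jobs", 5: "Stella Park"}

print(second_degree_typeahead(graph, names, user_id=1, prefix="st"))
```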

To reduce the network round trip between the first and successive calls, we can leverage Lua for edge computing.

Acknowledgement

It takes a team to move mountains! I would like to take the opportunity to give a shoutout to all the dedicated Nextdoor folks behind this endeavor:

Shivam Bhalla, Stephen Cheng, Yuki Mizuno, Rajesh Balasa, Siva Pandeti, Uzair Khan, Sharvil Parekh, Hung Dao, Josh Sibelman, Bojan Babic, Jane Wang, Sudhanshu Siddh, Omer Palaz, Kristy Duong, Tristan Eastburn, Paul Meng, Cory Dolphin, Andrew Munn, Tim Wong, Chintan Shah, Rahul Sureka, Madeline Neveaux, Murali Krishna Hosabettu Kamalesha, Glen Tona, and Avinash Chukka.

And by the way, we are hiring!


Modernizing Nextdoor Search Stack — Part 2

Bojan Babic — Fri, 20 May 2022 20:36:28 GMT


In our last blog post of the Modernizing Nextdoor Search Stack series, we explained Query Understanding and the ML models that power our Query Understanding Engine. We also covered the nuances of search at Nextdoor and what it takes to understand customer intent. This time, we will focus on the retrieval and ranking of search results.

Retrieval

Retrieval of information can take many forms. Users can express their information needs in the form of a text query — by typing into a search bar, by selecting a query from autocomplete, or in some cases a query may not even be explicit. Retrieval can involve ranking existing pieces of content, such as documents or short-text answers, or composing new responses incorporating retrieved information. At Nextdoor, we work on the information retrieval given the features that we capture or infer from the Query Understanding stage.

The Query Understanding stage provides us with rich data about the customer intent and context. Query Understanding metadata consists of raw information we get from the user such as device, location, time of the day, day of the week, query itself, expanded queries, embedded version of the query, intent for the query, predicted vertical, and predicted topic of the query, to name a few.

We combine all of this information in the form of a query that we use for the recall. Our underlying retrieval engine is Elasticsearch. Considering the scale of Nextdoor and the amount of data that we produce each day, we need to ensure that the system that we build meets latency requirements. For that purpose, data for our verticals is split into multiple indices and redistributed across Elasticsearch clusters. A high-level overview of our search indices is shown in the following diagram.

Nextdoor Search Recall

All of our reads and writes to Elasticsearch run through the homegrown Nextdoor Elasticsearch Service (NESS). Our engine uses the rich metadata from Query Understanding mentioned above, and for each vertical that we support, we construct comprehensive Elasticsearch queries that are responsible for the retrieval. These queries are fanned out to their respective clusters to get the candidates that we will use for ranking. In order to scale operations of the Elasticsearch cluster, at this phase we operate only with entity ids. In other words, retrieval returns only entity ids.

Ranking

Once we get the results from the retrieval engine, the ordering of the candidates is still suboptimal. Hence, we introduce a ranking stage in order to optimize search results based on the document features and customer preferences.

When engineers introduce ML ranking stages in existing systems, the complexity of the system increases. Newly introduced stages require that feature fetching and inference be extremely performant. The ranking features for the entities need to be at our disposal with minimal impact on the performance of search. For this, we leverage the Real-Time version of our Feature Store. Feature retrieval happens in the stage before we run inference. We call the Feature Store to get the respective raw features, statistical features, or embedding representations of the user, hoods, documents, and user-document interactions.

The Nextdoor platform was built more than 10 years ago. Since then, it has collected tons of data, which means the search team has a lot of data to train our models on. When it comes to ranking, our team A/B tested ML ranking with various features and ranking algorithms.

To order the search results, we must find the best order for the list of documents for a given user, otherwise known as the ranking problem. The ranking problem is defined as a derivation of ordering over a list of examples that maximizes the utility of the entire list.

For this purpose, we used Learning To Rank (LTR). LTR attempts to learn, from labeled data, a scoring function that maps example feature vectors to real-valued scores. The labeled data consists of documents D = {d1, d2, ..., dn} and their respective input features Xi, where i is between 1 and n.

Given the list, the aim of the LTR is to find an optimal scoring function F such that loss over the objective function is minimal. Features of the documents are both traditional and deep.

To find the right discrimination function between positive and negative samples we use a surrogate problem, which is how to fit the hyperplane between positive and negative training samples. However, loss function and real-world evaluation metrics are not easily comparable. Bridging the gap between evaluation metrics and loss functions has been studied actively in the past. LambdaRank, or its tree-based variant LambdaMART, has been one of the most effective algorithms to incorporate ranking metrics in the learning procedure.

The basic idea is to dynamically adjust the loss during the training based on ranking metrics.

Image from appliedmachinelearning.blog
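As a simplified illustration of that idea, the sketch below weights each pairwise gradient by how much NDCG would change if the two documents swapped positions. It is a teaching sketch of the LambdaRank weighting under the stated assumptions (items listed in their current ranked order, gain 2^rel - 1), not LightGBM's implementation.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))


def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def delta_ndcg(relevances, i, j):
    """Absolute NDCG change if the items at positions i and j were swapped."""
    swapped = list(relevances)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(ndcg(swapped) - ndcg(relevances))


def pair_lambda(scores, relevances, i, j, sigma=1.0):
    """Gradient magnitude for a pair where item i is more relevant than item j."""
    rho = 1.0 / (1.0 + math.exp(sigma * (scores[i] - scores[j])))
    return sigma * rho * delta_ndcg(relevances, i, j)


# Current ranking (by score) puts an irrelevant document first.
scores = [0.9, 0.5, 0.1]
relevances = [0, 2, 1]

print(round(ndcg(relevances), 3))
print(pair_lambda(scores, relevances, 1, 0) > pair_lambda(scores, relevances, 1, 2))
```

The misordered pair at the top of the list receives the larger weight, which is how the ranking metric steers training toward the swaps that matter most.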

At Nextdoor, we leverage the LGBMRanker implementation from LightGBM. During offline and online evaluation, this model proved to be a great candidate given the tradeoff between accuracy and latency.

Final thoughts

Working on search is very exciting, and there are many challenges, from scaling the retrieval infrastructure and building ML model inference to the ever-changing nature of search at Nextdoor. During this work we managed to bring an amazing group of people together to build a better search experience for our neighbors, with the overall goal to build a kinder world.

At the same time, there are many interesting challenges ahead of us with regards to scaling our infrastructure, understanding our customers better, introducing new stages of the search ranking, like multi-objective optimization and reinforcement-learning, just to name a few.

If you find any of these topics interesting and you would like to join our team, please visit our careers page to find our open roles.


Using predictive technology to foster constructive conversations

Cathleen Li — Tue, 03 May 2022 16:57:56 GMT

Nextdoor’s purpose is to cultivate a kinder world where everyone has a neighborhood they can rely on. We want to give neighbors ways to connect and be kind to each other, online and in real life. One of the biggest levers we have for cultivating more neighborly interactions is by building strategic nudges throughout the product to encourage kinder conversations.

Today, we use a number of mechanisms to encourage kindness on the platform, including pop-up reminders that slow neighbors down before responding negatively. Over the past few years, we’ve used machine learning models to identify uncivil and contentious content.

Nextdoor’s definition of harmful and hurtful content is anything uncivil, fraudulent, unsafe, or unwelcoming, including personal attacks, misinformation, and discrimination. In partnership with key experts and academics, we identified various moments that add friction on the platform and implemented those findings with our machine learning technology. Our goal: to encourage neighbors to conduct more mindful conversations. What if we could be proactive and intervene before conversations spark more abusive responses? Oftentimes, unkind comments beget more unkind comments. 90% of abusive comments appear in a thread with another abusive comment, and 50% of abusive comments appear in a thread with 13+ other abusive comments.* By preventing some of these comments before they happen, we can avoid the resulting negative feedback loops.

Nextdoor’s thread model was built to identify potentially contentious conversations, and where intervention might prevent abusive content. The Kindness Reminder, introduced in 2019, and Anti-Racism Notification, launched in 2021, automatically detect offensive or racist language in written comments and encourage the author to edit before it is published. The new Constructive Conversations Reminder uses predictive technology to anticipate when a comment thread may become heated before a neighbor contributes. Below we will share details on how we build the model powering the intervention tools.

What is a thread? Nextdoor conversation threads occur inside a post. Once a neighbor creates a post, other neighbors can comment on the post or respond to each other’s comments.

Neighbors see the comments ordered sequentially and can reply to existing comments. Often, tagging is also used to clarify when a comment is replying to a previous comment further back in the conversation.

The multiple dimensions of this conversation created some complexity around how we define each comment’s parent node and traverse along the parent nodes to recreate the conversation thread. Based on our analysis, we found that the most predictive representation considers all the relationships: mentions or tags, reply to comment hierarchy, and sequential order.
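One way to picture the parent-node traversal is the sketch below. The field names (`reply_to`, `mentions`, `author`) and the priority order (explicit reply first, then mention, then sequential order) are assumptions for illustration; the post only states that all three relationships are considered.

```python
def find_parent(comments, idx):
    """Pick a parent index for comment idx, or None for the first comment."""
    comment = comments[idx]
    # 1. Explicit reply-to hierarchy (must point to an earlier comment).
    reply_to = comment.get("reply_to")
    if reply_to is not None and reply_to < idx:
        return reply_to
    # 2. Mentions or tags: the most recent earlier comment by a tagged author.
    for mention in comment.get("mentions", []):
        for prev in range(idx - 1, -1, -1):
            if comments[prev]["author"] == mention:
                return prev
    # 3. Fall back to sequential order.
    return idx - 1 if idx > 0 else None


def thread_for(comments, idx):
    """Walk parent links back to the start; return indices oldest first."""
    chain = []
    current = idx
    while current is not None:
        chain.append(current)
        current = find_parent(comments, current)
    return chain[::-1]


comments = [
    {"author": "a"},
    {"author": "b"},
    {"author": "c", "reply_to": 0},
    {"author": "a", "mentions": ["c"]},
]
print(thread_for(comments, 3))
```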

Data

Labels

How do we identify when a conversation is becoming contentious? On Nextdoor, neighbors can report content, and volunteer community moderators help to review and remove content based on our Community Guidelines. For some cases, such as misinformation and discrimination, these reports are sent directly to our trained Neighborhood Operation Staff to review. For the purpose of this model, we decided to use reporting rather than removal as a signal because we wanted a tool to detect early signals of conversations going off the tracks. Regardless of the moderator’s decision to keep or remove a comment, we can assume that if the conversation triggers a report, it will warrant intervention to prevent potentially contentious responses from being created in that thread. Therefore, for each thread, we chose the creation of a subsequent reported comment as a positive label.
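A minimal sketch of this labeling rule follows. It reads "subsequent reported comment" as any later comment in the thread being reported; whether the actual labels look only at the next comment or at all later comments is not stated, so treat that choice as an assumption.

```python
def label_thread(reported_flags):
    """reported_flags[k] is True if the k-th comment in a thread was reported.

    Returns one label per comment: 1 if any later comment in the same
    thread was reported, else 0.
    """
    labels = []
    later_reported = False
    # Walk backwards so each comment only sees what came after it.
    for flag in reversed(reported_flags):
        labels.append(1 if later_reported else 0)
        later_reported = later_reported or flag
    return labels[::-1]


# Third comment was reported, so the first two positions are positive examples.
print(label_thread([False, False, True, False]))
```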

Sampling

Our training data was sampled across multiple months. Less than 1% of comments get reported, and the reported comments tend to cluster together, so this is an imbalanced dataset.* To improve the model, we needed to oversample the positive labels, but do so in a way that results in a representative distribution of comments from different threads. Comments, especially reported comments, tend to cluster around larger, trending threads, and if we sampled randomly, we knew we might not get data representative of the many different types of conversations on Nextdoor. We took two steps in the sampling to create a balanced representation of comments from different threads with positive/negative labels. First we sampled by post, and then within the comment thread for each post, we sampled both negative and positive labels.
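The two-step sampling can be sketched as below. The per-post caps `n_pos` and `n_neg`, the data layout, and the function name are hypothetical; the point is only that sampling posts first keeps one large, trending thread from dominating the training set.

```python
import random


def two_stage_sample(threads, n_posts, n_pos, n_neg, seed=0):
    """threads maps post_id -> list of (comment_id, label) pairs."""
    rng = random.Random(seed)
    # Stage 1: sample posts, so every thread has the same chance to appear.
    post_ids = rng.sample(sorted(threads), min(n_posts, len(threads)))
    sample = []
    for post_id in post_ids:
        positives = [c for c in threads[post_id] if c[1] == 1]
        negatives = [c for c in threads[post_id] if c[1] == 0]
        # Stage 2: cap how many positives and negatives one thread contributes.
        sample += rng.sample(positives, min(n_pos, len(positives)))
        sample += rng.sample(negatives, min(n_neg, len(negatives)))
    return sample


threads = {
    "p1": [(i, 1 if i % 10 == 0 else 0) for i in range(100)],  # large thread
    "p2": [(100, 0), (101, 1)],
    "p3": [(200, 0)],
}
sample = two_stage_sample(threads, n_posts=3, n_pos=2, n_neg=2)
print(len(sample), sum(label for _, label in sample))
```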

Model architecture

Features

The key feature in our model is the comment text. There are other features, such as the number of reports on a neighbor’s previous content and comment creation velocity, that add signal, but we found that most of the AUC gains can be made by picking the appropriate thread structure and generating text embeddings from that structure. In future iterations of the model, we aim to make the other features available to the model.

Embedding selection

The embedding, which creates a vector representation of the texts, is an important component of the features. We considered using two different technologies for the embeddings, Fasttext and Bert, comparing their pros and cons listed below:

One of the advantages of Bert is that we can leverage pretrained multilingual aligned embeddings that allow a task model trained on U.S. data alone to perform in other languages and countries. Although the Bert model has higher latency than Fasttext, we ultimately decided it was worth the trade-off because the new features and applications we’ll be running off the model can run asynchronously.

The architecture of the system is described below. We’re able to tolerate the higher latency at inference time by pre-generating and storing the embedding features, caching scores to be consumed later, and delaying downstream tasks that depend on the embeddings.

The classifier itself is a simple dense neural layer built on top of the concatenated embeddings:
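A classifier of this shape can be sketched in a few lines of plain Python. The single sigmoid output unit, the two-dimensional toy embeddings, and the weight values are assumptions for illustration; the real model's layer sizes and weights are not given in the post.

```python
import math


def predict(embeddings, weights, bias):
    """Score a thread from its comment embeddings with one dense unit."""
    # Concatenate the per-comment embeddings into a single feature vector.
    features = [x for emb in embeddings for x in emb]
    if len(features) != len(weights):
        raise ValueError("weights must match the concatenated embedding size")
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    # Sigmoid turns the logit into a probability-like risk score.
    return 1.0 / (1.0 + math.exp(-logit))


# Toy example: two comments, each with a 2-dimensional embedding.
thread_embeddings = [[1.0, 0.0], [0.0, 1.0]]
weights = [1.0, -1.0, 2.0, 0.5]
score = predict(thread_embeddings, weights, bias=-0.5)
print(round(score, 3))
```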

We built the embedding features using the sBert API, and experimented with various fine-tuning approaches that might improve the performance. We will discuss the exploration of multilingual embeddings in future blog posts.

Model performance

This model was tasked with predicting whether a future comment on a thread will be abusive. This is a difficult task without any features provided on the target comment. Despite the challenges of this task, the model had a relatively high AUC over 0.83, and was able to achieve double digit precision and recall at certain thresholds.

Below are some examples of comments from threads, and how the model predicted the abusive risk level for these threads:

We can see the model is generally able to identify when comments become more contentious. There is an overall limit on the model’s precision, due to the low incidence of reporting and the challenge of predicting an unknown future comment. Not all abusive content gets reported, as reporting is primarily driven by the norms of the community and by neighbor awareness of the reporting feature. As a result, we may see a higher false positive rate (unreported comments that the model labels as highly contentious). A human review of a random sample of this group suggests that these false positive comments are often similar in contentiousness to the true positives.

Internationalization

Once we were able to validate the model in the U.S., the next step was internationalization. The Bert model we selected includes multilingual-aligned embeddings, which were built using a teacher-student model. This model maps text with similar meaning in different languages to nearby points in the same vector space. For example, “Hello neighbors!” and “Hola vecinos!” would map to similar vectors.

We found through both offline validation and online A/B testing that even in the U.S., tasks trained on multilingual-aligned embeddings can perform just as well as those trained on English-only embeddings.

We also tested the U.S. model on data from other languages and countries because as Nextdoor continues to expand to other countries and languages, we want our models to be available in each market. We evaluated the U.S.-trained model on multiple countries in Europe where we have a relatively higher adoption rate.

In all countries, with the exception of the Netherlands, we found that the AUC was quite close to U.S. levels. Even in the Netherlands, the performance provided enough signal to test on our products. We did notice a slightly lower precision and recall rate overall in Europe, which could be due to lower reporting rates as compared to the U.S.

Below are some examples of the texts from international threads that were flagged as potentially turning contentious (modified to protect privacy). As you can see, despite cultural and language differences, for the most part, the model was still able to pick up on contentious conversations.

“I don’t share your faith in Brexit …”
“Clearly ignorant of science and facts and harming their own business.”
“Your own comment contradicts itself…”
“La vérité fâche…” (the truth hurts)
“je draait om mijn vraag heen” (you’re avoiding my question)
“io ti ho scritto esattamente ció che avevo scritto in quei commenti che dici che ho cancellato…” (I wrote you exactly what I wrote in those comments that you say I deleted)

Ultimately, the performance won’t match the same level as models trained directly on international data, but the current performance is sufficient signal for some of our moderation intervention tools. One caveat for this analysis is that we only evaluated on western countries where we had adequate data. Therefore, it is unclear whether or not these results will translate for non-western cultures and languages.

The ability to predict abusive threads in other languages means that even in countries with sparse data for training, we can transfer what we’ve learned from U.S. data for intervention signals. This will allow us to expand our moderation tools across the globe.

Impact

We found that this signal, when accurately predicted, can be leveraged by a variety of different product tools to decrease uncivil content:

Comment notification suppression: If a conversation is going awry, suppress notification on the triggering comment
Constructive Conversations Reminder: Prompt neighbors to take an empathetic stance when they are about to comment in a contentious thread
Prompt author to close discussion: Remind the post authors they have the ability to close discussion

Once we deployed the model, we were able to start testing some of the intervention methods mentioned above. So far, our results have demonstrated that the model can perform quite well. Comment notification suppression has been rolled out, and Constructive Conversations Reminder has begun rolling out to neighbors in the U.S.

We hope to continue to leverage these findings to expand our toolbox and ensure Nextdoor is a kind and welcoming platform for all neighbors. This work wouldn’t have been possible without the help of Sugin Lou, Karthik Jayasurya, the CoreML team, and the Moderation Team engineers who built the products that the model powers. We continue to partner with leading academics and experts in the fields of social psychology, equality, and civic engagement on our Neighborhood Vitality Advisory Board. Learn more about Nextdoor’s product and policy initiatives to foster a holistically inclusive platform on the Nextdoor Blog.

If you are passionate about solving problems that empower local communities and encourage civic engagement, please check our careers page and come join us!

*Source: Nextdoor internal data, based on Q3 2021 data primarily based in US

Using predictive technology to foster constructive conversations was originally published in Nextdoor Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Launch Control at Nextdoor

Luiz Scheidegger — Tue, 15 Mar 2022 18:02:30 GMT

How engineers configure and deploy A/B tests and feature flags

In this article, we share our experience building Launch Control, Nextdoor’s combined feature flagging and experiment configuration tool. One of Nextdoor’s core values is “Experiment and Learn Quickly”, and one of our engineering principles is “Move Fast — Build Iteratively”. We believe fast iteration on our products and features is a great way to bring better value to our neighbors around the world. Teams at Nextdoor routinely use data from countless experiments to inform product improvements. Moreover, in an environment where it’s impractical to ship native mobile apps more than about once per week, we also make frequent use of feature flags as a way to safely and gradually release new products to our neighbors. Both of these needs — experimentation and feature flagging — require robust internal tools and strong developer education to be used at scale.

One of the most distinctive things about Launch Control is that it was built through a strong, ongoing cross-functional collaboration between engineers from many different teams. Its creation came about when a backend product engineer identified opportunities to make the legacy AB and Feature Config tools better. Although they didn’t officially work on internal tools, we strongly believe in ownership and empowerment at Nextdoor, so we ensured they had the space and support to quickly iterate on a prototype. Once this prototype had enough features to get adoption, we made more and more room for that engineer to contribute, and made sure to recognize their impact on Nextdoor engineering.

After this, Launch Control grew into a shared labor of love across Nextdoor — it has key contributions from many of our best engineers across different teams and stacks, from its core backend components, through a delightful React-powered user interface, all the way to its APIs that integrate experiments and feature flags into our Android and iOS mobile apps. A recurring ritual in Launch Control development is to empower and support engineers who identify ongoing feature improvements to make those improvements themselves. This generates a strong sense of camaraderie across teams and helps spread technical knowledge of how the Launch Control stack works.

Evolving two internal tools into one

Before Launch Control, we used two separate tools for experimentation and feature flagging, creatively named Feature Config and AB. These tools suffered from a number of technical limitations. For example, Feature Config allowed engineers to use a rich set of user features, such as their Nextdoor neighborhood, country, or app version, but only produced a binary true/false decision, making it impractical for experiments which need multiple treatment groups. In contrast, the AB tool could output custom treatment groups, but its targeting capabilities were limited to basic percentage-based rollouts.

Additionally, both tools had sparse and uninviting user interfaces, making it difficult for non-technical users to read, or contribute to, experiments and flags. Launch Control was designed to supersede these tools, with a friendly user interface, and a rich set of targeting capabilities, as well as support for arbitrary treatment groups.

Launch Control: easy to understand user targeting

At its core, each individual Launch Control experiment represents a function whose inputs are a set of parameters about a Nextdoor neighbor, such as their id, city, mobile platform, etc., and whose output is a specific treatment group:

Each Launch Control experiment encodes a mapping between parameters about a Nextdoor neighbor and a specific treatment group for that experiment.

When deciding how to encode these functions, we tried to strike a balance between simplicity and expressive power. Our legacy Feature Config tool, for example, allows arbitrarily nested boolean and/or clauses which check for things such as country allow-lists and percentage rollouts. Unfortunately, that tool made it difficult to express complex relationships in a readable way, leading us to occasionally mistarget rollouts, and discouraging non-technical employees from participating in experimentation.

For Launch Control, we found that a non-nesting linear sequence of Condition Blocks, each of which is a combination of targeting features, is a great balance between readability and targeting power. When evaluating a particular Launch Control experiment, the algorithm iterates over the experiment’s Condition Blocks, stopping at the first block which successfully captures a user. That block then assigns a specific treatment group to the user, and evaluation ends. If no Condition Block captures a user, Launch Control automatically returns the reserved “untreated” treatment group for that user, indicating that they are not part of the A/B test, or should not get the feature flag in question.

Condition Blocks are and-combinations of individual targeting features, evaluated from top to bottom. Launch Control assigns a treatment group based on the first Condition Block that “captures” a user.
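
The evaluation loop described above can be sketched in a few lines of Python. This is an illustrative sketch and not Launch Control's actual code: the `ConditionBlock` class, its `captures` predicate, and the example experiment are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

UNTREATED = "untreated"  # reserved group returned when no block captures the user


@dataclass
class ConditionBlock:
    # Hypothetical shape: a predicate over user parameters plus the
    # treatment group assigned when the predicate matches.
    captures: Callable[[Dict[str, object]], bool]
    treatment: str


def evaluate(blocks: List[ConditionBlock], user: Dict[str, object]) -> str:
    """Return the treatment of the first block that captures the user."""
    for block in blocks:
        if block.captures(user):
            return block.treatment
    return UNTREATED


# Example experiment: an individual allow-list first, then a geo/platform block.
experiment = [
    ConditionBlock(lambda u: u["id"] in {1, 2, 3}, "new_design"),
    ConditionBlock(lambda u: u["country"] == "US" and u["platform"] == "ios", "control"),
]

print(evaluate(experiment, {"id": 2, "country": "NL", "platform": "web"}))  # new_design
print(evaluate(experiment, {"id": 9, "country": "US", "platform": "ios"}))  # control
print(evaluate(experiment, {"id": 9, "country": "NL", "platform": "web"}))  # untreated
```

Because evaluation stops at the first match, the order of the blocks is part of the experiment definition: moving the allow-list below the geo block would change the result for allow-listed users in the US on iOS.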

Condition Blocks: the heart of Launch Control

Launch Control supports many different types of Condition Blocks, which allow for targeting based on a variety of user features:

Individual allow-lists: These blocks allow us to target individual Nextdoor users. This is particularly useful in the early stages of developing a new feature, as we can target internal team members for dogfooding and testing long before the feature is ready for prime-time.
Percentage-based rollouts: Percentage rollouts allow us to encode things like standard A/B tests, where we introduce a new feature to a small subset of our users, comparing those users’ key metrics with a similar-sized control group. In addition, percentage rollouts also give us the ability to gradually release improvements to our users, a few percentage points at a time.
Geo-targeting: Some Condition Blocks allow us to target users based on their neighborhood’s city, state, and country. This allows us to iterate on features we decide to launch at different times for different markets (e.g., if we want to iterate on copy for international markets separately from the US).
Delegation: It’s possible to configure a Condition Block that delegates its decision to another Launch Control experiment. This allows teams to build a hierarchy of flags, providing an easy-to-use mechanism for feature switches that are progressively broader.

There is some interesting nuance in how a stochastic, percentage-based Condition Block needs to work. For most features, we expect the same user to always be either in, or out, of an experiment. It would be unreasonable to make a random selection every time a user is evaluated in a condition block, as users would be surprised when they occasionally find themselves jumping between having and not having a new product experience! Launch Control ensures this random-but-deterministic behavior by using a hash of users’ ids to resolve membership in probabilistic blocks. Briefly, that computation looks like this:

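from hashlib import sha1

def in_rollout(user_id, rollout_percentage):  # rollout_percentage is a fraction in [0, 1]
    hashed_value = int(sha1(str(user_id).encode()).hexdigest(), 16)
    hashed_probability = (hashed_value % 10000) / 10000
    return hashed_probability < rollout_percentage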

The method above ensures users always get consistent treatment groups from each Condition Block, without the need to store an explicit mapping between users and treatment groups on a database table or in memory at any time. For completeness, Launch Control does also offer a “Dice” block, which is a truly random determination, although its use cases are relatively rare.

However, even with a hash-based approach, there is still an additional trap we need to avoid: although the same user should get the same treatment group for a particular Condition Block, we expect users to get different treatment groups for different Condition Blocks. At any time, there may be dozens of different features, all configured as, e.g., 50/50 splits. If we rely on user ids alone for hashing, we would run into users that end up either in all features, or no features at all! Launch Control avoids this by concatenating a per-Condition Block salt to user ids before hashing them. This guarantees each user always gets the same result from a particular Condition Block, while getting potentially different results from different Condition Blocks. When we incorporate a salt into the evaluation, the previous algorithm looks like this:

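from hashlib import sha1

def in_rollout(user_id, condition_block_salt, rollout_percentage):
    salted_string = f'{condition_block_salt}{user_id}'
    hashed_value = int(sha1(salted_string.encode()).hexdigest(), 16)
    hashed_probability = (hashed_value % 10000) / 10000
    return hashed_probability < rollout_percentage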
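
To see what the salt buys us, the self-contained sketch below runs the same set of user ids through two 50/50 blocks with different salts. The salts (`block-a`, `block-b`) and the user id range are made up for illustration, and `rollout_percentage` is treated as a fraction between 0 and 1, as the comparison above implies.

```python
from hashlib import sha1


def in_rollout(user_id: int, salt: str, rollout_percentage: float) -> bool:
    """Deterministically decide membership from a salted hash of the user id."""
    salted_string = f"{salt}{user_id}"
    hashed_value = int(sha1(salted_string.encode("utf-8")).hexdigest(), 16)
    hashed_probability = (hashed_value % 10000) / 10000
    return hashed_probability < rollout_percentage


users = range(10000)
block_a = {u for u in users if in_rollout(u, "block-a", 0.5)}
block_b = {u for u in users if in_rollout(u, "block-b", 0.5)}

# Same salt, same user: the answer never changes between calls.
stable = in_rollout(42, "block-a", 0.5) == in_rollout(42, "block-a", 0.5)

share_a = len(block_a) / len(users)             # close to 0.5
share_both = len(block_a & block_b) / len(users)  # close to 0.25 if the blocks are independent
print(stable, share_a, share_both)
```

If both blocks used the same salt (or none at all), `block_a` and `block_b` would be identical sets, which is exactly the "in all features, or no features at all" trap.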

The targeting engine described above is only one component of Launch Control. We also provide performant, easy-to-use APIs for engineers to query experiments across our tech stack, in our application backend as well as our React frontend and native Android and iOS clients.

Backend vs. Frontend Launch Control APIs

Nextdoor engineers query Launch Control experiments in many places in our codebase: in our backend as we serve web requests to our clients, as well as in our desktop and mobile-web front ends, and finally in our native Android and iOS applications. In all cases, we provide APIs with two specific goals in mind: ease of use and fast, reliable performance. It is imperative that evaluating an experiment take no more than a few hundred microseconds, as complex user features like our main feed require dozens of individual flags and experiments to render.

Launch Control experiments are stored internally as Nextdoor Sitevars. Because of this, our backend containers automatically have access to cached, in-host payloads for all relevant experiment definitions. This makes it relatively easy to have the backend Launch Control APIs be performant. Evaluation typically involves a small number of CPU operations, and thanks to the Sitevars cache, requires no RPCs or network calls. Even so, Launch Control also maintains a per-web-request cache of evaluated experiments, ensuring that each experiment is evaluated at most once per user and per request.
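
A per-request cache of this kind can be as simple as a dictionary, keyed by experiment and user, that lives for the duration of one request. The sketch below is an assumption about the general shape and not Launch Control's implementation; `RequestScopedEvaluator`, `slow_evaluate`, and the `new_feed` experiment name are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


class RequestScopedEvaluator:
    """Memoizes experiment results for the lifetime of a single web request."""

    def __init__(self, evaluate_fn: Callable[[str, int], str]) -> None:
        self._evaluate_fn = evaluate_fn
        self._cache: Dict[Tuple[str, int], str] = {}

    def treatment(self, experiment_name: str, user_id: int) -> str:
        key = (experiment_name, user_id)
        if key not in self._cache:
            # First lookup in this request: run the real evaluation once.
            self._cache[key] = self._evaluate_fn(experiment_name, user_id)
        return self._cache[key]


calls: List[Tuple[str, int]] = []


def slow_evaluate(experiment_name: str, user_id: int) -> str:
    """Stand-in for the real evaluation; records how often it is invoked."""
    calls.append((experiment_name, user_id))
    return "treatment" if user_id % 2 == 0 else "control"


evaluator = RequestScopedEvaluator(slow_evaluate)  # one instance per request
first = evaluator.treatment("new_feed", 10)
second = evaluator.treatment("new_feed", 10)
print(first, second, len(calls))  # treatment treatment 1
```

Besides saving CPU time, evaluating once per request also means every part of a page sees the same treatment group, even if the experiment definition is updated while the request is in flight.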

On the other hand, the frontend and mobile APIs present more of a challenge. Since this code runs far from our servers, it’s impractical to cache and constantly synchronize all experiment definitions. It would also be prohibitively expensive to make each individual experiment API call involve a network operation. Because of this, we provide two flavors of Launch Control APIs on our clients.

The most commonly used flavor is a local API, which relies on the client having a known, pre-fetched list of experiments available in memory at all times. Clients run a single network request which fetches all experiment evaluations in this list at key application lifecycle moments (such as user login, session refresh, etc.), and those results are then available for code to query against in a synchronous fashion. We also provide an asynchronous API, which does allow engineers to make individual network requests for each experiment. This API only exposes non-blocking components, however (such as Observables on Android and Promises on web), to make its asynchronous nature clear and self-enforcing. While the local API is used for the most common experiments users are exposed to, the asynchronous flavor helps prevent rarely used experiments from excessively bloating the prefetch list. Finally, while it is out of scope for this article, Launch Control also provides specific APIs for engineers to run experiments in special cases, such as login and sign up screens, where we do not yet have a particular user available.

Other Features

The most important components of Launch Control are its targeting algorithms and query APIs, as described above. However, it also has a number of convenience and ease-of-use features, designed to make experimentation and feature flagging at Nextdoor a delightful experience. We strongly believe that engineers deserve tools that are as good as the products we ship to our neighbors around the world. Some of the other important Launch Control features include:

Contextual editing UI: The UI for each Condition Block type is unique and has deep knowledge of that block’s context. For example, allow-lists for individual users expose a generic search and typeahead. This allows us to add users by their name, email, or other traits, eliminating the need for people to memorize ids, and empowering team XFN partners such as Product Managers and Designers to add themselves to experiments directly.
Versioning and fast reverts: Launch Control experiments are stored internally as Nextdoor Sitevars, which have built-in support for versioning. Whenever a Launch Control is updated, we store a new version of it, with useful metadata such as edit time and author. This allows us to easily navigate through the full history of an experiment, eliminating the need to maintain redundant documentation on when a particular rollout changed. In addition, storing every version of an experiment also allows us to quickly revert features to known-good states in case anything goes unexpectedly sideways with a release.
Built-in observability and update subscriptions: Every Launch Control automatically publishes useful real time statistics whenever it’s evaluated, for all users. This allows teams to verify, in real time, that their experiments are going out to the expected number of users, and in the right proportions. In addition, employees can also subscribe to Launch Control experiments, so they get automatically notified when an experiment is updated.

Conclusion

Launch Control experiments drive many of our recent Nextdoor product improvements, such as changes to Notifications, Search, and our Business Experience. By striking a balance between expressive power, simplicity, and usability, we were able to collaboratively build a tool that experienced widespread internal adoption.

Have you worked on, or used, feature targeting and A/B testing tools in your career? Let us know in the comments below what you’ve learned works and what doesn’t, and if you’re excited about collaborating with other great engineers on impactful work like this, check out our careers page! We have open opportunities across different teams and functions.

Launch Control at Nextdoor was originally published in Nextdoor Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Learn DevOps by Doing

Slava Markeyev — Wed, 23 Feb 2022 17:44:17 GMT

A tricky problem and what you can learn from it

Nextdoor’s cloud engineering team moves fast, and sometimes that means solving or working around quirky, and often undocumented, problems in vendor services and open source tools. This blog post walks through one of those problems we recently encountered in our infrastructure. While the problem is rather niche, I thought this exercise might be interesting for readers who are less familiar with the problem space, or who are new to the field of DevOps and are looking to chip away at the question of “what do DevOps engineers do?” In an effort to answer part of this question, I wanted to share a problem we recently came across with validating users in our service mesh. For new readers, the setup to the problem is fairly technical, but the crux of it and the solution are actually straightforward. I hope this account won’t scare you but rather make you excited about diving into problems by keeping the following in mind:

There’s nothing that you can not learn that can not be learned. There is nothing you can not do that can not be done.

Setup and Overall Goal

My co-worker came to me with an interesting problem. He set up an AWS Application Load Balancer (ALB) and wanted to use the user authentication feature. This works by the load balancer sending the user through an authentication flow with an identity provider, like Okta or OneLogin, and retrieving information about that user from the identity provider after they have logged in. The information looks something like this:

{
  "sub": "1234567890",
  "name": "Slava",
  "email": "slava@example.com",
  ...
}

The load balancer then creates a cryptographic signature of this information along with information on which encryption key and hashing algorithm were used during the signing process. This is then stored in the x-amzn-oidc-data HTTP request header which gets forwarded along with the original HTTP request from the user to the web application.

Stepping back for a moment, this authentication flow typically happens in web applications. However, by setting up our infrastructure such that this happens earlier on in the request flow, we can standardize the authentication process and simplify the lives of application developers. Anecdotally, this sort of standardization to benefit all developers is a major cornerstone of successful DevOps teams.

The Environment

As I mentioned, the x-amzn-oidc-data token is typically verified by the downstream application itself but there are reasons to additionally perform the verification earlier on such as layered security, auditing, logging, and routing.

Basic network topology

In our environment, after the request gets processed by the load balancer it enters our Kubernetes service mesh before being routed to its final destination, a web application. A service mesh is an intelligent part of the network infrastructure which dynamically routes requests based on a number of factors to applications. There are a handful of different mesh options available for Kubernetes; our team chose to deploy Istio which is built upon Envoy. Envoy is a powerful proxying tool initially developed at Lyft and subsequently adopted by the Cloud Native Computing Foundation. While at its core Envoy is written in C++, it also allows developers to write scripts in Lua to process HTTP requests.

The Data

The information in the x-amzn-oidc-data header is written in the JSON Web Token (JWT) format. The JWT format simply defines how to encode the information about the signer, the user, and the signature itself. The pseudo code looks like:

header = {
  "alg": "ES256",
  "typ": "JWT",
  "kid": "1234",
  "signer": "arn:aws:elasticloadbalancing:us-west-2:1234567890:loadbalancer/app/foobar"
}

claims = {
  "sub": "1234567890",
  "name": "Slava",
  "email": "slava@example.com"
}

# 1. Encode claims and header
header_enc = base64.encode(header)
claims_enc = base64.encode(claims)

# 2. Concatenate them together with a period
payload = header_enc + '.' + claims_enc

# 3. Sign and encode signature
sig = crypto.sign(payload, private_key, method="ES256")
sig_enc = base64.url_encode(sig)

# 4. Concatenate signature with the data to form the JWT
jwt = payload + '.' + sig_enc

In order to verify the JWT the application receiving the request needs to go off and lookup the public part of the signing key by its kid (key id) from an endpoint which stores the signing key. Typically, the endpoint to retrieve the key follows the JSON Web Key Set (JWKs) api specification but AWS chose a different approach for making the signing keys available.
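
As a small illustration of the first step, the sketch below pulls the `kid` out of a token's header using only the Python standard library. The helper names `b64url_decode` and `jwt_key_id` are made up for this example, and the toy token carries a placeholder where the real claims and signature would be. Reading the header this way does not verify anything; it only tells us which public key to fetch.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring any stripped '=' padding."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def jwt_key_id(token: str) -> str:
    """Return the 'kid' from the (unverified) JWT header."""
    header_segment = token.split(".")[0]
    header = json.loads(b64url_decode(header_segment))
    return header["kid"]


# Build a toy token with the same header fields as the pseudo code above.
header = {"alg": "ES256", "typ": "JWT", "kid": "1234"}
header_enc = base64.urlsafe_b64encode(json.dumps(header).encode()).decode()
toy_token = header_enc + ".claims.signature"

print(jwt_key_id(toy_token))  # 1234
```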

While Envoy already has support for verifying JWT tokens, it only supports the JWKs standard endpoint for retrieving public keys and unfortunately AWS’s endpoint is non-conformant. Luckily with Envoy’s extensibility, we can implement the key fetching and verification ourselves in Lua.

A positive attitude

By the time I got involved my co-worker had already come up with some Lua code to extract the claim headers, fetch the signing key from an AWS endpoint, and call a signature verification method to verify the claims were signed by the ALB. The only problem was that signature verification kept on failing in a non-obvious manner. Up to this point, my knowledge of JWTs was limited but I had a can-do attitude.

Solving the signature verification problem proved to be quite challenging. Fairly early on my co-workers identified that the Lua language might not be sufficient to solve the problem. This led me down a rabbit hole of writing a signature verification extension for Envoy in TinyGo and, after that attempt failed, in TypeScript. Unfortunately both attempts were thwarted by subtle limitations in the languages and runtime. However, there was another way and this post will go into the details.

One might be misled by the brevity of what I’m about to discuss but getting to this point entailed a lot of frustration and dead ends. While grit is the most important aspect of an engineer’s toolkit so is knowing when to quit. Quite honestly, I was about to give up on this problem and spend my time on a more fruitful endeavor but I caught a lucky break by revisiting a prior step with some newfound knowledge from along the way.

Where I started

Let’s rewind all the way to the beginning. I initially began this journey by reading the ALB documentation to see if there was something simple we had missed. While there was nothing obvious, the docs did contain a snippet of Python for performing the verification. This was not super interesting in and of itself because it simply called Python’s JWT library to do the signature verification but it was a starting point since the example did work.

The example looked like:

import jwt, requests, base64, json

# Step 1: Get the key id from JWT headers (the kid field)
encoded_jwt = headers.dict['x-amzn-oidc-data']
jwt_headers = encoded_jwt.split('.')[0]
decoded_jwt_headers = base64.b64decode(jwt_headers)
decoded_jwt_headers = decoded_jwt_headers.decode("utf-8")
decoded_json = json.loads(decoded_jwt_headers)
kid = decoded_json['kid']

# Step 2: Get the public key from regional endpoint
url = 'https://public-keys.auth.elb.' + region + '.amazonaws.com/' + kid
req = requests.get(url)
pub_key = req.text

# Step 3: Get the payload
payload = jwt.decode(encoded_jwt, pub_key, algorithms=['ES256'])

Stepping through this example with a Python debugger and going down the call tree step by step provided a clue as to what was missing from the validation happening in Envoy/Lua.

The signature

To sign the JWT token the ALB requires three pieces of information 1) the data being signed 2) the private key and 3) a hash function. Per the documentation the ALB uses what is known as ES256 or ECDSA key + SHA256 hash function. The output of this signing function is two 256-bit integers referred to as R and S. The two integers are concatenated together forming 64 bytes which represents the full signature.

Illustrates that R and S are two 256 bit integers

When doing the verification we feed the verification function the data, public key, hash function, and signature (R + S). The verification function will tell us if the data was in fact signed by the public key’s corresponding private key, thus establishing the trust relationship.

I noticed when stepping through with a Python debugger that before the 64-byte signature was fed into the verification function, it was being split back into R and S and the two were converted into integers before being encoded into a DER encoded ASN.1 structure. ASN.1 is simply an encoding format dating back to 1984 which remains popular in cryptographic applications like OpenSSL. The discovery of the splitting, converting to integers, and encoding seemed “interesting”.

Stepping through into jwt/utils.py and its use of encode_dss_signature
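
The splitting step is easy to reproduce in plain Python, where integers have arbitrary precision. The sketch below is not the library's code; `split_raw_signature` is a name made up for this example, and the toy signature is built from small numbers instead of a real ALB signature.

```python
def split_raw_signature(signature: bytes) -> tuple:
    """Split a 64-byte ES256 signature into the integers R and S."""
    if len(signature) != 64:
        raise ValueError("ES256 raw signatures are exactly 64 bytes")
    r = int.from_bytes(signature[:32], "big")  # first 32 bytes, big-endian
    s = int.from_bytes(signature[32:], "big")  # last 32 bytes, big-endian
    return r, s


# Toy signature: R = 42 and S = 63, each left-padded to 32 bytes.
raw = (42).to_bytes(32, "big") + (63).to_bytes(32, "big")
r, s = split_raw_signature(raw)
print(r, s)  # 42 63
```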

Knowing the signature may need to be encoded gave a hint as to why it wasn’t working in our Lua script. A quick read of the Envoy docs and following the source code showed the verification function in Envoy’s Lua was actually backed by a C++ helper function. The helper function calls the EVP_DigestVerify function in OpenSSL which performs the validation. While documentation for this function is rather scant, googling around for usages of it in other projects showed that the signature was being encoded beforehand.

Finding a solution

At this point the obvious problem of why the signature verification was failing was solved. The less obvious and trickier problem was how do we encode the signature before passing it to the verification function. This finally brings us to the crux of this fun little problem.

Comparing outputs

How does one encode ASN.1 in Lua? The obvious answer is you turn to GitHub and find a library someone already wrote, something like this one from NMAP. But what if that doesn’t work? It initially didn’t work because the library was written for a newer version of Lua than what Envoy supported. After some modifications to replace new language function calls with older ones I got the encoder running but it was producing the wrong output. I knew this because I had the output from the encoder running in Python and was able to compare the two.

[slava]$ hexdump lua_output.bin
0000000 30 0c 02 04 71 6d fb 33 02 04 ce 83 66 58
000000e

[slava]$ hexdump python_output.bin
0000000 30 45 02 20 71 6d fb 33 00 91 74 3e a8 92 ce 3c
0000010 e0 7c 30 cf b8 5d 08 62 5d 82 c1 31 a5 95 93 11
0000020 d1 77 bc 81 02 21 00 ce 83 66 58 a5 11 57 23 30
0000030 59 0f 80 3e f3 a9 ae 25 9a d5 ed 20 84 50 57 94
0000040 52 6c 92 6e 2b 99 2c
0000047

At first glance those two outputs don’t look similar at all, but on closer inspection they do have some similarities: the first four bytes of R (71 6d fb 33) and of S (ce 83 66 58) appear in both. Somehow only part of R and S is being encoded. A good guess would be that there’s truncation happening here.

As it turned out, unpacking an integer represented by a 256-bit array into an integer primitive simply wasn’t going to work in a language that primarily dealt with 32-bit integers. This is the point where I almost gave up because going off and writing an unpacking library that made use of bigint under the hood and somehow tying that into the ASN.1 encoder was more work than I wanted to do at that point.

Noticing a pattern

Then I saw a commonality. What if I could take a shortcut? While I was playing around with the encode_dss_signature function in Python I started to notice a pattern.

# Bit of python that encodes R and S into ASN.1 and then writes the
# signature into a file.

from cryptography.hazmat.primitives.asymmetric.utils import encode_dss_signature

r = 42
s = 63

sig = encode_dss_signature(r, s)

f = open(f'/tmp/{r}_{s}.bin', 'wb')
f.write(sig)
f.close()

Using the hexdump cli tool we can print the encoded binary files out in hex format.

# r = 42  hex value: 2a
# s = 63  hex value: 3f

[slava]$ hexdump 42_63.bin
0000000 30 06 02 01 2a 02 01 3f

# r = 12   hex value: 0c
# s = 127  hex value: 7f

[slava]$ hexdump 12_127.bin
0000000 30 06 02 01 0c 02 01 7f

# r = 4277071598   hex value: feeeeeee
# s = 4294967295   hex value: ffffffff

[slava]$ hexdump bignums.bin
0000000 30 0e 02 05 00 fe ee ee ee 02 05 00 ff ff ff ff

The pattern was that the bytes of R and S were directly sandwiched between some bytes that typically were the same, except when R and S grew in bignums.bin. A thought popped into my head: I don’t actually need to convert the two into integers if the ASN.1 encoder is simply turning the integer value back into a series of bytes.

Making sense of the pattern

I found the openssl asn1parse command as part of the openssl cli tool which provided structural information about the encoded signature file.

[slava]$ openssl asn1parse -in 42_63.bin -inform der -i
   0:d=0  hl=2 l=   6 cons: SEQUENCE
   2:d=1  hl=2 l=   1 prim:  INTEGER           :2A
   5:d=1  hl=2 l=   1 prim:  INTEGER           :3F

The key thing to note from the above command output is the file has a structure to it. Specifically there are references to SEQUENCE and INTEGER. This makes sense since we encoded two integers. The sequence must be referring to essentially a list of things of which we have two integers.

I typically look at the official Request For Comments docs detailing this sort of thing but in this case I had some trouble understanding the encoding scheme as detailed in rfc6025. After a bit of Googling I found the following table that helped explain what I was seeing in the hexdump output in relation to the output from asn1parse.

https://letsencrypt.org/docs/a-warm-welcome-to-asn1-and-der/#tag

A SEQUENCE is defined by the hex value 30 and an INTEGER by the hex value 02. Piecing it together, we can decipher the hexdump of R = 42 (2a) and S = 63 (3f).

# r = 42  hex value: 2a
# s = 63  hex value: 3f

[slava]$ hexdump 42_63.bin
30 06 02 01 2a 02 01 3f
^^ ^^
|+- Sequence (30) of length 6 (06) bytes

30 06 02 01 2a 02 01 3f
      ^^ ^^ ^^
      |+- Integer (02) of length 1 (01) byte with value 2a

30 06 02 01 2a 02 01 3f
               ^^ ^^ ^^
               |+- Integer (02) of length 1 (01) byte of value 3f
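The same tag, length, value reading can be done programmatically. The following is a minimal Python sketch, not part of the original solution: `walk_der` is a hypothetical helper name, and it only understands the two tags seen above (30 for SEQUENCE, 02 for INTEGER) plus the long length form that DER uses once a length reaches 128 bytes.

```python
def walk_der(data: bytes, depth: int = 0):
    """Yield (depth, tag, value) for each element, descending into SEQUENCEs."""
    pos = 0
    while pos < len(data):
        tag = data[pos]
        length = data[pos + 1]
        pos += 2
        if length & 0x80:
            # Long form: the low 7 bits say how many bytes hold the real length.
            n = length & 0x7F
            length = int.from_bytes(data[pos:pos + n], "big")
            pos += n
        value = data[pos:pos + length]
        pos += length
        yield depth, tag, value
        if tag == 0x30:
            # The contents of a SEQUENCE are themselves tag/length/value elements.
            yield from walk_der(value, depth + 1)


# Bytes taken from the hexdump of 42_63.bin above.
sig = bytes.fromhex("300602012a02013f")
elements = list(walk_der(sig))
for depth, tag, value in elements:
    print("  " * depth + f"tag={tag:02x} len={len(value)} value={value.hex()}")
```

Running it on the bytes of 42_63.bin lists one SEQUENCE containing two INTEGER elements, which mirrors what asn1parse reported.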

Finding a solution

From here, we can compose an ASN.1 DER signature fairly easily since we have R and S along with their lengths. I’ve included an example solution on GitHub along with a working Envoy configuration to see this in action.

GitHub - stlava/envoy-asn1-example
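To illustrate the idea in the same language as the earlier experiment, here is a rough Python sketch of composing the DER bytes directly from the raw bytes of R and S, without ever converting them to integers. It is not the code from the repository above; the function names `der_length`, `der_integer`, and `der_signature` are made up for this example. One detail that the bignums.bin dump hints at: DER integers are signed, so when the first byte of a value has its high bit set (fe or ff above), a 00 byte is prepended to keep the value positive, and redundant leading zero bytes are dropped.

```python
def der_length(n: int) -> bytes:
    """Encode a length: one byte below 128, otherwise the long form."""
    if n < 0x80:
        return bytes([n])
    size = (n.bit_length() + 7) // 8
    return bytes([0x80 | size]) + n.to_bytes(size, "big")


def der_integer(raw: bytes) -> bytes:
    """Wrap unsigned big-endian bytes as a DER INTEGER (tag 02)."""
    body = raw.lstrip(b"\x00") or b"\x00"  # minimal form; zero keeps one byte
    if body[0] & 0x80:
        body = b"\x00" + body  # high bit set would otherwise read as negative
    return b"\x02" + der_length(len(body)) + body


def der_signature(r_raw: bytes, s_raw: bytes) -> bytes:
    """Wrap the two INTEGER elements in a SEQUENCE (tag 30)."""
    body = der_integer(r_raw) + der_integer(s_raw)
    return b"\x30" + der_length(len(body)) + body


# Reproduce the two hexdumps shown earlier.
small = der_signature(bytes.fromhex("2a"), bytes.fromhex("3f"))
big = der_signature(bytes.fromhex("feeeeeee"), bytes.fromhex("ffffffff"))
print(small.hex(" "))
print(big.hex(" "))
```

For a JWT signed with a 256-bit curve, R and S would be the two 32-byte halves of the decoded signature. The whole structure then stays under 128 bytes, so only the single-byte length form is exercised.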

A look back

This was quite the adventure from where we began with the initial problem statement of wanting to use a load balancer to do user authentication and verifying requests in the service mesh. For me it involved reading mounds of documentation and technical blog posts like this one. As a result, I now know a whole lot more about Lua, ASN.1, and JSON Web Tokens, and I also dove into WebAssembly with TinyGo and TypeScript. Every day isn’t like this, but this little adventure is representative of the broad domain an engineer may work through in a role on this team.

Conclusion

In closing, I want to touch upon a few parting thoughts for new readers. Operations is an ever evolving field with team titles like SRE, DevOps, DevSecOps, Cloud Engineers, and others people keep inventing. An engineer on one of those teams may be working on completely different tasks than their counterpart at a different company.

A successful DevOps engineer may not be a subject matter expert like their peers on other teams but they nonetheless possess a powerful toolkit. Simply reflecting on this adventure I’d include the following:

The ability to find and synthesize documentation and code in software they are not familiar with.
Being comfortable reading code in languages they do not practice.
The tenacity to walk into a problem space with a positive attitude.

This list is by no means comprehensive and all of these skills are built up over time. However, for new readers just entering the field the most important thing is a willingness and commitment to learn and answer the question of: “How does something work as you peel back the layers?”

Don’t be afraid to ask and seek answers for questions like: “what happens when I type google.com in my address bar and hit enter?” or “how does my computer boot up?”

Note: Some readers may look at this list and say these are traits of any successful engineer and while I agree with you, I would also argue DevOps and DevOps adjacent engineers need to walk into unfamiliar problems in foreign pieces of code day in and day out.

Learn DevOps by Doing was originally published in Nextdoor Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Engineering Principles (v1) at Nextdoor

Antonio Silveira — Wed, 02 Feb 2022 17:52:55 GMT

At Nextdoor, the written version of these engineering principles is new; however, the principles themselves are not. These principles developed from Nextdoor’s values, our purpose, and how we’ve worked together over the last several years. Our goal in sharing these with the broader engineering community is to learn from one another.

Engineers are constantly faced with weighing the advantages and disadvantages of various approaches; in many ways, engineering is the art of making constant tradeoffs. Similar to other engineering organizations, Nextdoor has a set of unwritten rules on how to evaluate tradeoffs and to make strategic decisions. These principles were effectively embedded in the engineering culture and they spread verbally amongst our engineers over time.

As we rapidly scaled Nextdoor’s engineering organization, we quickly realized it was important to explicitly write out these principles rather than leaving them unspoken. Moreover, unwritten rules cannot scale for a growing engineering organization distributed across multiple locations. Written rules are also more equitable as everyone has equal access to them. We want all new engineers to learn and apply our engineering principles quickly.

The goal for our new engineering principles is threefold. First, we want to create a shared vocabulary for how we weigh tradeoffs. Second, we want these principles to help us make consistent decisions throughout the organization. Third, we want to create a shared understanding of the rules of engagement for how engineers and teams should interact with each other.

These principles are not meant to be dogmatic but rather to bias us towards a particular way of thinking. When engineers weigh the pros and cons of an approach, we think of these as a finger on the scale guiding us in a certain direction. Applying the principles requires exercising good judgment. In some cases, the principles may even be in tension with each other.

This post is very long, and that’s by design. As a company, Nextdoor has a small number of core values. We also have guiding principles that help us make day-to-day decisions. Each principle contains the description and also the rationale behind the principle. The reason for this is that these principles are often nuanced and we don’t want them being mindlessly applied as if they are hard-and-fast rules. It’s important to understand the intention behind the principle to know when and when not to apply it. In some cases, we explicitly call out the “limiting principle” to help us avoid over-indexing on these principles.

And if you are interested in building active local communities and in our engineering principles, please reach out. We have multiple opportunities across all technical areas at Nextdoor. Please check out our Careers Page for a detailed list of all the roles currently open, and let’s connect on LinkedIn.

Engineering Principles at Nextdoor

Purpose-Driven not Tech Driven
Software Engineering is a Team Sport
Empowering Engineers
Moving Fast
Collective Code Ownership
Optimize for Learning
Craftsmanship

Purpose-Driven not Tech Driven

We’re a purpose-driven company. We never build technology for the sake of technology. We build technology to deliver value to the neighbors and organizations we serve. We build great infrastructure to enable our engineers to move faster. We use data and insights to understand how our changes impact our customers.

⭑ Recognize impact over shipping

In general, we bias towards recognizing sustainable impact over shipping. Impact is defined as delivering value to our neighborhoods and our customers (both internal and external). We strive not to “confuse movement with progress.” We strive to measure our impact with data when possible.

Why is this important? We want to align our engineers’ incentives with the company’s purpose.

Limiting Principle? Not every project we try will succeed. We will likely fail as much as we succeed. We never stigmatize failure, but we do expect that over the long term, the impact of our successes will outweigh the costs of our failures. See Optimize for Learning below. In addition, at the more junior engineering levels, we do recognize and reward execution. Also, not every impactful change can be measured so sometimes the impact will be based on conviction, alignment with our product strategy, company values, and qualitative feedback.

When evaluating impact, we should always think about it as the impact-over-time summation (i.e. the integral under the curve). We don’t want to create a culture that rewards an engineer for achieving a modest short-term improvement at the expense of a large amount of tech debt that dramatically slows down future productivity. We should consider the impact holistically and weigh the short-term improvements against the long-term costs. This is heavily tied into our company value of “Act Like an Owner” as an owner should always be considering value over a long-term time horizon.

⭑ Value over originality

We shouldn’t strive to be original for the sake of being original. Being different is not a goal. Being original is valuable when it’s in service of our purpose. In fact, much of what exists in our Engineering Principles is taken from the learnings of other companies.

Corollary. We don’t build when we can buy something or use open source that meets our needs, and when building it ourselves wouldn’t differentiate Nextdoor or the experience.

Limiting Principle? We shouldn’t blindly copy other technical and product decisions from other companies. We should always evaluate decisions in terms of Nextdoor’s purpose and values. We should understand that a decision made at a certain point of time does not necessarily mean it’s the right decision today.

⭑ Reward the foundational work

We celebrate and reward the foundational work that is critical to delivering value to our neighborhoods and to our customers. This work is equally as important and necessary as feature work. By foundational work, we mean all the work that goes into supporting a fast and polished product. This includes areas that provide leverage such as Infrastructure, Developer Experience, Platform APIs, and Tools. It also includes the work to maintain a clean codebase that we take pride in. It also includes the craftsmanship that goes into building a polished product that delivers delight to our neighborhoods.

Why is this important? Every type of work is important when it’s in service of our purpose. Fixing tech debt, supporting legacy systems, doing large painful refactorings, and writing SQL queries to analyze an experiment are examples of this important work that we should all take on. Senior engineers should be models of this behavior. We should never think of a certain type of work as being dirty or unglamorous when it’s in service of our purpose.

⭑ Pragmatism over Dogmatism

As engineers, we are constantly faced with weighing tradeoffs and making recommendations and decisions. We hold opinions loosely in order to make pragmatic decisions based on known requirements. When conditions change, we let go of previously-held opinions. We bias towards making choices based on weighing the costs and benefits.

Limiting Principle? We don’t compromise on our company’s core values and we have created a set of guiding principles in this document to help us guide our behavior and decision-making process. That said, most decisions are rarely all or nothing and we are often tasked with evaluating tradeoffs when our principles or even core values may be in tension with each other.

⭑ Globally Optimize

We strive to make decisions that adhere to “What’s in Nextdoor’s best interest?” We are system-level thinkers and we take into account how our changes impact other teams in engineering, the entire company, our neighborhoods, and our customers.

Why is this important? Conway’s Law states that “Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.” By keeping this top of mind, we can resist shipping our org chart when it’s not serving our purpose.

Software Engineering is a Team Sport

We build software as teams, not as individuals. It’s essential that we are able to collaborate well with each other in order to reach the best outcomes.

⭑ We treat each other with respect

We are members of a community and it is essential that we treat each other with respect. We’re invested in one another’s success and this means we listen to each other and help each other to succeed. We understand that work is just one part of people’s lives and we seek to understand people in their entirety and withhold judgment.

Why is this important? Each engineer brings their unique superpowers, perspectives, and background to the table. We are only able to do our best work and harness our unique abilities when we treat each other with respect.

⭑ Optimism over Pessimism. Avoid cynicism.

We approach our work and our coworkers with positivity. We like the improv rule of “Yes and” instead of our initial instinct being “No” since starting with “No” tends to stifle creativity. We want to aim to assist the person.

Why is this important? Pessimism and cynicism are contagious and drain energy and creativity. A single pessimistic or cynical person can destroy the productivity of an entire team (study).

Limiting principle? We still must be realistic and provide valid feedback. There’s always a way to provide honest feedback while still remaining positive. We also don’t want to create cases where one engineer is blocking another engineer and the onus is solely on the blocker to unblock the work. It’s a shared responsibility to try to get to “Yes”.

⭑ Clean escalations when there are disagreements

When we disagree, each side should strive to understand the other side. When both sides can articulate the other’s point of view, they are able to distill the disagreement down to its essence. A clean escalation is one where both parties bring the disagreement to a third party in order to help mediate or arbitrate the decision. A dirty escalation is where one party escalates to the third party without the other. There should be a willingness to disagree and commit when a consensus can’t be reached.

Why is this important? Many disagreements are not actually disagreements but rather the two sides talking past each other or failing to understand the other’s point of view. When the parties truly understand each other’s point of view, there often doesn’t need to be an escalation. If an escalation is required, it’s more efficient when the disagreement has been distilled down to its essence.

Empowering Engineers

We hire talented people and empower them to accomplish our purpose. Engineering should be rewarding and building should be a creative process. Easy things should be easy and hard things should be in service of the actual business problems versus fighting with the infrastructure or tools.

⭑ Context over control

We believe in providing engineers and engineering teams with the context they need to make sound decisions as opposed to being order takers. We want engineers to operate with autonomy as part of their cross-functional teams while taking into account the needs of their stakeholders. It’s the obligation of leadership to provide the context such as our strategy, our priorities, and our resources so engineers can make sound decisions.

Why is this important? Engineers are problem-solvers that collaborate with their cross-functional partners to build great products for our customers and great infrastructure for other engineers and stakeholders. We want to avoid a permission-seeking culture where engineers feel they need to be told what to do or seek management approval. An engineer at Nextdoor should not function as an API that is called by their Engineering Manager or Product Manager to mindlessly translate tasks and specs into Pull Requests.

⭑ Infra > Tools > Docs > Process

We prefer solving problems at the infrastructure level to reduce the burden on our engineers and to implement consistency across the company. If we can’t solve it at the infra level, we’d prefer solving it with tooling rather than through documentation or process.

Why is this important? We want our engineers to be as productive as possible. Process can slow engineers down, even with the best of intentions. Documentation gets out of date or is undiscoverable. Infra automates the work away and reduces cognitive overhead. Great tooling can often replace manual processes.

⭑ Self-service over file a ticket

We prefer self-service tools over processes that require human intervention.

Why is this important? Self-service can be instantaneous. It avoids blocking an engineer on another person and reduces context switching. We strive to reduce dependencies and processes that require human intervention. An example is the impersonation permission flow which allows an engineer to obtain temporary permission to impersonate a neighbor without requiring human intervention in a safe and auditable manner.

⭑ Transparent Decisions

We are transparent with our decisions, explaining the reasons behind our decisions, and the decision-making process itself. We do our best to document and share our decisions, and make meeting notes visible to the entire company when possible. At Nextdoor we use the SPADE framework to document and broadly share important decisions for product, engineering and business topics.

Moving Fast

We believe that moving fast is essential for us to accomplish our purpose. This is encapsulated in our company core value of “Experiment and learn quickly”. We also recognize that there are tradeoffs between moving fast now versus moving fast over the long term.

⭑ Under-engineering is preferable to over-engineering

We prefer to validate our product and engineering decisions quickly and add things later that we missed. Therefore, we want to architect our systems taking into account known requirements and build what is truly needed.

Why is this important?

A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system. — Gall’s Law

Limiting principle? We should be careful to not box ourselves into a corner with our early decisions but also not to waste resources over-engineering something where we don’t understand the long term requirements.

⭑ Working code over comprehensive documentation

We prefer to bias towards building and prototyping over writing long architectural docs and blocking on architectural approvals. We agree with the Agile Manifesto.

Why is this important? We want to be as agile as possible and learn through experimentation and iteration. A prototype, like a picture, says a thousand words.

Limiting principle? Large changes that impact many engineers often require more up-front design and stakeholder feedback. See Farm for Dissent. We should write overview docs and how-to-docs for infrastructure that is used by many engineers. We are not against documentation. This is a “bias towards” not an “all or nothing” approach.

⭑ Build iteratively

We prefer to build iteratively such that our changes deliver value incrementally.

Why is this important? We want to avoid multi-quarter projects where no value is delivered until the very end of the project. Industry experience shows that these projects tend to be high risk and often fail. Almost all projects can be broken up into milestones that deliver value iteratively along the way and help validate the direction, build momentum for the team, and avoid trying to hit a moving target where the requirements change over time.

Limiting principle? Larger, strategic projects may take longer to deliver value. To ensure that we are on the right track and making continuous progress, these projects should have milestones that incrementally validate themselves if the value can’t be delivered more iteratively.

⭑ Infrastructure Should Be Easy/Secure/Fast By Default

Our infrastructure should enable engineers to fall into the “pit of success”. We pick defaults that should lead to the right outcomes. We strive to eliminate foot guns. We strive to minimize writing the same boilerplate code over and over. We optimize the infrastructure to make the lives of the customers better as this is the direction of the leverage. The scope and responsibility of the infrastructure maintainers expand to accomplish this goal. Building on top of our infrastructure should feel like playing with legos. You can’t hurt yourself playing with legos, but you can build something great.

Why is this important? Our engineers are more productive when the infrastructure allows them to focus on the product value rather than solving recurring problems that can be solved once for everybody.

⭑ Minimize repositories and services

We strive to minimize the number of repositories and services that need to be touched to build an end-to-end product change. We explicitly identify micro-services that provide middle-tier functionality as an anti-pattern.

Why is this important? We measure our velocity by how quickly we can deliver end-to-end improvements to our customers. Even though having more services can make it faster to merge a single PR, it slows down product development when an end-to-end feature requires merging into multiple repositories, sequencing the deployments, and maintaining protocol compatibility between services. In addition, it’s impossible to factor product functionality correctly from the beginning as products evolve and the requirements change. Refactoring code within a single service is much easier than refactoring the services themselves.

Limiting principle? Good candidates for services are ones where the interface is extremely stable and product changes rarely need to be coordinated across protocol boundaries (e.g. DynamoDB, Redis, Postgres). Separate vertical products that have very little coupling to other products may also be good candidates (e.g. Ads Platform).

⭑ Two-way doors are better than one-way doors

We prefer decisions that are easy to undo. We optimize for making cheap mistakes, since making mistakes is unavoidable. When considering a one-way door, we should heavily lean on Farm for dissent.

Why is this important? Making mistakes that are easy to undo means we can move faster and learn quickly. One-way doors require a lot of time and energy since the stakes are much higher.

⭑ Testability and Testing

Our code should be easy to test and well-tested. When building libraries and infrastructure, it’s not enough to write unit tests for the library. The library should be written such that the code that uses it is also easy to test. A library that makes it hard to test the code that uses it is not a very good library.

Why is this important? Unit, integration, and automation tests help us move faster. While it might slow down the initial development to write the tests, in the long run the tests allow us to move faster in the future and avoid costly regressions.

Limiting principle? We are not a test-driven development shop that requires 100% code coverage. Writing and maintaining tests have a cost and a benefit. We should make writing tests as easy as possible. We should have a bias towards writing tests, but we should be pragmatic and weigh tradeoffs.

⭑ Observability

We should strive to make our services and applications have a high degree of observability. Analytics, logging, error handling and metrics are a critical part of everything we build.

Why is this important? Nextdoor is fundamentally a complicated system. It’s made especially complicated because it’s an ecosystem that has network effects where users affect each other’s behavior. It’s critical that we write our products and services to be as observable as possible so that we know they’re working as expected and so we can quickly perform root cause analysis when something goes wrong. A small amount of work to build observability upfront generally saves countless hours later.

Collective Code Ownership

Collective code ownership stems from our belief that Software Engineering is a Team Sport and supports our desire to Empower Engineers and to Move Fast. Engineers need to be able to easily read, understand, and modify code written by other engineers.

⭑ Stewardship over strict code ownership

Our intention is similar to what Martin Fowler describes as weak code ownership. Engineers should not act defensively when another engineer or team desires to make a change to the code they generally maintain. Instead, they should act as stewards of the code advising other teams on how best to work together to facilitate a change that furthers our purpose. This should be in the context of the ProdDev ownership process where the team owning the area of the product affected should be the approver of the change.

Why is this important? We don’t want our engineers acting territorially. This slows down innovation and makes cross-cutting changes more difficult. We generally want to rely on engineers to exercise good judgment when seeking code reviews rather than requiring strict approvals from code owners. This is enabled by having a safe infrastructure.

Limiting Principle? We expect engineers to exercise good judgment and seek feedback and code reviews from the persons or teams that are best able to review the change. We expect engineers to follow the ProdDev ownership process.

⭑ Minimize languages and frameworks

We strive to minimize the number of languages and frameworks used at the company.

Why is this important? We want engineers to be able to contribute across the product and at different levels of the stack. This enables engineers to work in more areas of the product, reduces dependencies between teams, and increases agility of the organization. This also reduces silos and allows us to work as one engineering team.

Limiting Principle? Often there are good reasons to introduce a new language or framework. This must be balanced against the long term maintenance cost and complexity introduced to the engineering organization. It should be done with strong intention rather than convenience. Valid reasoning: “We should write the Ad code in Java because Java is the standard language across the broader advertising ecosystem and all the libraries assume Java.” Less valid: “There’s a cool new library in Rust. Let’s write this service in Rust.”

⭑ Opinionated conventions

We should have conventions that are opinionated and consistent across all of engineering. Our aspiration is that code written by multiple engineers should be indistinguishable in style, patterns, and structure.

Why is this important? We want engineers to easily move between areas of the code base in order to build end-to-end functionality for our customers. Code is easier to read and modify when we have consistent conventions. We can build better infrastructure and tooling around known conventions and best practices.

Optimize for Learning

We foster a culture of learning that optimizes for exploration, openness, and creativity. We hire employees who are united by our shared curiosity and desire to apply our learnings towards accomplishing our purpose.

⭑ Give and receive feedback with good intent

We should give and receive feedback with the assumption of good intent. When giving feedback, we should strive to make it constructive, timely, and actionable. When receiving feedback, we should assume the giver is trying to help us improve. We should always acknowledge and be thankful for the feedback. That does not mean the feedback must be accepted. It’s also okay to thoughtfully consider and reject feedback. When a person acts on our feedback, we should close the loop by acknowledging the improvement.

Why is this important? We’re a growth-oriented culture and we learn and improve through feedback. We want engineers to feel comfortable providing feedback because withholding it denies the other person information they may need to grow and results in worse outcomes for the company.

Limiting principle? Feedback is never an excuse to create a culture of “brilliant jerks”. Feedback should never be weaponized. Feedback should only be given when the intent is to help the person improve and the feedback is actionable.

⭑ Farm for dissent

Engineers should socialize their ideas to seek out diverse perspectives from both inside and outside engineering. The larger the risk and cost of a mistake, the more important it is to seek out feedback. Engineers should exercise good judgment on who are the best people to provide feedback based on the particular domain of the change. Often the best people to consult are in other functions such as Data Science, Marketing, Product, Design, ProdOps, Legal, etc.

Why is this important? Everybody has blind spots and we reach better architecture and design decisions when we seek out feedback and diverse perspectives.

⭑ Whisper Wins. Sunlight Failure.

We should strive to shine as much light on our failures as possible so that the organization can learn from the mistakes. Failure is an inevitable part of the process. We celebrate our wins but we should be humble and avoid building a culture of bragging.

Why is this important? We want to create a culture that thinks big and takes risks. Many things we try will fail and we want to learn from failure. We should never stigmatize failing.

⭑ Retrospective Culture and SEVs

Processes like retrospectives and SEVs (Site EVents) are for learning, never for blaming. We learn from our outages and our mistakes and make improvements to our infrastructure and tooling to try to prevent similar issues in the future.

Why is this important?

“If you’re not making mistakes, then you’re not doing anything. I’m positive that a doer makes mistakes.” -John Wooden

Craftsmanship

Craftsmanship is generally defined as skill in building something that is of high quality where great attention was paid to the details and no corners were cut. This is an aspirational goal and it’s often in tension with many of our other principles (e.g. Moving Fast).

At Nextdoor, we do not adopt a pure craftsmanship approach as that would almost certainly mean we were prioritizing perfection over delivering value and moving fast. Instead, we should always understand what a well-crafted solution looks like. When we compromise, it should never be because we were lazy or sloppy. Rather, it should always be due to a tradeoff that was made in service of our purpose. We should take great pride in what we build. A compromise that is made for the right reason can still be a point of pride.

Let us know in the comments which principles resonated with you, and whether you have examples or other engineering principles that you apply within your teams. Also check out our Careers page; we have many open opportunities across all our teams and functions.

The sticky notes from the first brainstorming session on engineering principles.

Engineering Principles (v1) at Nextdoor was originally published in Nextdoor Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.