<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Nextdoor Engineering - Medium]]></title>
        <description><![CDATA[Nextdoor is the neighborhood hub for trusted connections and the exchange of helpful information, goods, and services. We believe that by bringing neighbors together, we can cultivate a kinder world where everyone has a neighborhood they can rely on. - Medium]]></description>
        <link>https://engblog.nextdoor.com?source=rss----5e54f11cdfdf---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Nextdoor Engineering - Medium</title>
            <link>https://engblog.nextdoor.com?source=rss----5e54f11cdfdf---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 10 Jun 2026 06:48:24 GMT</lastBuildDate>
        <atom:link href="https://engblog.nextdoor.com/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Scaling Nextdoor’s Datastores: Part 5]]></title>
            <link>https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-5-5221da60f374?source=rss----5e54f11cdfdf---4</link>
            <guid isPermaLink="false">https://medium.com/p/5221da60f374</guid>
            <category><![CDATA[database-consistency]]></category>
            <category><![CDATA[cache-invalidation]]></category>
            <category><![CDATA[database-scalability]]></category>
            <category><![CDATA[caching-strategies]]></category>
            <category><![CDATA[rdbms]]></category>
            <dc:creator><![CDATA[Slava Markeyev]]></dc:creator>
            <pubDate>Wed, 19 Mar 2025 15:09:15 GMT</pubDate>
            <atom:updated>2025-03-19T15:18:48.758Z</atom:updated>
            <content:encoded><![CDATA[<p>In this final installment of the Scaling Nextdoor’s Datastores blog series, we detail how the Core-Services team at Nextdoor solved cache consistency challenges as part of a holistic approach to improve our database and cache scalability and usability.</p><p>In <a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-4-c9d3d3edcd34">Part 4: Keeping the cache consistent</a>, we highlighted a class of consistency issues arising from racing cache writes and introduced an approach for forward cache versioning as a mechanism to avoid inconsistencies. The cache is able to decide which write to persist and which to reject because it is aware of the version of data it currently has. However, this is only a partial solution because it assumes that writers always reach the cache, and do so in a timely manner.</p><h3>Missed Writes</h3><p>Let’s consider the scenario where <em>Writers A</em> and <em>B</em> both performed an update to the same row in the database and have not yet updated the cache. <em>Writer A</em> holds <em>Version 1</em> and <em>Writer B</em> holds <em>Version 2</em>. What happens if <em>Writer B</em> with <em>Version 2</em> fails to talk with the cache?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/615/0*GSlT6qBJGte5Og16" /><figcaption>Writer B fails to write to the cache.</figcaption></figure><p>In this case the cache becomes inconsistent, and we can’t rely on the writers to restore consistency. A process must exist outside of this interaction to fix up the cache when Version 2 is written to the database but fails to be written to the cache.</p><h3>Change Data Stream</h3><p>To solve this problem we tap into a common feature provided by most modern databases, a Change Data Capture (CDC) Stream. 
A CDC Stream is a mechanism to subscribe to row-level changes in a database.</p><p>The change stream contains a row’s previous column values along with the new values. Here’s a visual example of the change stream when the <em>last_name</em> field gets updated in the database.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/909/0*SohlgRgkJIGz8iOy" /></figure><p>For visual clarity the changed values have been underlined in red.</p><h4>Reconciler</h4><p>Since the database is the source of truth and the CDC Stream emits all changes, a consumer of this stream can clean up any consistency issues in the cache. In our system we call this process reconciliation and it’s performed by the “Reconciler.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/742/0*6rg2McNY76b3Gx4y" /></figure><p>The reconciler is responsible for converting CDC change information into input for the <em>del_if_version</em> call to the cache. The arguments to <em>del_if_version</em> are the version and the cache key.</p><p>As discussed in Part 4, <em>del_if_version</em> performs conditional deletion of data in the cache (as evaluated by the cache itself): the key is deleted only if the version stored in the cache is less than the supplied version.</p><blockquote>Author’s Note: Having both old and new column values is particularly important when your cache key is composed of row values or if you have secondary cache keys for unique column values.</blockquote><p>For the above example the <em>del_if_version</em> function call would have parameters <em>version=2</em> and <em>key=lastnames:1</em>. The key parameter is made up of two parts, the table name (lastnames) and the primary key for the specific record (row 1).</p><h4>Reconciling Missed Cache Writes</h4><p>Having covered the building blocks, let’s reconsider the case of a missed write. 
In the below diagram we show that the Reconciler is able to delete <em>Version 1</em> from the cache in the event <em>Writer B </em>does not succeed in updating the cache with <em>Version 2</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/876/0*Xavx9keXX95S7fGe" /></figure><blockquote>Author’s Note: Astute readers will point out that we could have avoided this issue if the database change stream was used to update the cache. The caveat to this approach is that if you want to keep a look-aside cache design, you then must maintain two different methods of serialization (from ORM objects and from the raw change stream) in sync.</blockquote><h4>Missed Write During Cache Fill</h4><p>Thus far we’ve only discussed writers doing database updates and then updating the cache. However, anytime a reader checks the cache and gets a miss, the reader must populate the cache after fetching from the database.</p><p>The last scenario to consider is what happens when a reader is populating the cache due to a cache miss while a writer is performing a write? Depending on the timing of the sequence of steps the cache may become inconsistent. Consider the following:</p><ol><li>Due to a cache miss the Reader reads <em>Version 1</em> from the database</li><li>The Writer reads <em>Version 1</em> from the database, performs business logic, and writes <em>Version 2</em> to the database</li><li>The Writer fails to write <em>Version 2</em> to the cache</li><li>The Reconciler receives <em>Version 2</em> and deletes <em>Version 1 </em>from the cache</li><li>After some time the Reader re-populates <em>Version 1</em> in the cache</li></ol><p>The end state is the cache has Version 1 and is now inconsistent with the database.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/880/0*U7zXeUgaFU9kJYjw" /></figure><h4>Reconciling Missed Write During Cache Fill</h4><p>The solution to this consistency problem is conceptually simple. 
The Reconciler must lag behind such that Step 4 in the above diagram happens <em>after </em>Step 5. However, a problem arises in that until the reconciler deletes the stale version in the cache, the cache remains inconsistent.</p><p>For our desired consistency level we wanted to have the Reconciler fix up the most commonly occurring issues as quickly as possible. In practice missed writes are extremely uncommon for us.</p><p>A simple solution to this problem is having two reconcilers. One instance reads the change stream and applies fixes to the cache in near real time, while another is <strong>always</strong> a fixed amount of time behind, performing any final mopping up. In our case this delay was slightly longer than our web request timeout.</p><blockquote>Author’s Note: There are a handful of other scenarios where a cache inconsistency may occur, but those are fairly complex to articulate. We encourage readers to think through the cases of incomplete and out of order steps. Readers will find that the two-pass reconciliation handles those cases.</blockquote><h3>Reconciliation Pipeline Implementation</h3><p>The discussion about reconciliation has, until now, been theoretical, without addressing the key implementation details and design goals. One functional requirement was to guarantee that users have a cohesive experience, such as seeing their own writes reflected after a page refresh. This meant that reconciliation needed to happen in near real time to provide a seamless experience in the event of a missed cache write.</p><p>The choice to use row versions for conditional deletion of cache keys has a subtle but important consequence: changes do not need to be processed in order. 
This conscious design choice meant that we could horizontally scale our processing in order to satisfy low latency reconciliation.</p><p>The Reconciler, which we have only discussed conceptually until now, is actually made up of three different pieces.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*rrt7folsOuBXUVNC" /></figure><h4>Pg-bifrost</h4><p><a href="https://github.com/Nextdoor/pg-bifrost">Pg-bifrost</a>, an open-source tool previously developed by our team, consumes the PostgreSQL WAL Replication log (CDC Stream) and republishes it for use in other applications. It was built with low latency republishing in mind.</p><h4>Apache Kafka</h4><p>Kafka was selected as our persistent message bus for the WAL change stream from the database due to its low latency, persistence guarantees, and well-supported consumer and producer APIs. As Nextdoor was already utilizing Kafka, it was a natural choice for our team.</p><h4>Reconciler</h4><p>The reconciler is simply a GoLang based Kafka consumer that reads the WAL change stream and executes Redis <em>del_if_version</em> calls. The two-pass reconciliation, discussed earlier, was implemented using a time wheel to maintain a fixed offset while remaining relatively straightforward.</p><h3>Putting It All Together</h3><p>This series began by exploring typical relational database scalability issues and the questions teams must consider when addressing them. <a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-2-513922e4b4b1">Part 2 </a>presented an approach to improve read replica usage with semi-intelligent routing. <a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-3-e9b4dd8a9393">Part 3</a> discussed the importance of serializing cached data and its impact on scalability. <a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-4-c9d3d3edcd34">Part 4</a> outlined a cache consistency strategy using row versions and conditional upserts. 
The final installment, Part 5, highlighted consistency issues and profiled a reconciliation system to maintain database-cache consistency.</p><p>This multifaceted strategy solved the following scalability challenges for our datastores:</p><ul><li>Reduced the load on our primary PostgreSQL databases</li><li>Ensured that only forward versions of database rows were cached for a limited time</li><li>Prevented thundering herds to the cache and databases when database schemas changed</li><li>Made better use of read replicas, which enable effective horizontal scalability for RDBMS</li></ul><h3>Future Work</h3><p>This project was a stepping stone on our database scalability journey. Keen readers likely noticed that our caching system is limited to retrieving items by their primary key or unique attributes. This is because key-value storage in a cache is best suited for these types of lookups, and they are essentially an extension of unique database indexes.</p><p>The next step in our database scalability journey is to enable the storage of lists within the cache. This enhancement will allow us to efficiently handle common database queries, such as “<em>Give me a list of people in a neighborhood”</em>.</p><h3>Parting Thoughts From the Authors</h3><p><a href="https://medium.com/u/d8db977cf129">Slava Markeyev</a>: Technologies like NoSQL and NewSQL are often thought of as magic pill solutions to scalability challenges. While they have many advantages over traditional RDBMS solutions, a datastore can fundamentally only be as scalable as the schema and query patterns allow it to be. Changing those two after the fact is akin to rebuilding the engine while the plane is in the air. 
Doing so is certainly possible (heck, we tried it), but that is a story for another day.</p><p>My parting thought for readers is that SQL databases, when used in conjunction with secondary indexes like Redis/Valkey for caching and ElasticSearch for search, will get you pretty far if you let them.</p><p><a href="https://medium.com/u/e734698562ba">Tushar Singla</a>: While doing architecture interviews, the need often arises to consider how to scale the database to support 100M or even 1B+ users. A common strategy is to “scale out” or add “caching.” Typically, mentioning those concepts is enough to communicate understanding of these horizontal scaling techniques, and the interview moves on. However, only after building out this system in real life at scale can one appreciate the intricacies, complexities, and myriad of edge cases that must be handled in order to serve the users appropriately.</p><p>The next time you mention these techniques during an interview, take some time to consider what it might look like to actually finish the implementation.</p><p><a href="https://medium.com/u/3d01db79be59">Ronak Shah</a>: Scaling a datastore presents unique challenges, and there is no one-size-fits-all solution. Finding the right approach for your system requires thoughtful discussion and a rigorous design process. Although our work was complex and challenging, it was also a truly rewarding experience to be part of.</p><p>P.S. 
Having robust observability tools and a solid unit testing framework can save countless engineering hours spent debugging cache consistency issues.</p><hr><p><a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-5-5221da60f374">Scaling Nextdoor’s Datastores: Part 5</a> was originally published in <a href="https://engblog.nextdoor.com">Nextdoor Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Scaling Nextdoor’s Datastores: Part 4]]></title>
            <link>https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-4-c9d3d3edcd34?source=rss----5e54f11cdfdf---4</link>
            <guid isPermaLink="false">https://medium.com/p/c9d3d3edcd34</guid>
            <category><![CDATA[lua]]></category>
            <category><![CDATA[cache-control]]></category>
            <category><![CDATA[infrastructure]]></category>
            <category><![CDATA[cache-invalidation]]></category>
            <category><![CDATA[database-consistency]]></category>
            <dc:creator><![CDATA[Ronak Shah]]></dc:creator>
            <pubDate>Wed, 19 Mar 2025 15:08:57 GMT</pubDate>
            <atom:updated>2025-03-19T15:14:49.650Z</atom:updated>
            <content:encoded><![CDATA[<p>In this part of the Scaling Nextdoor’s Datastores blog series, we will see how the Core-Services team at Nextdoor keeps its cache consistent with database updates and avoids stale writes to the cache.</p><p>In this post, we’ll focus specifically on inconsistencies caused by racing writes and our solution. We’ll discuss other causes and our full solution for consistent caching in the next installment of our blog: <a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-5-5221da60f374">Part 5: A time-bounded eventually-consistent cache</a>.</p><h3>Inconsistent Cache</h3><p>Caches can become inconsistent with the database for several reasons, such as:</p><ul><li><strong>Racing Writes / Concurrent Updates:</strong> Multiple writes occurring simultaneously can result in a stale cache.</li><li><strong>Missed Writes / Failing to Update the Cache:</strong> Failure to update or set cache correctly after a database write.</li><li><strong>Delayed Cache Updates or Deletes:</strong> Slow propagation of updates or invalidation can leave the cache out of sync.</li><li><strong>Application-Level Bugs:</strong> Bugs in application side caching logic.</li></ul><p>Maintaining cache consistency with the database is crucial for data accuracy, especially in distributed systems with concurrent web requests. Consistent caching ensures reliable read-after-write behavior, improving performance and user experience. Without it, applications may face unpredictable behavior and user frustration. 
A well-designed caching system boosts performance, ensures consistency, and delivers up-to-date data even under high concurrency.</p><h3>Racing Writes</h3><p>Let’s look at the scenario where two writers, A and B, update the same user in the database but write to the cache in a different order, causing the cache to become inconsistent with the database.</p><blockquote>Author’s Note: In examples moving forward we’ll use “User_1” to mean id=1 in the “users” table.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*9wpnUN7ebkgi8PN0" /><figcaption>Look-Aside Cache with two writers.</figcaption></figure><p>Sequence of operations:</p><ol><li><em>Writer A</em> updates the name of <em>user_1</em> to <em>Foo</em> in the database.</li><li><em>Writer B</em> updates the name of the same <em>user_1</em> to <em>Bar</em> in the database.</li><li><em>Writer B </em>updates the cache for <em>user_1</em> with <em>name = Bar.</em></li><li><em>Writer A</em> updates the cache for user_1 with <em>name = Foo</em>.</li></ol><p>As shown, the cache and database end up out of sync because the cache updates by Writer A and B occurred in a different order than the database updates. This results in an inconsistent <em>name</em> value for <em>user_1</em> between the database (<em>Bar</em>) and the cache (<em>Foo</em>). Essentially, this means stale writes were allowed in the cache.</p><p>From the above example, it’s clear that the cache allows stale copies to overwrite newer updates, leading to inconsistencies. Since execution order in distributed systems is beyond our control, our objective is to equip the cache with mechanisms to detect and reject stale updates to prevent such discrepancies.</p><h3>Our Approach</h3><p>To ensure the cache can distinguish between new changes and stale ones, we need a mechanism to identify and reject stale updates while correctly applying newer ones. 
This can be achieved by introducing a version identifier for database table rows and using this version identifier to reject stale cache updates.</p><h4>Adding row versions to the database</h4><p>While there are different approaches to generating row version information for database rows, the key part is that the client must not participate in the process. That is, the next version value must be generated on the database side in such a way that it is unique and monotonic.</p><blockquote>Author’s Note: Simply using a timestamp is not sufficient because it is not unique and is not always monotonic, as one might hope it would be.</blockquote><p>Our solution was to introduce a new column called db_version to our tables. To handle incremental versioning, we implemented a Postgres database trigger. This trigger increments the version for each update and initializes it to 1 for the first insert.</p><p>Postgres Database Trigger:</p><pre>-- Trigger function: the database, not the client, assigns the row version.<br>CREATE OR REPLACE FUNCTION uvf_update_db_version()<br>RETURNS TRIGGER AS $$<br>BEGIN<br>    IF TG_OP = &#39;INSERT&#39; THEN<br>        -- New rows always start at version 1.<br>        NEW.db_version = 1;<br>    ELSEIF TG_OP = &#39;UPDATE&#39; THEN<br>        -- Every update moves the version forward by one.<br>        NEW.db_version = OLD.db_version + 1;<br>    END IF;<br>    RETURN NEW;<br>END;<br>$$ LANGUAGE plpgsql;<br><br>-- Attach the function to a table so it runs before each row is written.<br>CREATE TRIGGER uv_update_db_version BEFORE INSERT OR UPDATE ON &lt;table_name&gt; FOR EACH ROW EXECUTE FUNCTION uvf_update_db_version();</pre><blockquote>Author’s Note: If you plan to use a trigger or UDF, ensure that your database provides the same consistency level as your update. That is, you want the row update and version update to be an atomic operation from a consistency perspective.</blockquote><p>When our application updates a row in the database, it also needs to update the cache with the new value. 
Since Django’s ORM does not return the updated row, we perform a select query after the update operation to retrieve the incremented version from the database. Both the update and select queries are executed within a transaction block.</p><h4>Adding Version to the Cache</h4><p>Since we added version information to database tables, we also need to include versioning in the cache to detect and reject stale writes.</p><p>To achieve this, we incorporate the version as a metadata header in the cache value. This header contains the version information and additional metadata. For more details, refer to Part 3: Appropriately serializing data for caching.</p><p>The version is included as a metadata header to maintain separation between serialization methods and cache functionality. This means the cache doesn’t need to know how to interpret the payload to get the version information from the serialized object.</p><h4>Atomic Cache Operations</h4><p>To correctly update the cache, we need to compare the supplied and stored version and perform the update as an atomic operation. Doing this on the client side is not feasible because we operate in a distributed and concurrent environment, where multiple cache updates to the same data can occur simultaneously. If the version check happens on the client side, there is a risk of stale writes, as the version may have changed after the check but before cache update.</p><p>A similar issue can exist on the cache side if the compare and update are not an atomic operation. We were able to leverage Redis’s single threaded nature to our advantage because it naturally protects against concurrent access. This allowed us to write custom Lua functions, executed by Redis, to perform conditional updates and deletes without worrying about concurrency issues.</p><h3>Custom Lua functions</h3><p>To effectively detect and reject stale writes to the cache, we implemented two Lua functions to perform conditional updates and deletes. 
The function stubs are:</p><pre>--- Sets a key only if the provided version is greater than the stored version or if the key does not exist.<br>-- If the supplied version is equal to or lower than the stored version, the operation is rejected.<br>-- @param key (string) Name of the key.<br>-- @param version (string) Corresponds to the database version value.<br>-- @param metadata (string) Additional header information to store.<br>-- @param value (string) Data to associate with the key.<br>-- @param ttl (number) Time-to-live for the key in seconds.<br>-- @return (string) &quot;OLD&quot; if the supplied version is lower than the stored version.<br>--                  &quot;OK&quot; if the key was set or if versions match.<br>function set_if_version(key, version, metadata, value, ttl)<br>   ...<br>end<br><br>--- Deletes a key only if the stored version is less than the supplied version.<br>-- @param key (string) Name of the key.<br>-- @param version (string) Corresponds to the database version value.<br>-- @return (number) 0 if the key was not deleted or did not exist.<br>--                 1 if the key was deleted.<br>--                -1 if the key was not deleted because the stored version is greater than the supplied version.<br>function del_if_version(key, version)<br>   ...<br>end</pre><h3>Racing Write Example</h3><p>Let’s bring everything together and see how we handle the racing write problem and prevent stale writes to cache.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*0i4YO8rHBObpfNen" /></figure><p>Sequence of operations:</p><ol><li>Writer A updates the name of User_1 to Foo in the database.</li><li>Writer A receives db_version=12 from database</li><li>Writer B updates the name of the same User_1 to Bar in the database</li><li>Writer B receives db_version=13 from database</li><li>Writer B updates the cache for User_1 with version=13 using set_if_version</li><li>Writer B receives OK from set_if_version call since it updated the 
cache</li><li>Writer A updates the cache for User_1 with version=12 using set_if_version</li><li>Writer A receives OLD from set_if_version call since it supplied an older version and the cache update was rejected</li></ol><p>With this solution, we can effectively detect and prevent stale writes to the cache.</p><h3>Conclusion</h3><p>To address cache inconsistencies due to racing writers, a look-aside cache must have a way to reject stale writes. This means each database row needs a unique and monotonic version, and the cache must perform conditional updates in a serializable manner.</p><p>This approach is a building block to solve more advanced consistency edge cases such as missed writes and racing cache fills. To learn more about these issues and the solution we implemented, stay tuned for <a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-5-5221da60f374">Part 5: A Time-Bounded Eventually-Consistent Cache</a> of the Scaling Nextdoor’s Datastores blog series.</p><p>Authors: <a href="https://medium.com/u/3d01db79be59">Ronak Shah</a>, <a href="https://medium.com/u/d8db977cf129">Slava Markeyev</a>, and <a href="https://medium.com/u/e734698562ba">Tushar Singla</a></p><hr><p><a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-4-c9d3d3edcd34">Scaling Nextdoor’s Datastores: Part 4</a> was originally published in <a href="https://engblog.nextdoor.com">Nextdoor Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Scaling Nextdoor’s Datastores: Part 3]]></title>
            <link>https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-3-e9b4dd8a9393?source=rss----5e54f11cdfdf---4</link>
            <guid isPermaLink="false">https://medium.com/p/e9b4dd8a9393</guid>
            <category><![CDATA[cache-invalidation]]></category>
            <category><![CDATA[valkey]]></category>
            <category><![CDATA[serialization-format]]></category>
            <category><![CDATA[caching]]></category>
            <category><![CDATA[lua]]></category>
            <dc:creator><![CDATA[Ronak Shah]]></dc:creator>
            <pubDate>Wed, 19 Mar 2025 15:08:43 GMT</pubDate>
            <atom:updated>2025-03-24T05:28:28.852Z</atom:updated>
            <content:encoded><![CDATA[<p>In this part of the Scaling Nextdoor’s Datastores blog series, we’ll explore how the Core-Services team at Nextdoor serializes database data for caching while ensuring forward and backward compatibility between the cache and application code.</p><p><a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-1-234d0cf67665">In part 1 of this series</a> we discussed how ORMs, object-relational mapping frameworks, help abstract away database-specific schemas and queries from application code. Developers simply utilize objects in their application’s language to access database data.</p><p>Here’s a simple example of using Python’s Django ORM to define a <a href="https://docs.djangoproject.com/en/5.1/topics/db/models/">model</a>:</p><pre>from django.db import models<br><br>class User(models.Model):<br>    first_name = models.CharField(max_length=30)<br>    last_name = models.CharField(max_length=30)</pre><p>The associated SQL create table would look roughly like:</p><pre>CREATE TABLE users (<br>    &quot;id&quot; bigint NOT NULL PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,<br>    &quot;first_name&quot; varchar(30) NOT NULL,<br>    &quot;last_name&quot; varchar(30) NOT NULL<br>);</pre><p>Developers would then access database data like this:</p><pre>user_id = 123<br>user = User.objects.get(id=user_id)<br>print(user.first_name)</pre><h3>Object Byte Serialization for Caching</h3><p>An issue arises when adding a look-aside cache such as Redis/Valkey to an application: <em>How do you store what you got from the database in the cache?</em></p><p>A common solution to caching complex objects, such as those from ORMs, is object byte serialization. This process converts language objects into bytes before storing them in the cache. When reading from the cache, the process is reversed: the byte data is turned back into language objects. 
For instance in Python this is often done with the <a href="https://docs.python.org/3/library/pickle.html"><em>pickle</em></a> package.</p><p>The interaction between the application, database, and the cache looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GEpMGtVmjIFT61l7qAi9SQ.png" /><figcaption>Look-Aside Cache</figcaption></figure><pre>import pickle<br><br># Try getting from cache (&#39;None&#39; if not in cache)<br>user_bytes = cache.get(&quot;user_123&quot;)<br><br>if user_bytes is not None:<br>   # Read bytes using pickle<br>   user = pickle.loads(user_bytes)<br>else:<br>   # Fetch from database<br>   user = User.objects.get(id=123)<br>   <br>   # Convert to bytes<br>   user_bytes = pickle.dumps(user)<br><br>   # Store in cache<br>   cache.set(&quot;user_123&quot;, user_bytes)</pre><p>While this method enhances performance by reducing database queries and leveraging cached data, it also presents challenges, particularly when serialized data is tightly coupled to specific runtime environments, package versions, and schema definitions.</p><h4>Issues with Byte Serialization</h4><p><strong>Bound to Runtime Version and Package Version: </strong>Serialized objects often embed information about the object’s structure as defined by the code at the serialization time. When the code or its dependencies are updated (e.g., a new release), serialized objects may fail to be deserialized, making the cache data incompatible with the new code version.</p><p><strong>Thundering Herd Problem during Migrations: </strong>When a schema migration occurs (e.g. adding a new field/column), the cache might suddenly contain a mix of old and new serialized data. As a result it can force the application to treat many cache entries as misses due to deserialization failures. If the cached items are large or in high demand (“hot”), a simultaneous cache miss across many clients can result in a “thundering herd” effect. 
In this scenario, numerous processes will concurrently query the database to refill the cache, placing excessive load on the database.</p><blockquote>Author’s Note: The implied solution to deserialization errors is to query the database and re-fill the cache.</blockquote><h4>Forward and Backward Compatibility</h4><p>To address the challenges discussed above, it is important to design a cache serialization strategy that ensures both forward and backward compatibility.</p><ul><li><strong>Forward Compatibility</strong>: Older versions of the application should be able to read cache entries that were written by newer application versions.</li><li><strong>Backward Compatibility: </strong>New versions of the application should be capable of reading cache entries that were written by previous application versions.</li></ul><h3>Serialization format of choice</h3><p>When evaluating our options, we compared various serialization formats based on versioning compatibility, the performance of serialization and deserialization, and the resulting serialized byte size. Ultimately, we chose <a href="https://msgpack.org/index.html">MessagePack</a> to serialize Django Model objects.</p><h4>Forward and Backward Compatibility</h4><p>For forward compatibility, the MessagePack Python library can ignore new fields during deserialization if they don’t exist in the current version of the code. This ensures that older versions of the code can still deserialize cache entries from an updated data model.</p><blockquote>Author’s Note: Since we rarely remove fields from Django Models, older code typically doesn’t encounter missing fields in the cache. 
When field removal is necessary, we use Django’s <a href="https://docs.djangoproject.com/en/5.1/ref/migration-operations/#django.db.migrations.operations.SeparateDatabaseAndState">SeparateDatabaseAndState</a> to decouple database schema changes from model updates.</blockquote><p>For backward compatibility, when the new code version deserializes old cache entries, it populates any missing fields from the cache with their default values. We require developers to provide default values when adding new fields.</p><h4>Performance</h4><p>To evaluate MessagePack’s performance, we conducted extensive tests comparing its standalone serialization/deserialization performance and its performance when integrated with our application. Our findings showed that MessagePack is slower than some alternatives (e.g., pickle). However, in our application tests, this difference did not noticeably affect overall latency when handling web requests.</p><h4>Serialization Byte Size</h4><p>MessagePack produces a smaller byte stream compared to other formats. Additionally, when combined with compression methods, we can further reduce the size of the serialized data.</p><h3>Implementation Details</h3><p>To support additional use cases and potential future revisions to cache storage, such as compression, we prepend additional information to the serialized value prior to writing it to the cache.</p><p>The prepended header is composed of two parts: a 2-byte metadata field and an 8-byte (64 bit) version field. 
For example, a serialized value might look like this:</p><pre>\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01&lt;messagepack bytes&gt;</pre><ul><li>The first 2 bytes (<em>\x00\x00</em>) represent the metadata where we store serialization format information.</li><li>The following 8 bytes (<em>\x00\x00\x00\x00\x00\x00\x00\x01</em>) encode object version information.</li><li>The remaining bytes are the MessagePack-serialized data.</li></ul><p>We’ll explore the role and necessity of the version information bytes, and how we use them in <a href="https://medium.com/@ronakts/c9d3d3edcd34">Part 4: Keeping the Cache Consistent.</a></p><h3>Conclusion</h3><p>Ensuring forward and backward compatibility in caching alone wasn’t enough to deliver accurate data to users — we also needed a way to maintain cache consistency with our database. Maintaining cache consistency is crucial in a distributed environment, where we are handling concurrent web requests while delivering up-to-date data to users. To learn more about the importance of cache consistency and how we achieve it, check out <a href="https://medium.com/@ronakts/c9d3d3edcd34">Part 4: Keeping the Cache Consistent</a> in the Scaling Nextdoor’s Datastores blog series.</p><p>Authors: <a href="https://medium.com/u/3d01db79be59">Ronak Shah</a>, <a href="https://medium.com/u/d8db977cf129">Slava Markeyev</a>, and <a href="https://medium.com/u/e734698562ba">Tushar Singla</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e9b4dd8a9393" width="1" height="1" alt=""><hr><p><a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-3-e9b4dd8a9393">Scaling Nextdoor’s Datastores: Part 3</a> was originally published in <a href="https://engblog.nextdoor.com">Nextdoor Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Scaling Nextdoor’s Datastores: Part 2]]></title>
            <link>https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-2-513922e4b4b1?source=rss----5e54f11cdfdf---4</link>
            <guid isPermaLink="false">https://medium.com/p/513922e4b4b1</guid>
            <category><![CDATA[database-scalability]]></category>
            <category><![CDATA[rdbms]]></category>
            <category><![CDATA[orm]]></category>
            <category><![CDATA[read-replica]]></category>
            <dc:creator><![CDATA[Tushar Singla]]></dc:creator>
            <pubDate>Wed, 19 Mar 2025 15:08:32 GMT</pubDate>
            <atom:updated>2025-03-19T15:11:11.911Z</atom:updated>
            <content:encoded><![CDATA[<p>In the second installment of Nextdoor’s “Scaling Nextdoor’s Datastores” blog series, the Core-Services team discusses challenges faced after implementing database read replicas.</p><p>Adding read replicas to an existing database is a very common pattern as applications or products evolve to handle increased demand. Typically, the implementation details are hand waved and it’s assumed that this strategy will work. However, that is rarely the case, and we’ll dive into some more of the intricacies around the implementation.</p><h3>Initial Attempt</h3><p>When replicas were first introduced in the Nextdoor stack, we gave the product engineers latitude to choose when they wanted to have their query routed to a read replica or to the primary. This was done by leveraging the existing routing mechanism in our ORM, Django.</p><p>This seemed like the right idea at the time because the product engineers had the most context around consistency requirements within their changes and load characteristics of their product feature. Therefore, they would have the best ability to judge which node to send their query to. However, as our business logic evolved and became more feature-rich, product engineers began to add abstraction layers to help abstract complex operations away from business logic.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1rQN0RVBezWrRK5k" /><figcaption>In this design evolution there is a high frequency read, followed by a low frequency conditional write, followed by a read. The read performed after the write should be routed to the primary, but that may get buried in abstractions and this requirement regressed.</figcaption></figure><p>The explicit routing decisions engineers made became buried and subsequently created a serious problem for users of these abstractions. 
If one abstraction method was performing a write and another a read, they could not safely be used together due to read-after-write consistency issues. Due to replication lag between the primary and replica databases, a race condition arises when the application attempts to read data from a replica after performing a write.</p><p>We had created a system where engineers had to be aware of <em>the entire call stack</em> and be able to determine if this situation would apply to them or not. The easiest way to handle this situation? Always use the primary…</p><h3>The Band-aid</h3><p>A common solution engineers employed was to wrap higher-order business logic in database transactions because within the context of a transaction, all queries are routed to the primary.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1001/0*4mF26LwoHrXVh7C5" /><figcaption>Shows that the choice to route to primary is made in the abstraction layer so functions A, B, and C all use the primary whether they need to or not.</figcaption></figure><blockquote>Author’s Note: Our default isolation level for transactions, repeatable read, gave engineers a false sense of security, as it only guarded against replication lag and not racing writes with concurrent transactions. We have since improved this to ensure read-your-own-write semantics.</blockquote><p>This strategy had a negative effect on database load because it indiscriminately caused all queries intended to be sent to replica databases to be sent to the primary database. The impact of this problem increased as:</p><p>1) more business logic leveraged the database</p><p>2) business logic increased its use of existing abstractions</p><p>3) query performance decreased as more data was added</p><p>What transpired was a years-long erosion of the capacity benefits the additional read replicas initially provided. 
As a result, the load on the primary database node became one of the most pressing issues with Nextdoor’s database stacks.</p><p>It was clear that the initial approach of exposing routing choices to engineers was no longer tenable, and the team set out to find a way of making the replica vs. primary decision on behalf of the product engineers.</p><h3>Reimagining</h3><p>Using ORMs (Object-Relational Mappers) is controversial. There are many pros and cons, and we won’t debate all of them here. However, one of the advantages of an ORM is that there is a consistent layer of abstraction between the database and the application. This allowed us to inject a simple piece of custom logic to keep track of which tables had been written to while processing a web request. Why is it helpful to keep track of modified tables? By doing this we could automatically make informed decisions about where to route subsequent read queries, regardless of where they were buried in the business logic stack.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/714/0*QvQjV7hguzuIsDil" /></figure><p>This simple strategy, coupled with our already high read-to-write ratio, allowed us to shift much of the read traffic to the replica databases and substantially reduce our reliance on the primary database.</p><blockquote>Author’s Note: While the strategy was simple, we did have to clean up all of the manual routing decisions along with inappropriate usages of transactions.</blockquote><p>While this naive approach was rather effective, we realized that this strategy was actually too conservative. We noticed that our average database replication lag was around 20ms while our web requests lasted an order of magnitude longer. That means that even after the update had been replicated to the replicas, we were still disallowing queries for that table to the read replica. This provided an opportunity to use a timing-based system that marked tables as re-eligible once the p99.9 replication lag had elapsed. 
With this additional optimization, we were able to re-route most of the queries from our primary to the read replicas.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NdfpK4fefVq2orbg" /></figure><h3>Takeaway</h3><p>Each application will have its own eccentricities that make RDBMS scalability a challenge for platform teams but we hope this post provides a cautionary tale, as well as a potential solution for similar cases.</p><p>In the <a href="https://medium.com/@ronakts/e9b4dd8a9393">next post, <em>Appropriately serializing data for caching</em></a>, we’ll cover how we serialize database data for caching and some of the pitfalls to be aware of when introducing caching to an application.</p><p>Authors: <a href="https://medium.com/u/e734698562ba">Tushar Singla</a>, <a href="https://medium.com/u/d8db977cf129">Slava Markeyev</a> and <a href="https://medium.com/u/3d01db79be59">Ronak Shah</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=513922e4b4b1" width="1" height="1" alt=""><hr><p><a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-2-513922e4b4b1">Scaling Nextdoor’s Datastores: Part 2</a> was originally published in <a href="https://engblog.nextdoor.com">Nextdoor Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Scaling Nextdoor’s Datastores: Part 1]]></title>
            <link>https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-1-234d0cf67665?source=rss----5e54f11cdfdf---4</link>
            <guid isPermaLink="false">https://medium.com/p/234d0cf67665</guid>
            <category><![CDATA[scalability]]></category>
            <category><![CDATA[database-consistency]]></category>
            <category><![CDATA[caching]]></category>
            <category><![CDATA[rdbms]]></category>
            <category><![CDATA[database-scalability]]></category>
            <dc:creator><![CDATA[Slava Markeyev]]></dc:creator>
            <pubDate>Wed, 19 Mar 2025 15:08:13 GMT</pubDate>
            <atom:updated>2025-03-19T15:21:44.385Z</atom:updated>
            <content:encoded><![CDATA[<p>At Nextdoor, the Core-Services team is responsible for the primary set of databases and caches that power the Nextdoor platform. This blog series explores our 2024 initiatives to enhance the scalability of this critical infrastructure. When we sat down at the whiteboard we sought to address two related problems:</p><ol><li>How can we reduce load on our primary database(s) and better utilize database read replicas?</li><li>How can we improve our cache consistency?</li></ol><p>In this post we’ll provide a primer on the common industry-wide solutions we’ve previously employed along with discussing their caveats and pitfalls. In subsequent posts we’ll dive into the technical details of the components of our solution and how they fit together.</p><h4>Table of Contents</h4><ol><li>Background primer (this post)</li><li><a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-2-513922e4b4b1">Decreasing database load with dynamic routing</a></li><li><a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-3-e9b4dd8a9393">Appropriately serializing data for caching</a></li><li><a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-4-c9d3d3edcd34">Keeping the cache consistent</a></li><li><a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-5-5221da60f374">A time-bounded, eventually-consistent cache</a></li></ol><h3>Background</h3><p>Nextdoor’s backend, built using the Python-based Django web framework, powers the core product experience for neighbors, government agencies, and local businesses. 
The power of Django and similar frameworks (Rails, Spring, etc.) is that they allow development teams to focus on implementing business logic rather than getting caught up in details like learning and writing SQL.</p><p>The Object-Relational Mappers (ORMs) included in these frameworks provide a lever that allows developers to define data models and the relationships between them in the application’s language without ever needing to worry about SQL.</p><p>As some readers are all too aware, relational data modeling comes at a cost. Without careful data modeling, performant access to relational data largely depends on that data residing on monolithic databases.</p><p>NoSQL or distributed SQL datastores are often advertised as solutions to the scalability challenges of relational databases like PostgreSQL. However, many companies face significant obstacles in transitioning to these modern datastores. Their relational data models are deeply entrenched, and much of their business logic relies heavily on the power of relational data.</p><p>This was the position we, the Core-Services team at Nextdoor, found ourselves in at the end of 2023. We had explored changing our data model to one that could properly leverage distributed SQL, but we ran into existential challenges supporting existing relational queries, specifically queries with multiple JOINs, that were scattered across our codebase. Our small team simply could not rewrite business logic to avoid these queries, and we couldn’t stop all product development to have other teams do this either.</p><h3>Common solutions</h3><p>Before diving into our most recent solution, it’s worthwhile to discuss how we got to this point. 
Many platform teams often begin with a basic architecture of the application, via an ORM, talking directly to a relational database like PostgreSQL.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/694/1*rwrOyk_w9R7gg90yzfn5MQ.png" /></figure><h3>Caching</h3><p>A ubiquitous pattern that arises in the backend infrastructure lifecycle is adding a cache to take load off of the database. A popular solution, one that Nextdoor employed, was adding a look-aside cache powered by Redis.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/684/0*iQ7kvT5Cg7WHhqJL" /></figure><p>When using a cache, a few common points of consideration arise:</p><ul><li>You likely want to maintain the flexibility the ORM provides with the benefits of caching. Due to the ORM’s flexibility in querying complex relations, not all queries will be cache-able. What then?</li><li>If performing a read with the intention of doing a write, engineers probably don’t want stale data from the cache. An escape hatch must be built in.</li><li>How do product engineers interact with the cache and database?</li><li>What is actually stored in the cache and how is it formatted? Will the application still be able to read what’s in the cache if the data model / schema changes?</li><li>If the data in the database changes, how does the cache get updated?</li></ul><p>To be clear, not all of these points need to be addressed when initially deploying a cache but they do creep up over time. Like many companies, Nextdoor solved these problems as they started to appear on the horizon. 
Later parts of this blog series cover components that solve many of these problems, so stay tuned!</p><h3>Read Replicas</h3><p>Another common infrastructure improvement is adding database read replicas to serve queries that can’t be answered by the cache.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/708/0*u_2zAmXUwfj27C_d" /></figure><p>This approach helps address database load on the primary but does not entirely solve the problem. Common issues that arise:</p><ul><li>Engineers may not be able to reason about the various consistency pitfalls that come with this architecture, nor should they if their goal is to simply build product features. If an option is provided, they will choose the perceived safest one: sending their queries to the primary database.</li><li>Business logic will often need to perform some reads followed by a write. This will typically be wrapped in a transaction which will mean this must run on the primary database.</li><li>In order to avoid consistency problems, populating the cache on a cache-miss will likely need to be done by querying the primary database. However, even if the primary database is queried, it does not guarantee that what gets written into the cache is the most up to date version of the data.</li></ul><h3>Data Partitioning</h3><p>Another common solution which teams will employ is splitting up their data across multiple physical databases. This strategy has several different approaches such as splitting by tenancy or breaking up foreign-key relations. Due to our existing data model and access patterns we chose the latter option of severing foreign-key relations such that a set of tables could be moved into their own physical database. 
While our ORM handled routing to the appropriate database depending on what was being accessed, it did require us to carefully rewrite some business logic.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/695/0*5-Totz_scg2RTLJ7" /></figure><p>This comes with two key caveats:</p><ul><li>This strategy is valid and may be the only option, but it only extends the runway and does not solve the fundamental problem that a physical database can only be so big.</li><li>Transactions which exist in business logic may now be subtly broken in that they no longer provide atomicity when dealing with successive writes to different databases.</li></ul><p>Teams like ours may eventually find themselves back at square one, where the primary database(s) are still a bottleneck and a single point of failure despite the efforts to mitigate these problems. Through thoughtful data modeling and incremental work, the problem of database scalability doesn’t have to be immediately existential to businesses. Despite all of the caveats, the above strategies will add years of runway before the problem is well and truly existential.</p><h3>Next Evolution</h3><p>When reevaluating our architecture and system behaviors, we aimed to alleviate the database load issue while also addressing complementary problems discussed in the caveats listed above. 
Specifically we wanted to:</p><ol><li>Automatically route database queries to read replicas whenever possible.</li><li>Perform cache filling from the database read replicas.</li><li>Provide guaranteed eventual consistency of the cache in a timely manner.</li><li>Allow the application to be able to use what was in the cache even if a developer added a new field/column to a data model.</li></ol><p>Check out <a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-2-513922e4b4b1">the next post in this series</a> on how we employed dynamic routing to shift load from our primary database to the read replicas.</p><p>Authors: <a href="https://medium.com/u/d8db977cf129">Slava Markeyev</a>, <a href="https://medium.com/u/e734698562ba">Tushar Singla</a>, and <a href="https://medium.com/u/3d01db79be59">Ronak Shah</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=234d0cf67665" width="1" height="1" alt=""><hr><p><a href="https://engblog.nextdoor.com/scaling-nextdoors-datastores-part-1-234d0cf67665">Scaling Nextdoor’s Datastores: Part 1</a> was originally published in <a href="https://engblog.nextdoor.com">Nextdoor Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Let AI Entertain You: Increasing User Engagement with Generative AI and Rejection Sampling]]></title>
            <link>https://engblog.nextdoor.com/let-ai-entertain-you-increasing-user-engagement-with-generative-ai-and-rejection-sampling-50a402264f56?source=rss----5e54f11cdfdf---4</link>
            <guid isPermaLink="false">https://medium.com/p/50a402264f56</guid>
            <category><![CDATA[notifications]]></category>
            <category><![CDATA[generative-ai]]></category>
            <category><![CDATA[a-b-testing]]></category>
            <category><![CDATA[model-training]]></category>
            <category><![CDATA[chatgpt]]></category>
            <dc:creator><![CDATA[Jaewon Yang]]></dc:creator>
            <pubDate>Mon, 16 Oct 2023 17:03:52 GMT</pubDate>
            <atom:updated>2023-10-16T17:03:52.706Z</atom:updated>
            <content:encoded><![CDATA[<p>Generative AI (Gen AI) has demonstrated proficiency in content generation but does not consistently guarantee user engagement, mainly for two reasons. First, Gen AI generates content without considering user engagement feedback. While the content may be informative and well-written, it does not always translate to increased user engagement such as clicks. Second, Gen AI-produced content often remains generic and may not always provide the specific information that users seek.</p><p>Nextdoor is the neighborhood network where neighbors, businesses, and public agencies connect with each other. Nextdoor is building innovative solutions to enhance the user engagement with AI-Generated Content (AIGC). This post outlines our approach to improving user engagement through user feedback, specifically focusing on Notification email subject lines. Our solutions employ Rejection sampling [1], a technique used in reinforcement learning, to boost the engagement metrics. We believe our work presents a general framework to drive user engagement with AIGC, particularly when off-the-shelf Generative AI falls short in producing engaging content. To the best of our knowledge, this marks an early milestone in the industry’s successful use of AIGC to enhance user engagement.</p><h3>Introduction</h3><p>At Nextdoor, one of the ways to drive user growth and engagement on platform is through emails. One of the emails we have is called New and Trending <a href="https://engblog.nextdoor.com/nextdoor-notifications-how-we-use-ml-to-keep-neighbors-informed-57d8f707aab0">notifications</a>, where we send a single post that we think the user might be interested in and want to engage with. As part of sending an email, we need to determine a subject line of the email for the email audiences. Historically, we simply pick the first few words of the post being sent to be the subject line. 
However, in certain posts, these initial words are often greetings or introductory remarks and may not provide valuable information to the user. In the example image below, we observe a simple greeting, “Hello!”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/707/0*1bUjaLCj0AOjUIe0" /><figcaption>Figure 1. New and Trending email where we show a single post. Prior to the Gen AI systems we built, we used the first words of the post as the subject line (Life and Mother Nature always find a way!)</figcaption></figure><p>In this work, we aim to use Generative AI technologies to improve the subject line. With Generative AI, we aim to generate informative and interesting subject lines that will lead to more email opens, clicks, and eventually more sessions.</p><p>Writing a good subject line with Generative AI is challenging because the subject line needs to satisfy the following criteria. First and foremost, the subject line needs to be engaging so that users want to open the email. To see if the ChatGPT API can write engaging subject lines, we tried generating subject lines with the ChatGPT API in a small-traffic A/B test, and found that users were less likely to click on emails if we used subject lines made by the ChatGPT API (e.g. Table 1). As we show later, we tried to improve the prompts (prompt engineering), but the results were still inferior to the user-generated subjects. This finding implies that Generative AI models are not trained to write content that is particularly engaging to our users, and we need to guide Generative AI models to increase user engagement.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/991/1*L4Uumzj8svXajgd07AUOtQ.png" /><figcaption>Table 1. Subject line made by ChatGPT API and its CTR. 
ChatGPT API’s subject line is more informative but looks like a marketing phrase, and produced only 56% of the clicks that the user-generated subject line did.</figcaption></figure><p>The second challenge is that the subject line needs to be authentic. If it reads like a marketing phrase, the email will look like spam. The example in Table 1, “Support backyard chickens in Papillion, NE!”, shows this issue.</p><p>Third, the subject line should not contain hallucinations (a response that is nonsensical or not accurate). It is well known that Generative AI is vulnerable to hallucinations [2]. For example, given a very short post saying “Sun bathing ☀️”, ChatGPT API in Table 1 generated the subject line “Soak Up the Sun: Tips for Relaxing Sun Bathing Sessions”, which had nothing to do with the post content.</p><p>We developed a novel Generative AI method to overcome the three challenges faced by the ChatGPT API mentioned above. We made three contributions:</p><ul><li><strong>Prompt engineering to generate authentic subject lines with no hallucination:</strong> Given a post, ChatGPT API creates a subject line by extracting the most interesting phrases of the post without any rewriting. By extracting the user’s original writing, we are able to prevent marketing phrases and hallucinations.</li><li><strong>Rejection sampling with a reward model: </strong>To find the most interesting subject line, we develop a reward model whose job is to predict whether users would prefer a given subject line over other subject lines. After ChatGPT API writes a subject line, we evaluate it with the reward model and accept it only if its reward model score is higher than the user-written subject line’s score. 
This technique is called Rejection Sampling and was recently introduced to Reinforcement Learning for Large Language Model training [1].</li><li><strong>Cost optimization and model accuracy maintenance</strong>: We added engineering components to minimize the serving cost and stabilize the model performance. By using caching, we reduced our cost to 1/600 of the brute-force approach. Through daily performance monitoring, we can detect when the reward model fails to predict which subject is more engaging due to external factors such as user preference drift, and address it by retraining.</li></ul><p>We believe that this framework is generally applicable when off-the-shelf Generative AI fails to improve user engagement. We also analyzed the importance of each component in our design. Even with the aforementioned prompt engineering, ChatGPT API did not necessarily produce more engaging content. This highlights the necessity of the rejection sampling component: in such cases, we can develop another AI model as a reward model and use the Generative AI’s output only if the reward model approves [1].</p><h3>Proposed Method</h3><p>For every post, we employ the following system to create a subject line. It’s important to mention that we generate a single subject line for each post, without personalization. This decision was made to minimize computational cost. Exploring cost-effective methods for implementing personalized subject lines is an interesting direction for future work.</p><h4>Model Overview</h4><p>Figure 2 illustrates our approach. We develop two different AI models.</p><ul><li>Subject line generator: This model generates a subject line given the post content.</li><li>Reward model (Evaluator): Given a subject line and the post content, this model predicts whether the given subject line would be a better subject line than the user-generated subject line.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/893/1*ejkWUk9i9BBrI74iU7g1XA.png" /><figcaption>Figure 2. 
Overview of our approach.</figcaption></figure><p>Given a post, the Subject line generator produces the subjects shown in Figure 2 (green boxes). The reward model compares the OpenAI API subject line (green) with the user-generated subject line (red), and selects the more engaging one. For the top post, the OpenAI API subject line contains more relevant information and is selected. For the bottom post, which was about a health alert, the reward model selects the user-generated subject. While the OpenAI API subject line shows the main content of the alert, the reward model picks the user-generated subject because it shows the importance of the post and thus is more engaging.</p><h4>Developing Subject Line Generator</h4><p>We use OpenAI API without fine-tuning. In the prompt, we require that OpenAI API extract the most interesting part of the post without making any change. This way of extracting user content provides multiple benefits: First, it removes hallucinations. Second, it keeps the subject line authentic, as OpenAI API does not rewrite the original content. To test the prompt engineering, we A/B tested generator outputs without reward models. We found that asking OpenAI API to extract in the prompt yields a 3% relative improvement in Sessions compared to asking OpenAI API to write the subject line from scratch (see the Results section for details).</p><h4>Developing Reward Model</h4><p>We fine-tune OpenAI API to develop a reward model. This is the main innovation we applied on top.</p><p><strong>Training data collection: </strong>The challenge is to collect training data on which subject line was more engaging. Manual annotation is not possible because there are no rules deciding which subject line is more engaging. We found that the subject lines that we thought to be more engaging than the user-generated ones turned out to be less engaging (Table 2).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/956/1*WgqJAkjrCROtF0SfXCz4qA.png" /><figcaption>Table 2. 
Emails with a user-generated subject (left) generated 3x as many clicks as the emails with OpenAI API-generated subjects on the right.</figcaption></figure><p>To tackle this issue, we collect training data via experimentation. For each post, we generate subject lines in two ways. One way is to use user-generated ones and the other is to use the OpenAI API generator described above. Then we serve each subject line to a randomly selected 2–3% of users (~20k). The goal is to learn which subject line was more engaging through click data.</p><p><strong>Model training:</strong> We fine-tuned a model through the OpenAI API with the labels we collected. We used ~50k examples; in 40% of them the OpenAI API subject was the winning subject, and in the rest the user subject was the winner. Given a subject line and post content, our model is fine-tuned to predict if the subject line would generate more engagement (clicks) than the user-generated subject line. The model is asked to predict if the subject line is more engaging and output “Yes” or “No”.</p><p><strong>Training details:</strong> We used the smallest OpenAI API model “ada” for fine-tuning. We found that larger models did not improve the predictive performance despite higher cost. We added <a href="https://help.openai.com/en/articles/5247780-using-logit-bias-to-define-token-probability">a logit bias</a> of 100 for “Yes” and “No”. These biases boost the probability that the model outputs “Yes” or “No”. We experimented with the number of epochs and selected the model with 4 epochs, but we did not see much difference in offline performance after 2–3 epochs.</p><p><strong>Engineering details:</strong> We added the following components for optimization and safeguarding.</p><ul><li><strong>Caching: </strong>For each post, we cache the outputs of our model. By processing each post only once, we reduced the cost to 1/600. 
In other words, each post gets sent 600 times on average and we process the post only once instead of 600 times. Caching also reduces OpenAI API usage (the number of tokens and the number of requests).</li><li><strong>Reward model performance maintenance</strong>: We monitor the reward model’s predictive performance daily, using the next day’s user clicks after the training phase as the ground truth to compare with the model’s output. The model’s predictive performance can change because our users’ preferences may change and the content on Nextdoor can shift in writing style or topic.<br>For monitoring purposes, we collect the engagement performance of different subject lines in the following way. We created a “control” user bucket where we always send emails with the user-generated subject and an “always OpenAI API” bucket where we always send emails with the OpenAI API subject, regardless of the reward model’s output. From these two buckets, we know the ground truth on which subject line was more engaging, and we use it to measure the reward model’s accuracy. If the accuracy drops by 10% or more, we retrain the reward model with new data.</li><li><strong>Retries with Fallback: </strong>Since the OpenAI API may return an error due to the rate limit or transient issues, we added retries with <a href="https://tenacity.readthedocs.io/en/latest/">exponential backoffs with Tenacity</a>. If we fail after a certain number of retries, we fall back to the user-generated subject.</li><li><strong>Controlling the length of output: </strong>We found that the Subject line generator would write a subject line longer than our desired length (10 words). This happened even if we specified the 10-word limit in the instruction and added examples. We post-processed the generator output by keeping only its first 10 words. 
We A/B tested different word limits and found that 10 is the optimal value.</li></ul><h3>Results</h3><p>We did A/B tests with different versions of the subject line generator, and with and without the reward model. For the generator, we tested the following options</p><ul><li>Writing with OpenAI API: We ask OpenAI API to “write an engaging subject line for a given post”. This was the first version we tested without much prompt engineering.</li><li>Extracting with OpenAI API: We ask OpenAI API to extract the most interesting part and provide 5 examples. We also add requirements in a numbered list such as “Do not insert or remove any word.”, “Do not change capitalization”, “If the first 10 words are interesting, use them as a subject line”. We tried 4 different versions of prompts and picked the best version by A/B test metrics.</li></ul><p>For the A/B test metrics, we primarily focus on Sessions. A session is an activity sequence made by the same user within a certain timeframe, and sessions quantify the number of unique user visits.</p><p>Table 3 shows the results on Session lift compared to the “control” bucket where we use user-generated subject lines. In addition to the session metrics, our final model (last row) increased Weekly Active Users by 0.4% and Ads revenue by 1%.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/955/1*60AFxqFwmZ39d8g2h5NQUw.png" /><figcaption>Table 3. Session lift compared to the user-generated subject lines from A/B tests. The final model (last row) achieved 1% lift in sessions.</figcaption></figure><p>Here is what we learned from A/B tests:</p><ul><li>Prompt engineering improves the performance but has a ceiling. After a few iterations, the A/B test metrics showed only marginal improvements, failing to beat the control.</li><li>Finding the “optimal” prompt is an elusive task, as the space of potential prompts is boundless, making it difficult to explore. 
Moreover, there is no established algorithmic or systematic method for enhancing prompts. Instead, the task relies on human judgment and intuition to update the prompt.</li><li>The reward model was the key factor in improving sessions.</li><li>Predicting popular content is challenging, as is the reward model’s task of forecasting popular subject lines, where it currently achieves about 65% accuracy. Enhancing the reward model’s performance by leveraging real-time signals, like the current engagement numbers for the subject, is an interesting direction for future work.</li></ul><h3>Conclusions</h3><p>We developed a novel Generative AI system to increase user engagement by combining the reward model and prompt engineering. Our system has engineering components for cost saving and monitoring. A/B tests showed that our system can deliver more engaging subject lines than the user-generated subject lines.</p><p>There are many avenues for future work. First is to fine-tune the subject line generator. In this work, we used the vanilla ChatGPT API as the generator. Instead, we can fine-tune OpenAI API with the most engaging titles that the reward model identifies. For each post, we generate multiple subject lines and use the reward model to pick the winner. Then we use the winning subject to fine-tune the subject line generator. This approach is called Reinforcement Learning by Rejection Sampling [1].</p><p>Second is to rescore the same post daily. Currently, we pick the best subject line with a reward model once and never rescore. However, as time goes on, we may be able to see whether the OpenAI API subject line or the user-generated subject line is getting more engagement, so that our reward model can predict more accurately. 
Third is to add personalization without significantly escalating computational costs.</p><h3>Acknowledgments</h3><p>The post was written by <a href="https://www.linkedin.com/in/jaewonyang/">Jaewon Yang</a> and <a href="https://www.linkedin.com/in/qi-he/">Qi He</a>.</p><p>This work was led by the Generative AI team with cross-org collaboration between Notification team and ML teams. We would like to give a shout out to all the contributors:</p><p><a href="https://www.linkedin.com/in/joyzengjy/">Jingying Zeng</a>, <a href="https://www.linkedin.com/in/malikwaleed/">Waleed Malik</a>, <a href="https://www.linkedin.com/in/xiaoyan2/">Xiao Yan</a>, <a href="https://www.linkedin.com/in/hao-ming-fu/">Hao-Ming Fu</a>, <a href="https://www.linkedin.com/in/carolyn-tran-4904759b/">Carolyn Tran</a>, <a href="https://www.linkedin.com/in/ssuresh2/">Sameer Suresh</a>, <a href="https://www.linkedin.com/in/annabgoncharova/">Anna Goncharova</a>, <a href="https://www.linkedin.com/in/richardhuang11/">Richard Huang</a>, <a href="https://www.linkedin.com/in/jaewonyang/">Jaewon Yang</a>, <a href="https://www.linkedin.com/in/qi-he/">Qi He</a></p><p>Please reach out to us if you are interested to learn more — we are hiring!</p><h3>References</h3><p>[1] Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models, Arxiv preprint, 2023</p><p>[2] Ji et al. 
Survey of Hallucination in Natural Language Generation, ACM Computing Surveys, 2022</p><hr><p><a href="https://engblog.nextdoor.com/let-ai-entertain-you-increasing-user-engagement-with-generative-ai-and-rejection-sampling-50a402264f56">Let AI Entertain You: Increasing User Engagement with Generative AI and Rejection Sampling</a> was originally published in <a href="https://engblog.nextdoor.com">Nextdoor Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Pre-trained to Fine-tuned: Nextdoor’s Path to Effective Embedding Applications]]></title>
            <link>https://engblog.nextdoor.com/from-pre-trained-to-fine-tuned-nextdoors-path-to-effective-embedding-applications-3a13b56d91aa?source=rss----5e54f11cdfdf---4</link>
            <guid isPermaLink="false">https://medium.com/p/3a13b56d91aa</guid>
            <category><![CDATA[recommendation-system]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <category><![CDATA[embedding]]></category>
            <dc:creator><![CDATA[Karthik Jayasurya]]></dc:creator>
            <pubDate>Thu, 07 Sep 2023 11:31:32 GMT</pubDate>
            <atom:updated>2023-09-07T17:13:28.959Z</atom:updated>
            <content:encoded><![CDATA[<h3>Background</h3><p>The majority of ML models at Nextdoor are typically driven by a large number of features that are primarily either continuous or discrete in nature. The personalized features usually stem from historical aggregations or real-time summarization of interaction features, typically captured through logged tracking events. However, representing content through deep understanding using information behind it (text/image) is crucial for modeling nuanced user signals and better personalizing complex user behavior across many of our products. In the rapidly evolving field of NLP, utilizing transformer models to perform representation learning effectively and efficiently has become increasingly important for user understanding and improving their product experience.</p><p>Towards that, we have built a lot of entity embedding models spanning entities such as posts, comments, users, search queries &amp; classifieds. We first leveraged deep understanding of content and used that to derive embeddings for meta entities like users based on their past interacted content. These powerful representations are found to be very crucial towards extracting meaningful features for some of the biggest ML ranking systems at Nextdoor such as notifications scoring and feed ranking. By making them readily available and building to scale, we can drive adoption of state-of-the-art reliably and put them in the hands of ML Engineers for rapidly building performant models across the company.</p><p>This blog primarily focuses on how we iterated on the development of embedding models, how they are featurized and served at large scale into various product applications as well as some of the challenges encountered during this process. We summarize the evolution of work across three sections. In section 1, the focus is to leverage state-of-the-art pre-trained models to rapidly evaluate the value of embeddings models as feature extractors. 
Section 2 describes how to fine-tune embeddings using unlabeled data for certain products, whereas Section 3 demonstrates the use of labeled data to fine-tune embeddings for better task prediction. This work is driven by the Knowledge Graph Team at Nextdoor, a horizontal team that works in close collaboration with product ML teams as well as the ML Platform team, which owns the ML training and serving platform and the FeatureStore service powering ML models at Nextdoor.</p><h3>1. Leveraging Pre-trained models</h3><p>The first generation of embeddings is built from pre-trained language models using the Sentence-BERT paradigm (<a href="https://www.sbert.net/">https://www.sbert.net/</a>). SBERT is well known to produce better embedding representations than the original BERT models [1]. The main goal here is to rapidly experiment with embeddings as features and realize their value in the product as quickly as possible. The text of content entities, namely Nextdoor posts &amp; comments, is extracted from the post’s subject and body and from the comment text, respectively, and is then fed into a multilingual text embedding model to derive the respective entity embeddings for all countries Nextdoor operates in. For a given user, the embeddings of the posts they have interacted with are aggregated with weights based on interaction type to form the user (interaction) embedding. For example, an active interaction such as post creation, comment, or click has a higher weight than a more passive interaction like an impression. These signals are aggregated across both online (feed) and offline (emails) product surfaces to represent the user embedding holistically and are updated daily for all users on the platform.</p><p>These features were found to be among the most important features for multiple ranking models and delivered significant performance lifts in key product OKR metrics across both notifications and feed when shipped in early 2022. 
The pre-trained models also served as a good proof of concept for building out reliable feature ingestion pipelines and monitoring systems that identify potential feature drifts and disruptions. This helped form a robust playbook for deploying several next-generation embedding features.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ov2I6idhPRXzy4im" /></figure><h3>2. Fine-tuned embeddings from unlabeled data</h3><p>The next generation of embeddings comes from custom models that improve on the pre-trained versions through fine-tuning. The signals used to generate the earlier embeddings come from user interactions across notifications and home feed products, either directly or indirectly. In contrast, this section details a use case that uses unlabeled data to perform representation learning and improve the user search experience.</p><p>Our neighbors use Nextdoor search to find useful local information by expressing intent explicitly. We tried to capture both long- and short-term intent to determine and serve users’ perennial needs (e.g., home maintenance) as well as their ephemeral needs (e.g., lost &amp; found). Search queries, while high-intent in nature, are inherently short and noisy. A searcher might try multiple variations of a query successively in order to get their intent fulfilled as much as possible. Additionally, because local search has limited liquidity, relying on labeled feedback from search results may not fully capture user intent.</p><p>To fully capture user intent signals, we rely on a self-supervised training strategy to learn fine-tuned representations for any given query. Specifically, we first built an SBERT-backed query embedding model that learns to embed search queries in a lower-dimensional space. Then, we aggregate embeddings from user queries across different time windows (weekly/monthly/quarterly periods) to generate multiple user (intent) embeddings. 
The same model also extracts the intent of a post to generate the corresponding post embeddings. The resulting user, post and query embeddings are transformed and featurized as described in a later section to improve the performance of the ranking models.</p><p>The query embedding model was originally built to drive contextual query expansion in the Nextdoor search pipeline [2]. This sentence transformer model is trained on historical search queries in order to best learn query representations. We first collected search logs that consisted of sequences of search queries within a session across all searchers over a period of time. Then, the queries were pre-processed using traditional NLP methods like lemmatization, spell checking, and deduplication to form a clean corpus of tokens, which is composed of n-grams (n=1,2,3) and whole queries. To generate a training dataset, we created positive pairs of tokens occurring within a user search session and negative pairs randomly occurring across sessions. Contrastive learning with cosine similarity loss is used to train the underlying model.</p><p>For the query expansion use case, this model drove better contextual search results by identifying related candidates, which improved recall. This not only improved key search metrics across content search and product search in For Sale &amp; Free but also significantly reduced the rate of null queries compared to prior word embedding models. We also leveraged <strong>HNSWlib</strong> [3], an approximate nearest neighbors library, to implement this deep-learning-based query expansion, improving expansion latencies by more than 10x. For notification &amp; feed use cases, intent features generated from transformations of post &amp; user embeddings helped achieve significant positive impact on our top-line engagement metrics. 
Although features can only be computed for searchers and are of low coverage overall, this explicit signal is found to be very useful in improving the overall search experience.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*_JHrrjQ1aOcMHC82" /></figure><h3>3. Fine-tuned embeddings from labeled feedback</h3><p>In the next evolution of embeddings, we additionally leverage user feedback to fine-tune models further. The pre-trained entity embeddings have served us well for over a year, but they come from off-the-shelf models trained using public benchmark datasets. As such, their semantics are quite different in nature from the Nextdoor domain. Moreover, their high but fixed model dimensionality contributes to significant storage and serving costs, especially when user embeddings are updated for all Nextdoor neighbors daily. To address these issues, we built a two-tower framework to fine-tune embeddings with user feedback collected across Nextdoor surfaces while reducing dimensionality, customizing to our domain, and being cost effective.</p><p>The fine-tuned models are developed and trained in phases, incrementally adding complexity. In the first phase, the inputs to the post and user towers are pre-trained embeddings, which are then transformed using multiple fully connected (FC) layers, reducing dimensionality at each step. Standard cross-entropy is used as the loss function for predicting notification clicks for a given user and post. To generate a training dataset, we sampled from random explore logs to reduce selection bias, the same process as that of the downstream ranking model. Once the model is fully trained, the last layer generates fine-tuned user and post representations.</p><p>These PyTorch models are trained on millions of records using SageMaker GPU instances with varying hyperparameters, and the model with the best offline performance is chosen to generate &amp; store fine-tuned embeddings in FeatureStore. 
The playbook described earlier is followed to build and monitor offline and online feature pipelines. Serving these cached features to downstream models has shown promising lifts in all engagement metrics (CTR/sessions/contributions/DAU/WAU) while keeping neutral the guardrail metrics that measure harmful/hurtful content distribution across the platform.</p><p>In the next phase, we fed the post tower directly with text extracted from the post entity, allowing us to fine-tune parameters of the SBERT model. The test AUC score is used as a benchmark to determine how many layers and transformer blocks to unfreeze when trying out different training schemes, along with optimization of the hyperparameters of typical DNN models. The best model also improved the user-post cosine similarity of fine-tuned embeddings by up to 16% when compared to the respective pre-trained versions, an additional evaluation criterion for the intrinsic quality of the representations. It is also noteworthy that this quality improvement is achieved while reducing dimensionality by more than 10x!</p><p>In the most recent phase, we extended into a multi-task learning (MTL) setup, modeling both notification clicks and feed actions to jointly optimize learning of fine-tuned embeddings. Again, these objectives mimic downstream rankers exactly to make sure the learned embeddings directly optimize downstream tasks. MTL models have the added advantage of learning a single model across multiple product surfaces, thereby reducing operational burden and maintenance costs, while leveraging knowledge transfer across shared tasks for better representations. The feed and notifications surfaces are highly related, as clicking on an email notification lands directly in the pinned view of the post in the newsfeed. 
Additionally, most actions on the home feed are used as features in the notification ranker, making these tasks closely related.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mMj49F_njQR6-4pA" /></figure><h3>Using embeddings in ML models</h3><p>As most of our downstream production models are tree-based models, they don’t directly integrate with vector features like embeddings the way deep neural networks do. Therefore, we primarily use embedding models as feature extractors for downstream models. Specifically, we rely on transformations like cosine similarity &amp; dot products across these entity embeddings in order to generate meaningful affinity features. While the transition to neural network systems is currently underway, these vector transformations provide a neat way to integrate embedding-based features into existing models and enable rapid experimentation for assessing the performance lift of new deep features.</p><p>We first create a schema and declare feature groups corresponding to each embedding to host within our in-house FeatureStore. Then, content-based embedding features are ingested into our FeatureStore in near real-time using task worker jobs as the content gets created/updated. For users, daily scheduled jobs in Airflow compute embedding aggregations based on pre-specified lookback windows, weighted across various interaction types, and the results are batch ingested into FeatureStore. Once systems are set up to ingest all relevant embeddings with appropriate TTL, we then write logging code to compute and log the derived features such as cosine similarity and dot product between user &amp; post and between user &amp; user. Specifically, in feed, these features represent affinities between post and viewer and between viewer and author, and analogously, in notifications, between post and recipient and between sender and recipient. Similarly, we also compute affinities across comment entities to inform activity-based ranking in the newsfeed. 
The data obtained from feature logging is used to train downstream ML ranking models, to avoid online-offline skew, and the model with best offline performance lift with new features is promoted for online AB test evaluation and ramping further towards majority member experience.</p><h3>Challenges &amp; Future</h3><p>Multiple entity embeddings i.e user, post, comment, query etc have been successfully integrated into various product surfaces at Nextdoor at large scale. In the past, models based on comment embeddings helped foster and cultivate kinder conversations to improve platform vitality metrics [4]. More recently, contextual topic embeddings are also developed using BERTopic [5] to achieve coarser level personalization of content to neighbors, while informing us about prevalence of content categories and types across the platform. We are also experimenting with image embeddings using CLIP [6] to leverage image/video information behind content.</p><p>In addition, as an extension to labeled fine-tuning, we plan to further improve representations along two dimensions. One is by concatenating representations with additional features such as image embeddings and existing interaction features to leverage multimodal and dense signals. The other is to extend tasks to other surfaces such as ads, For Sale &amp; Free (marketplace) etc to make representations more holistic across products. Once downstream models are fully modernized to DNN-based methods, the embeddings can be integrated into the model directly without losing any information from computation of transformations.</p><p>As we build more and more embeddings capturing different signals, we also need to be mindful of additional incurred costs from new features. 
Some of the initial challenges of serving high-dimensional vectors during inference are mitigated by performing embedding transformations directly within FeatureStore rather than passing embeddings across microservices, which minimizes network bandwidth and scaling costs. This worked well with tree-based models; however, in the future, serving embeddings directly to DNN models could increase costs. Caching and serving fine-tuned embeddings can help control dimensionality while incorporating domain-specific knowledge. This allowed us to rapidly experiment and quickly evaluate ROI at a smaller scale, justifying overall costs. From an infra standpoint, we found that optimizing the payload format of embedding features as well as the sequencing of calls to efficiently read/write from FeatureStore greatly reduces overall costs at full scale.</p><h3>Acknowledgments</h3><p>This work would not have been possible without close cooperation and collaboration with various ML product partners (Notifications/Feed/Search/Vitality) as well as significant support for the ML platform and FeatureStore service from the ML Platform team. I would like to take this opportunity to give a huge shoutout to all the dedicated Nextdoor folks from these teams behind this endeavor.</p><p>Nextdoor is building the largest Local Knowledge Graph (LKG) in the world. The local knowledge graph inherent in our neighborhoods is Nextdoor’s unique proprietary data that can be used to enable personalized neighborhood and neighbor experiences. The Knowledge Graph team is focused on understanding neighbors and content by creating standardized neighbor/content data using state-of-the-art ML methods.</p><p>Third-party large language models (LLMs) such as GPT and the corresponding dialogue applications like ChatGPT, which are built upon these language models, lack access to the specific local knowledge of Nextdoor. As a result, they are unable to offer location-based services to our users as we desire. 
It is crucial for us to develop in-house custom LLMs that leverage our unique local knowledge graphs. We are building our own large language models (LLMs) that are built on top of Nextdoor’s raw content and the structured knowledge graph to power multiple products.</p><p>Please reach out to us if you are interested in learning more. We are hiring!</p><h3>References</h3><ol><li>Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks</li><li><a href="https://engblog.nextdoor.com/modernizing-our-search-stack-6a56ab87db4e">https://engblog.nextdoor.com/modernizing-our-search-stack-6a56ab87db4e</a></li><li><a href="https://github.com/nmslib/hnswlib">https://github.com/nmslib/hnswlib</a></li><li><a href="https://engblog.nextdoor.com/using-predictive-technology-to-foster-constructive-conversations-4af437942bd4">https://engblog.nextdoor.com/using-predictive-technology-to-foster-constructive-conversations-4af437942bd4</a></li><li><a href="https://maartengr.github.io/BERTopic/api/bertopic.html">https://maartengr.github.io/BERTopic/api/bertopic.html</a></li><li><a href="https://openai.com/research/clip">https://openai.com/research/clip</a></li></ol><hr><p><a href="https://engblog.nextdoor.com/from-pre-trained-to-fine-tuned-nextdoors-path-to-effective-embedding-applications-3a13b56d91aa">From Pre-trained to Fine-tuned: Nextdoor’s Path to Effective Embedding Applications</a> was originally published in <a href="https://engblog.nextdoor.com">Nextdoor Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Securing Diversity in Cybersecurity]]></title>
            <link>https://engblog.nextdoor.com/securing-diversity-in-cybersecurity-6aa83dafb850?source=rss----5e54f11cdfdf---4</link>
            <guid isPermaLink="false">https://medium.com/p/6aa83dafb850</guid>
            <category><![CDATA[security]]></category>
            <category><![CDATA[engienering]]></category>
            <category><![CDATA[culture]]></category>
            <category><![CDATA[diversity]]></category>
            <dc:creator><![CDATA[Kristen Beneduce]]></dc:creator>
            <pubDate>Tue, 02 May 2023 13:01:51 GMT</pubDate>
            <atom:updated>2024-04-17T19:10:22.286Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*yX5cAwZUEG6w9Om0o4oiUA.jpeg" /><figcaption>Panelists from Left to Right: Ronit Polak (Moderator), Kathy Wang* , Lea Kissner, Rupa Parameswaran, Olivia Rose, Jameeka Green Aaron <em>*Correction: Kathy Wang is the former, not current CISO of Discord</em></figcaption></figure><p>At Nextdoor we build technology that empowers resilient, safe, and kind neighborhoods all over the world. Securing a product that empowers global communities requires diverse and inclusive teams, reflective of the communities we support.</p><p>Yet hiring and retaining the diverse talent needed to achieve our purpose remains an industry challenge. The gap is particularly evident in the cybersecurity field where <a href="https://cybersecurityventures.com/wp-content/uploads/2022/09/Women-In-Cybersecurity-2022-Report-Final.pdf">25% of the workforce</a> and <a href="https://www.forrester.com/report/ciso-career-paths-3-0/RES178976">16% of CISOs</a> identify as female. According to the <a href="https://www.wicys.org/initiatives/wicys-state-of-inclusion/">WiCyS State of Inclusion report 2023,</a> women cite lack of respect and limited opportunities for growth in cybersecurity as top challenges accompanying lack of representation. We must keep working on it.</p><p>That is why Nextdoor welcomed the chance to celebrate diversity, alongside <a href="https://www.rsaconference.com/usa">RSAC 2023</a>, in Nextdoor HQ’s backyard this week and to partner with our neighborhood <a href="https://www.wicys.org/">Women in Cybersecurity</a> (WiCyS) <a href="https://www.wicyssiliconvalley.org/">Silicon Valley chapter</a>. We are committed to building a diverse and inclusive workplace, and we are proud to work with organizations like WiCyS, who share the same values.</p><p>Nextdoor’s CISO TC Niedzialkowski kicked off with a warm welcome. 
CEO Sarah Friar framed the discussion by sharing how she launched her career by building a network at her first RSA conference as an equity analyst for Security Software at Goldman Sachs. She emphasized that diverse teams bring a variety of perspectives and experiences to the table, which ultimately leads to better problem-solving and innovation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*imv_qXqUqUBEeRHNup98qQ.jpeg" /><figcaption>Left to Right: Tanvi Kolte Tiwari (WiCyS Silicon Valley Events Chair) introducing the panel, Attendees soaking in a fantastic intro by Sarah Friar (Nextdoor CEO), TC Niedzialkowski (Nextdoor CISO) cheering on the panel</figcaption></figure><p>Moderator <a href="https://www.linkedin.com/in/ACoAAAA6so0BbeyEVdTG8Lv7tnfVYxyiy-uiWdU">Ronit Polak</a>, WiCyS Silicon Valley President, and CISOs <a href="https://www.linkedin.com/in/ACoAAAAsCksBsAKP8s638QJljg-PXRLOl6yCUyg">Kathy Wang</a>, <a href="https://www.linkedin.com/in/ACoAAAKRXXkBXG49jOvyf2TPIREhAehbLCy0Scw">Lea Kissner</a>, <a href="https://www.linkedin.com/in/ACoAAACqM7kBoF-d65re_RcUA2-KhO_z6CPV6U0">Rupa Parameswaran</a>, <a href="https://www.linkedin.com/in/ACoAAAAUy9IBnklODwADsw1roSf9mHvHRyMpd1Y">Olivia Rose</a>, and <a href="https://www.linkedin.com/in/ACoAAAFtMXsB2SSQbASNt5_2IXcVVyqdGk-NO-4">Jameeka Green Aaron, CISSP</a> wowed the audience, covering everything from combating today’s top cyber threats, including AI, to imposter syndrome, with incredible authenticity and humor. 
Closing us out, <a href="https://www.linkedin.com/in/ACoAAAFtMXsB2SSQbASNt5_2IXcVVyqdGk-NO-4">Jameeka Green Aaron, CISSP</a> called on WiCyS members to see themselves in the panelists and to thrive, because representation matters!</p><p>Attendees ranged from aspiring cybersecurity professionals to a few celebrity leaders and practitioners from across the cybersecurity industry, government, and academia.</p><p><a href="https://about.nextdoor.com/antiracism/">Learn more</a> about Nextdoor’s initiatives to foster a holistically inclusive platform, and <a href="https://about.nextdoor.com/careers/">visit our careers page</a> to see openings at Nextdoor.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QDKD8KuZZP9zbi9slH-Hig.png" /><figcaption>Attendees Enjoying the Panel and Nextdoor Space</figcaption></figure><hr><p><a href="https://engblog.nextdoor.com/securing-diversity-in-cybersecurity-6aa83dafb850">Securing Diversity in Cybersecurity</a> was originally published in <a href="https://engblog.nextdoor.com">Nextdoor Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Catching Anomalies Early in Mobile App Releases]]></title>
            <link>https://engblog.nextdoor.com/catching-anomalies-early-in-mobile-app-releases-ac95adf9da81?source=rss----5e54f11cdfdf---4</link>
            <guid isPermaLink="false">https://medium.com/p/ac95adf9da81</guid>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[mobile-apps]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[mobile-app-development]]></category>
            <dc:creator><![CDATA[Walt Leung]]></dc:creator>
            <pubDate>Wed, 11 Jan 2023 15:27:14 GMT</pubDate>
            <atom:updated>2023-01-11T15:27:14.517Z</atom:updated>
            <content:encoded><![CDATA[<h4><em>How Nextdoor catches mobile app release anomalies at 1% adoption</em></h4><p>At Nextdoor, our mobile applications on iOS and Android serve content to tens of millions of weekly active users. At this scale, we run a weekly release process for both iOS and Android, shipping hundreds of changes across multiple teams and dozens of mobile engineers.</p><p>Our team uses several observability processes and rollout strategies to keep these deployments safe and scalable. Most notably, we use phased rollouts to minimize the impact of a potentially bad release. Phased rollouts let us gradually increase the share of users who receive a new app version. For example, we can release a new app version to only 1% of users on the 1st day, 2% of users on the 2nd day, and so on. That way, if a new release accidentally ships with an uncaught regression, having it at 1% rollout means it affects fewer users, which reduces its severity and gives us more time to react.</p><p>However, for many of our critical business metrics, where a failure can be silent, most out-of-the-box observability approaches don’t work with phased rollouts. This is largely due to two problems:</p><ol><li>Observability typically happens at an aggregate level. For example, we look at app sessions or revenue on a daily basis, across all users for a platform.</li><li>The behavior of early adopters on an app version differs from the median behavior of all users. Most importantly, early adopters skew more active, which is almost a prerequisite for landing in an early rollout of a new app version.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*H3TvnakvfRDZVSjj" /></figure><p><em>At Nextdoor, Daily Users are more likely to adopt releases than Weekly Users, Weekly Users more likely than Monthly Users, and so on.</em></p><p>For example, consider an app session regression on a hypothetical iOS version v1.234.5 released March 4. 
If we had unknowingly introduced a regression where we didn’t count an app session 5% of the time, then at a 1% rollout our expected aggregate impact would be roughly 0.05 × 0.01 = 0.0005, or 0.05% of all iOS app sessions, which is practically impossible to distinguish from noise with aggregate-level observability. Even worse, early app adopters skew more active, which means that maybe we should expect 0.06% of all sessions impacted. Or maybe 0.07% of all sessions impacted. In short, it’s hard to tell exactly what our aggregate impact should be.</p><p>However, when iOS release v1.234.5 reaches full rollout in a week, a 5% app session regression would be business critical. We can detect the app session drop once it reaches full rollout by looking at week-over-week or month-over-month metrics, but by that point, several days would have passed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NKh107Qd6XbC9TfR" /></figure><p><em>Stacked graph. Top trendline shows our app sessions, which have a clear regression starting March 7 with a low point at March 14. Bottom trendline shows the release adoption over time due to phased rollouts.</em></p><p><strong>How can we detect these issues on day 1, at 1% rollout?</strong></p><p>A simple approach would be to normalize our business metrics to the total number of users on the release, and turn all metrics into relative metrics (e.g. on v1.234.5, app sessions per active user per app version). Unfortunately, as mentioned earlier, we can’t directly compare the app sessions of users who have adopted a release to those of users who haven’t, as their underlying characteristics are too different.</p><p>What we’re trying to solve for these early adopters is: what is the difference between their actual app sessions after adoption and the app sessions they would hypothetically have had if they had never adopted the release, which is an unobserved counterfactual? 
In statistics, we can estimate this through difference-in-differences analysis.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*y70c1GVEQpu4G5V_" /></figure><p><em>For iOS release v1.234.5, app sessions over time of users who adopted the new app release on March 4 (teal) vs app sessions over time of users who did not adopt (gray).</em></p><p>Difference-in-differences analysis is a simple causal inference method we can apply here to estimate this effect by accounting for the separate time-varying effects of users who have and have not adopted a release:</p><ol><li>For users who adopted the release, calculate the difference in their app sessions between the three days before and the three days after the release. In this case, we observed a change of <strong>-0.02</strong>, a slight decline in app sessions.</li><li>Do the same for users who have not adopted the release. In this case, we observed a change of <strong>+0.20</strong>, an increase in app sessions.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*4KE35DeHtV08YHOM" /></figure><p>Assuming the two cohorts would otherwise have followed parallel trends (the pre-trend assumption), we would have expected app sessions of release adopters to change by <strong>+0.20</strong>, as we observed with non-adopters. However, they instead <em>decreased</em> by <strong>0.02</strong>. We calculate the difference in differences to estimate a comparison against an unobserved counterfactual:</p><blockquote><strong><em>-0.02 - 0.20 = -0.22, a 0.22 decrease in app sessions due to iOS release v1.234.5</em></strong></blockquote><p>In practice, we don’t just calculate this in aggregate. We first make sure that our two cohorts exhibit similar behavior pre-adoption (pre-trend assumption). This is a critical step in difference-in-differences analysis. With a sample size in the hundreds of thousands of users, we can achieve high confidence in similar pre-trend behavior with a simple standard deviation bound over the few days preceding adoption. 
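</p><p>As a minimal sketch of the arithmetic above (not the production pipeline), the difference in differences can be computed from per-cohort averages. The daily values and the <code>did_estimate</code> helper below are hypothetical, chosen only so that the cohort changes match the -0.02 and +0.20 figures quoted above:</p>

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Return (treated change, control change, difference in differences)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change, control_change, treated_change - control_change


# Hypothetical app sessions per user for the three days before and the
# three days after adoption. These numbers are illustrative assumptions.
adopters_pre = [3.00, 3.00, 3.00]
adopters_post = [2.98, 2.98, 2.98]
non_adopters_pre = [2.00, 2.00, 2.00]
non_adopters_post = [2.20, 2.20, 2.20]

treated_change, control_change, effect = did_estimate(
    adopters_pre, adopters_post, non_adopters_pre, non_adopters_post
)

# Round for display, since floating point subtraction is not exact.
print(round(treated_change, 2), round(control_change, 2), round(effect, 2))
```

<p>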
If this behavior holds, we then fit a linear regression model that estimates the average effect of a release for any particular metric:</p><p><strong><em>y = β0 + β1 * Time_Period + β2 * Treated + β3 * (Time_Period * Treated) + e</em></strong></p><p>Here β3, the coefficient on the interaction term, is the difference-in-differences estimate of the release’s effect. In the case of v1.234.5, we can measure statistically significant negative effects across multiple app session metrics.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*iSb6LoW7N78q5X9W" /></figure><p><em>Average % lift of metrics we ran App Release Anomaly Detection on for v1.234.5</em></p><p>With this difference-in-differences approach, we are now able to flag the app sessions decline due to v1.234.5 on March 5th, <strong>10 days earlier</strong> than we normally would have been able to using week-over-week figures. We also reduce the need to factor in external variables such as seasonality or day of week. This not only helps us attribute the decline to a specific app release, it also limits the regression to <strong>less than 1% of iOS users</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Odg2PLNhkrLvAL2F" /></figure><p><em>App Release Anomaly Detection allowed us to discover and fix the release regression at 1% rollout, before our aggregate observability even showed a drop.</em></p><p>App Release Anomaly Detection is one of the many tools we’ve built at Nextdoor to give us observability into our releases while iterating quickly. It is one of the foundational elements that allows us to deploy major app releases on a weekly cadence and have confidence in our stability. Operationalized, App Release Anomaly Detection has helped us prevent nearly all severe client-side regressions and gives us peace of mind to release bigger changes at a more rapid cadence.</p><p>If this type of cross-functional work between platform engineering and data science at scale interests you, we’re hiring! 
Check out our <a href="https://boards.greenhouse.io/embed/job_board?for=nextdoor&amp;b=https%3A%2F%2Fabout.nextdoor.com%2Fcareers%2F#51115">Careers page</a> for open opportunities across all our teams and functions.</p><p>Written by <a href="https://www.linkedin.com/in/waltleungwbl/">Walt Leung</a> and <a href="https://www.linkedin.com/in/shaneausleybutler/">Shane Butler</a>, with support from <a href="https://www.linkedin.com/in/hai-guan-6b58a7a/">Hai Guan</a>, <a href="https://www.linkedin.com/in/charissarentier/">Charissa Rentier</a>, <a href="https://www.linkedin.com/in/qi-he/">Qi He</a>, and <a href="https://www.linkedin.com/in/jdperlow/">Jonathan Perlow</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ac95adf9da81" width="1" height="1" alt=""><hr><p><a href="https://engblog.nextdoor.com/catching-anomalies-early-in-mobile-app-releases-ac95adf9da81">Catching Anomalies Early in Mobile App Releases</a> was originally published in <a href="https://engblog.nextdoor.com">Nextdoor Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Typeahead Search at Nextdoor]]></title>
            <link>https://engblog.nextdoor.com/typeahead-search-at-nextdoor-1875e70c67e8?source=rss----5e54f11cdfdf---4</link>
            <guid isPermaLink="false">https://medium.com/p/1875e70c67e8</guid>
            <category><![CDATA[autocomplete]]></category>
            <category><![CDATA[system-design-project]]></category>
            <category><![CDATA[geohash]]></category>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[search]]></category>
            <dc:creator><![CDATA[Jerry Tian]]></dc:creator>
            <pubDate>Wed, 06 Jul 2022 19:49:06 GMT</pubDate>
            <atom:updated>2022-07-06T19:49:06.811Z</atom:updated>
            <content:encoded><![CDATA[<h3>Background</h3><p>In a thriving community, people are connected to their friends and local businesses. Nextdoor is the hyperlocal platform that mirrors these offline relationships. Every day, through active discussions on the platform, new relationships are formed and existing ones strengthened.</p><p>For example, a Nextdoor user can create a post like “I really like @<strong>XYZ cafe</strong>. @<strong>John</strong> is a hard working business owner and we should all support him by buying a cup of delicious latte!” Here, the post is created by <a href="https://techcrunch.com/2022/02/15/nextdoor-revamps-with-new-profiles-feed-and-more-community-building-features/?guccounter=1#:~:text=Neighborhood%20members%20will%20also%20be%20able%20to%20connect%20and%20%40mention%20one%20another%20in%20posts%20and%20comments%2C%20similar%20to%20other%20social%20networks%2C%20like%20Twitter%20or%20Facebook.">at-mentioning</a> (via the @ symbol) nearby businesses and users. From this post, users in the neighborhood can contribute by at-mentioning others to be part of the comment threads. As a result, John’s cafe thrives and acts as a neighborhood hub where new friends are made.</p><p>Every month, millions of these mentions are created in various discussions (including <a href="https://youtu.be/PYfT7d3ZSw4?t=801">lost dogs</a>!). In addition to posts and comments, a user can type into the search box and see, among other things, nearby users and businesses. 
All these features are powered by the same autocomplete service: a set of APIs to ingest data and handle typeahead search of different entity types (businesses, users, keywords, etc.) on Nextdoor.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*C5UxSGX9xQlUeIYP3f78qg.png" /></figure><p>This post focuses on how we built a proximity-based typeahead service to power typeahead use cases at Nextdoor.</p><h3><strong>Proximity-Based Typeahead Search as a Service</strong></h3><p>Any good search experience can be boiled down to two core components:</p><ol><li>Relevance. Given a search query, does the user see relevant results? Because Nextdoor is a hyperlocal social network, relevance is heavily weighted by geographic proximity.</li><li>Low latency. <a href="http://radar.oreilly.com/2009/07/velocity-making-your-site-fast.html">Google Search</a> found that<blockquote><em>a 400 millisecond delay resulted in a -0.59% change in searches/user. What’s more, even after the delay was removed, these users still had -0.21% fewer searches, indicating that a slower user experience affects long term behavior.</em></blockquote>For a good autocomplete experience, relevant results should show up instantaneously as users type.</li></ol><p>To meet the product requirements, we set out to build a service with the following design goals:</p><ol><li>Low latency. There are hundreds of millions of entities on Nextdoor. The search latency at the service level should be less than 50ms.</li><li>Horizontally scalable to meet future scaling needs (we scale by adding more nodes).</li><li>Extensible. Typeahead search is a foundational API that enables other product features, so it should be easy to add other types of entities in the future.</li><li>High throughput for writes. We want to be able to index hundreds of millions of entities in a matter of hours.</li><li>Ease of operation and maintenance. 
Indexing records should not impact production traffic.</li></ol><h3><strong>Implementation</strong></h3><p>We landed on an in-memory solution that leverages geohashing. At a high level, geohashing divides the earth into zones based on latitude and longitude. It provides a good way to shard a large data set into buckets based on a <a href="https://h3geo.org/docs/core-library/restable/">zoomed-in level</a>. Entities in the same bucket are in close proximity.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/577/0*SwpBX3o9lg-atIFK" /></figure><p>We used <a href="https://h3geo.org/">H3</a>, Uber’s open source hexagonal hierarchical geospatial indexing library, as our geohashing scheme.</p><p>For handling typeahead search, we decided to use Redis <a href="https://redis.io/docs/manual/data-types/#sorted-sets">sorted sets</a>. This gives us several benefits:</p><ol><li>In-memory storage gives us the best possible latency characteristics for handling typeahead search.</li><li>It is easy to maintain. We can rely on <a href="https://redis.io/topics/persistence">Redis persistence</a> without having to handle durability ourselves.</li><li>By following the Command Query Responsibility Segregation (<a href="https://martinfowler.com/bliki/CQRS.html">CQRS</a>) pattern, we are able to index hundreds of millions of entities in a matter of hours with no impact on the serving of production traffic. Ingestion is handled by the Redis primary nodes in the cluster, and updates are then replicated to the read-only nodes, which handle the typeahead search. 
Replication lag is less than 10ms.</li></ol><p>With these two core pieces in place, we built a set of APIs that work together to handle all aspects of typeahead search:</p><ul><li>indexing_api (ingestion)</li><li>typeahead_api (search)</li><li>ranking_api (ranking by entity types)</li></ul><h3><strong>Ingestion Path</strong></h3><p>Here is an example of the ingestion flow for businesses (Starbucks with id 5, latitude 47, and longitude -122):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/752/0*dOmpgMTb5QX3S94f" /></figure><p>For users, the typeahead search API works for both first- and last-name prefixes. Here is an example of the ingestion flow for users (Steve Jobs with id 4, latitude 47, and longitude -122):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/752/0*msxJfITfOraJZrkw" /></figure><h3><strong>Query Path</strong></h3><p>With the above structure in place, typeahead search retrieves results with a simple look-up using entity type, geohash key, and prefix. We then hydrate and rank the results before returning them to the client.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/821/0*6qzz9RRNnHBJqZlo" /></figure><h3><strong>What we have today</strong></h3><p>The service has been running since August 2021. Every month we handle hundreds of millions of typeahead search requests, and millions of comments with at-mentions are created. 
P95 search latency at the service level is less than 30ms.</p><h3><strong>Future work: typeahead for 1+ degree connections</strong></h3><p>To handle typeahead search with 1+ degrees (friends of friends), we can</p><ol><li>Get a list of 1st degree connections.</li><li>For each connection from step 1, get their connections.</li><li>Aggregate these connections and perform typeahead by prefix.</li></ol><p>To reduce the network round trips between the first and successive calls, we can leverage <a href="https://redis.io/docs/manual/programmability/eval-intro/">Lua</a> scripting to run this logic inside Redis.</p><h3>Acknowledgement</h3><p>It takes a team to move mountains! I would like to take the opportunity to give a shoutout to all the dedicated Nextdoor folks behind this endeavor:</p><p>Shivam Bhalla, Stephen Cheng, Yuki Mizuno, Rajesh Balasa, Siva Pandeti, Uzair Khan, Sharvil Parekh, Hung Dao, Josh Sibelman, Bojan Babic, Jane Wang, Sudhanshu Siddh, Omer Palaz, Kristy Duong, Tristan Eastburn, Paul Meng, Cory Dolphin, Andrew Munn, Tim Wong, Chintan Shah, Rahul Sureka, Madeline Neveaux, Murali Krishna Hosabettu Kamalesha, Glen Tona, and Avinash Chukka.</p><p>And by the way, we are <a href="https://about.nextdoor.com/careers/">hiring</a>!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1875e70c67e8" width="1" height="1" alt=""><hr><p><a href="https://engblog.nextdoor.com/typeahead-search-at-nextdoor-1875e70c67e8">Typeahead Search at Nextdoor</a> was originally published in <a href="https://engblog.nextdoor.com">Nextdoor Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>