Running ML Inference Services in Shared Hosting Environments

Danny Luo
Nextdoor Engineering
Oct 29, 2021


Our ML Inference Dashboard

The CoreML team had the pleasure of presenting ‘Running ML Inference Services in Shared Hosting Environments’ at the MLOps: Machine Learning in Production Bay Area Virtual Conference. The presentation draws on the Nextdoor CoreML team’s six years of experience productionizing and operating 30+ real-time ML microservices.

Abstract

Running an ML inference layer in a shared hosting environment (ECS, Kubernetes, etc.) comes with a number of non-obvious pitfalls that significantly affect latency and throughput. In this talk, we describe how Nextdoor’s ML team ran into these issues, traced their sources, and fixed them, ultimately cutting latency by a factor of 4, increasing throughput 3x, and improving resource utilization (CPU 10% -> 50%) while maintaining performance. The main areas of concern are request queue management and OpenMP parameter tuning.
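To make the OpenMP point concrete: runtimes that use OpenMP or a BLAS backend typically size their thread pools to the host’s core count, which in a shared hosting environment can be far more CPU than the container is actually allotted, leading to oversubscription and latency spikes. The sketch below shows one common mitigation, pinning the thread-count environment variables to the container’s CPU quota. It is an illustration under assumptions (cgroup v1 paths, a fallback of one CPU), not Nextdoor’s actual configuration.

```python
# Illustrative sketch only: derive a thread count from the container's CPU
# quota instead of the host's core count. Assumes cgroup v1 paths; cgroup v2
# uses /sys/fs/cgroup/cpu.max instead.
import math
import os

def container_cpu_limit(default: int = 1) -> int:
    """Return the container's CPU quota in whole CPUs, or a fallback."""
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
        if quota > 0 and period > 0:
            return max(1, math.floor(quota / period))
    except (OSError, ValueError):
        pass
    return default

cpus = container_cpu_limit()
# These must be set before the ML framework (NumPy, PyTorch, ONNX Runtime, ...)
# initializes its thread pools, e.g. in the container entrypoint or at the
# very top of the service's startup code.
os.environ.setdefault("OMP_NUM_THREADS", str(cpus))
os.environ.setdefault("OPENBLAS_NUM_THREADS", str(cpus))
os.environ.setdefault("MKL_NUM_THREADS", str(cpus))
```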

What You’ll Learn

  1. Why your load balancing algorithm matters
  2. The importance of request queue timeouts for service recovery (see the sketch after this list)
  3. What resources are actually being shared in a shared hosting environment
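As a rough illustration of the second point, the sketch below bounds how long a request may sit in the inference queue before it is shed: once a backlog forms, scoring requests whose callers have already timed out only prolongs the outage. The queue size, timeout value, and `model.score` call are assumptions for illustration, not the team’s actual implementation.

```python
# Illustrative sketch only: shed requests that have outlived their queueing
# budget so a backlog drains quickly and the service recovers after a spike.
import queue
import time
from concurrent.futures import Future

QUEUE_TIMEOUT_S = 0.2                  # assumed budget; tie it to the caller's timeout
# Each item is (enqueue_time, features, reply_future); bounded so we reject
# new work instead of letting the backlog grow without limit.
requests = queue.Queue(maxsize=1024)

def worker(model):
    while True:
        enqueued_at, features, reply = requests.get()
        if time.monotonic() - enqueued_at > QUEUE_TIMEOUT_S:
            # The caller has almost certainly given up; scoring this request
            # would only delay fresher ones behind it.
            reply.set_exception(TimeoutError("shed stale request"))
            continue
        reply.set_result(model.score(features))  # model.score is a placeholder
```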

Speaker Bio

Danny is a Machine Learning Engineer on the Nextdoor CoreML team in San Francisco. He builds ML infrastructure for a variety of production ML use cases at Nextdoor, such as feed, notifications, and ads. Danny previously worked in the ML space in Toronto and has given talks at the Toronto Machine Learning Micro-Summit Series, the Toronto Deep Learning Series, and an Apache Spark meetup. He holds a B.Sc. in Mathematics & Physics from the University of Toronto. Learn more at dluo.me.

Machine Learning at Nextdoor

Let us know if you have any questions or comments about Nextdoor’s Machine Learning efforts. And if you’re interested in solving challenging problems like these, come join us! We are hiring Machine Learning engineers. Visit our careers page to learn more!
