August 7, 2023

A Shift in ML Deployment

ML infrastructure has a problem: tools that can work for a given task often aren’t designed for that task. Chris Lattner, co-inventor of LLVM and CEO of Modular, calls this the “hand-me-down problem” of ML. He, and many others, are trying to solve it.

A good example of the hand-me-down problem is deploying a PyTorch model to production using Flask. The PyTorch Tutorials page notes, “Using Flask…is by far the easiest way to start serving your PyTorch models, but it will not work for a use case with high performance requirements.” For that, you need to export the model to TorchScript so it can run in a C++ runtime, or maybe convert it to ONNX format and run it with ONNX Runtime… Why is crossing the chasm from development to production so hard? Shouldn’t there be a tool designed to solve this task in a simple way?
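To make this concrete, here’s a minimal sketch of that conversion step, using a stock torchvision ResNet-18 as a stand-in for a trained model; the file names are illustrative.

```python
import torch
import torchvision

# Assume a trained model; torchvision's ResNet-18 stands in here.
model = torchvision.models.resnet18(weights=None).eval()
example_input = torch.randn(1, 3, 224, 224)

# Option 1: trace the model into TorchScript so it can be loaded
# from a C++ runtime (or from Python without the original source code).
traced = torch.jit.trace(model, example_input)
traced.save("model_torchscript.pt")

# Option 2: export to ONNX and serve it with ONNX Runtime instead.
torch.onnx.export(model, example_input, "model.onnx",
                  input_names=["input"], output_names=["output"])
```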

The ML toolchain has had makeovers before, but this time it’s different. Owing to the rapid and widespread adoption of LLMs, a new set of purpose-built tools has emerged to solve discrete tasks in the ML lifecycle. Of particular interest at Felicis are tools focused on ML deployment and serving, which essentially boils down to connecting GPUs to trained models in a timely, scalable, cost-effective manner. Despite the apparent simplicity, multiple distinct pathways to the same end have taken shape. Is this the hand-me-down problem all over again, or will a winner emerge that finally redefines the ML deployment category?

In this post I’ll explore each pathway at a high level and discuss three factors that brought us to this point. If you’re building a startup that reimagines the post-development journey in ML, please reach out!

First, it’s important to clarify the differences between ML deployment, serving, hosting, and inference, terms that, in a broad sense, relate to the process of delivering predictions from a model to an end user. If you’re an ML veteran, skip this paragraph. Deployment refers to the overall process of packaging a trained model so it can be used in production. It’s often discussed in tandem with development, the process of making an ML model. Hosting involves publishing a trained model to the cloud, and serving is the process of responding to prediction requests from users, often as an API endpoint. Inference is the actual process of generating predictions from input data. Understanding these nuances will make the following paragraphs clearer.
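To make the distinction concrete, here’s a minimal sketch assuming a TorchScript artifact like the one exported above: loading the file is the tail end of deployment, the forward pass is inference, the Flask route wrapping it is serving, and running this app on a cloud machine would be hosting.

```python
import torch
from flask import Flask, jsonify, request

app = Flask(__name__)
model = torch.jit.load("model_torchscript.pt")  # the packaged ("deployed") artifact
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    # Serving: accept a request, run inference, return the prediction.
    features = torch.tensor(request.get_json()["input"], dtype=torch.float32)
    with torch.no_grad():
        output = model(features)  # inference: the forward pass itself
    return jsonify({"prediction": output.tolist()})

if __name__ == "__main__":
    app.run(port=8080)  # hosting would mean running this app on cloud infrastructure
```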

Here’s how we segment the market. We’ve divided the modern ML deployment and serving landscape into a few categories: decentralized GPU networks, serverless GPU platforms, GPU clouds, inference acceleration platforms, and model repositories with hosting as a service. On top of this, we see general-purpose ML development platforms, often built around popular ML frameworks like PyTorch or Ludwig, offering model hosting, and even LLM agent platforms monetizing through hosting. And, to add another dimension, users who don’t want to worry about model development or deployment can use a foundation model API, while those who care deeply about controlling the compute resources their models run on can buy GPUs or specialized AI hardware. The sheer number of categories, let alone the options within each category, can feel disorienting. We mapped the “core” inference compute categories below, with “alternate” inference compute categories mapped beneath them.

Note: this is a non-exhaustive work-in-progress. Some companies don’t fit neatly into one category, but we chose the category that best matched the positioning inferred from their website.

Understanding which of the above approaches best serves your needs as a buyer is critical. For most use cases at the pre-Series A stage, working with a foundation model API suffices. We think the associated cost is akin to cloud storage costs and, pre-product-market-fit, is preferable to training a model from scratch (the upfront CapEx and time burden of training a model is harder to justify when customer ROI is hazy). For companies with access to large amounts of proprietary, clean data or a distribution advantage, training a model from scratch can make a lot of sense, particularly as open-source LLMs continue to improve (see the Llama 2 benchmarks). These companies can turn to platforms like Predibase, which bring development and deployment under one roof, or they can use a framework like PyTorch and connect to a GPU cloud. GPU cloud providers offer flexibility and control over compute resources, but they can sometimes be difficult to set up. Serverless GPU platforms like Modal are attractive for buyers who want simplicity and ease of management (see the 2023 State of Serverless GPUs). Model repositories like Replicate are good options for experimenting with public models or seamlessly pairing your own model with a GPU cloud instance. Replicate also provides an interactive GUI and an HTTP API for your model, features that have driven roughly 100M Stable Diffusion model runs.
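For a flavor of what “just call an API” looks like in practice, here’s a minimal sketch of hitting a hosted model over HTTP; the endpoint URL, token, and payload shape are placeholder assumptions rather than any specific provider’s API.

```python
import requests

# Placeholder values: substitute your provider's actual endpoint and credentials.
ENDPOINT = "https://api.example-model-host.com/v1/predict"
API_TOKEN = "YOUR_API_TOKEN"

response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"prompt": "Summarize the benefits of serverless GPUs in one sentence."},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```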

For some buyers, cost is the biggest consideration. Decentralized GPU providers like Together and Foundry have drawn significant attention in recent months. Broadly, these companies connect a network of otherwise idle GPUs (from, say, gaming rigs) to the rising tide of demand for ML compute. Doing so may yield a lower $/hour GPU cost than the hyperscalers and GPU cloud providers offer. For customers that are latency-, performance-, and cost-sensitive, turning to an inference acceleration platform like Exafunction could be the right path. And for sophisticated buyers with ample access to capital, buying their own AI hardware from the likes of Nvidia or Tenstorrent could be the right move; Inflection, for example, recently built a supercomputer equipped with 22,000 NVIDIA H100 GPUs. One interesting implication: in a GPU shortage, more GPUs can translate to faster time to market or more performant models. It follows that buying up GPUs or partnering with a hyperscaler gives companies like OpenAI and Mistral AI an edge over their peers.

I believe three factors have pulled ML deployment and serving into the limelight this year: 

  1. Rising inference costs amid a GPU shortage
  2. “Barbelled” ML budgets driving startups to the extremes of ML development
  3. A renewed focus on DevEx without sacrificing model performance

The first factor is rising inference costs. One estimate from SemiAnalysis puts the inference cost of running ChatGPT at $700K per day. For smaller startups, serving predictions at this scale would be untenable. And while GPU FLOP/s per dollar is improving at a rate of 2× every 2-3 years, skyrocketing demand for AI products outstrips this scaling law. This dynamic (well covered here) has catalyzed much of the activity in the ML deployment and serving category. Demand exceeds supply, driving the price of compute up, which in turn brings startups into the space to (1) make money by renting GPUs, and (2) drive prices back down through innovation (like fractional GPUs) and competition. We’re not quite at step (2), but that’s what I think will happen.

Source: Marius Hobbhahn and Tamay Besiroglu (2022), "Trends in GPU price-performance"
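As a back-of-envelope illustration of why that price-performance cadence can’t keep up on its own, here’s a small sketch; the $700K/day figure is the SemiAnalysis estimate cited above, while the demand growth rate is a purely hypothetical assumption for illustration.

```python
# Back-of-envelope: how much does a fixed inference bill shrink if GPU
# price-performance doubles every ~2.5 years, versus how fast demand grows?
# The $700K/day figure is the SemiAnalysis estimate cited above; the
# 10x-per-year demand growth is a hypothetical assumption for illustration.

daily_cost_today = 700_000           # USD/day (ChatGPT-scale inference, per SemiAnalysis)
doubling_period_years = 2.5          # GPU FLOP/s per dollar doubles roughly this often
assumed_demand_growth_per_year = 10  # hypothetical: 10x more inference demand each year

years = 2
price_performance_gain = 2 ** (years / doubling_period_years)  # ~1.7x cheaper compute
demand_gain = assumed_demand_growth_per_year ** years          # 100x more demand

projected_daily_cost = daily_cost_today * demand_gain / price_performance_gain
print(f"Compute gets ~{price_performance_gain:.1f}x cheaper per FLOP, but the bill "
      f"grows to ~${projected_daily_cost / 1e6:.0f}M/day under these assumptions.")
```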

The second factor is the barbell shape of ML budgets. We frequently hear from enterprises that they have well-defined budgets for the beginning and end of ML projects; everything in the middle is harder to justify. This means there’s budget for data labeling, which has played no small part in the success of companies like Surge AI, Labelbox, and Scale, and there’s budget for serving, which happens at the end of the ML lifecycle. Despite all the activity in the LLMOps category, I don’t think the barbell effect is going away. ML companies need to make money, and renting GPUs is a pretty good way to do it (see CoreWeave’s 2024 projections). The long-term margin implications of reselling compute aren’t great, but we’ll save that topic for another post.

The third factor relates to developer experience. Reducing complexity is paramount as more software engineers become ML engineers, and more models cross the development-production chasm. In a recent podcast, Chris Lattner said, “the bitter enemy that we’re fighting against in the industry…is complexity.” I couldn’t agree more. Moving forward, developers won’t want to convert their PyTorch model from Python to C++ or ONNX. They’ll want a push-button solution that does it for them. That’s what companies like Modal, Replicate, Predibase, and Foundry do really well. They take a previously complicated concept and bring a fresh simplicity to it without sacrificing model performance. DevEx used to come at the expense of model performance. We think the company that redefines ML deployment will achieve both.
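To illustrate the kind of developer experience this points toward, here’s a toy sketch of a “push-button” interface; the gpu_endpoint decorator is a hypothetical construct invented for illustration, not the API of Modal, Replicate, or any other platform mentioned in this post.

```python
# A toy sketch of the "push-button" developer experience described above.
# `gpu_endpoint` is hypothetical and exists only for illustration.
from functools import wraps

def gpu_endpoint(gpu: str = "A100"):
    """Pretend this packages the function, provisions a GPU, and exposes an HTTP endpoint."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            print(f"[deploy] would run {fn.__name__} on a {gpu} behind an HTTP endpoint")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@gpu_endpoint(gpu="A100")
def generate(prompt: str) -> str:
    # In a real deployment this would load a model and run inference on the GPU.
    return f"(model output for: {prompt})"

print(generate("Hello, world"))
```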

We think the hand-me-down problem in ML is coming to an end. At least in the ML deployment and serving category, where real enterprise budgets exist, it feels like a shift has taken place: from complexity to simplicity, from siloed to shared, from fragmented to unified. It’s too soon to tell which pathway (serverless GPU platforms, GPU clouds, decentralized GPU networks) will yield a multi-billion-dollar outcome in the broader deployment category, but we believe one will. The opportunity is too large and the market timing too good. At Felicis we’re committed to investing in category-defining companies led by founders who question the status quo. If that sounds like you, and you’re passionate about reinventing ML deployment, please reach out.