Tobi Coker, Feyza Haskaraman (11 min read)

We're living in a multi-model world

How alternative models support a valuable multi-model stack, and where we think the next $10B+ company gets built.

June 15, 2026

[Figure: Chart of AI frontier lab leaders across various scoring categories over time]

For the four years since ChatGPT’s release, the key question in AI has been who wins the model wars. With the two big winners filing for what could be trillion-dollar IPOs, that question has mostly been answered. Now the key question for VCs and founders is how many models will be used, and what opportunities emerge in a multi-model world that sits alongside the leading duopoly.

Some early answers arrived in our recent AI Infrastructure Survey, and they surprised us. Every team we surveyed runs more than one model in production. Sixty percent run more than four, and fifty percent expect to use even more models in the future. So, despite the revenue and funding concentration in the frontier model layer that we've been living with for the past several years, there are signs of life for a multi-model future, with significant growth just over the horizon.

In other words, two things can be true at the same time: First, the model layer is consolidating at the top. Second, the stack around it is expanding. This article is about the second truth: what to build, and where to invest.

What's driving the multi-model market

Companies aren't running multiple models because they enjoy the complexity. They're doing it for reasons that keep getting more pronounced, not less.

Cost has started to matter.

In our follow-on survey of AI-native engineering leaders, 70% expect token costs to become a critical budget item requiring active management within the next one to two years. Uber and other large companies are vocally curbing AI costs. The methods they plan to use aren't exotic: switching dynamically between models, running models locally or on owned infrastructure, leaning harder on open source, or even capping usage of premium models. Each one is a multi-model strategy by definition.

The cost pressure isn't theoretical. Cost per token already ranks as the largest single constraint on AI workloads in our infrastructure survey, ahead of GPU availability, latency, and model quality. Companies adopted AI to accelerate development and improve margins, and some are starting to notice they now spend more on tokens than they used to spend on the engineers the tokens augmented. The first window of AI adoption, in which buyers paid for performance and ignored the bill, is giving way to closer scrutiny. The next one is about getting the cheapest sufficient model into each call.
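To make "cheapest sufficient model" concrete, here is a minimal sketch of how a team might express that rule in code. Everything in it is an assumption for illustration: the model names, prices, and quality scores are hypothetical, and in practice the scores would come from a team's own evaluations rather than a hard-coded table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str                       # hypothetical model identifier
    usd_per_million_tokens: float   # illustrative price, not a real quote
    quality: dict[str, float]       # task -> score in [0, 1] from your own evals


def cheapest_sufficient(options, task, min_quality):
    """Return the lowest-priced option whose score for `task` meets the bar."""
    eligible = [m for m in options if m.quality.get(task, 0.0) >= min_quality]
    if not eligible:
        raise LookupError(f"no model meets quality {min_quality} for task {task!r}")
    return min(eligible, key=lambda m: m.usd_per_million_tokens)


# Hypothetical catalog: a premium model, a mid-priced open model, a local model.
CATALOG = [
    ModelOption("frontier-large", 15.00, {"reasoning": 0.95, "classification": 0.97}),
    ModelOption("open-weights-medium", 1.50, {"reasoning": 0.80, "classification": 0.94}),
    ModelOption("local-small", 0.20, {"reasoning": 0.55, "classification": 0.90}),
]

pick_classification = cheapest_sufficient(CATALOG, "classification", 0.90)
pick_reasoning = cheapest_sufficient(CATALOG, "reasoning", 0.90)
```

With these made-up numbers, the classification call lands on the cheapest model while the reasoning call still goes to the premium one. The same rule yields different models per task, which is why cost management and multi-model usage tend to arrive together.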

[Figure: Inference Spend Is Breaking Out Of Engineering Budgets]

Open source caps the price of the closed-source ones.

A vibrant, high-quality open-source ecosystem plays a big role. The OpenRouter/a16z State of AI report from December 2025 puts open-source models at roughly 30% of inference traffic, with Chinese labs (DeepSeek, Qwen, Kimi) doing most of the work. The bigger effect isn't the share itself. It's what the existence of credible open-source alternatives does to closed-source pricing. Frontier API prices will eventually have a soft ceiling: the cost of self-hosting an open-source model that's good enough. Across the new crop of inference clouds, open-source alternatives are offered at prices roughly 10x lower than models at the frontier. Whether or not teams choose open source, its availability gives them leverage that will eventually show up in pricing negotiations with Anthropic or OpenAI. The US open-source ecosystem lags China's, but a host of domestic players will keep rooting for open source, from Nvidia to inference clouds to enterprises.
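The soft-ceiling argument can be reduced to back-of-the-envelope arithmetic. The sketch below estimates the per-token cost of self-hosting and compares it to an API price. All inputs (GPU hourly cost, throughput, utilization, and the API price) are hypothetical placeholders, and the calculation leaves out engineering time, networking, and other overhead.

```python
def self_host_usd_per_million_tokens(gpu_usd_per_hour, tokens_per_second, utilization):
    """Effective cost of serving one million tokens on rented or owned GPUs."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Tokens actually served per hour, after accounting for idle capacity.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs: a $4/hour GPU serving 2,000 tokens/second at 50% utilization.
ceiling = self_host_usd_per_million_tokens(
    gpu_usd_per_hour=4.0, tokens_per_second=2000, utilization=0.5
)

# Hypothetical frontier API price per million tokens, for comparison only.
api_price = 10.0
premium_multiple = api_price / ceiling
```

Under these assumed numbers the self-hosted cost comes to about $1.11 per million tokens, so the API price sits at roughly nine times the ceiling. The buyer does not need to self-host to benefit; the gap is what they can point to in a pricing conversation.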

No single model is winning everything.

To see the dynamic competition and leapfrogging across all model categories, look no further than Arena’s benchmark leaderboards. Reasoning, code generation, multimodal grounding, and latency each tend to have a different leader at any given point, and the leaders change monthly. Picking one model for the whole stack means accepting losses on three or four dimensions to win on one. Enterprise teams won’t accept that trade, especially as use cases grow more complex. They pick by task. In our infrastructure survey, 26.3% of teams describe managing multiple models in production as their single biggest infrastructure pain, which is itself the clearest evidence that they're doing it. (See Arena for the live leaderboard across 10+ distinct categories.)

Looking further out, 70% of the engineering leaders we surveyed expect to use the same number of models, or more, two years from now. Model strategy isn't trending toward consolidation; it's trending toward growth.

Persistent competition to the duopoly

It's tempting to read this as a transitional moment: the big two keep attracting capital and talent, the challengers fade, customers eventually settle on one primary provider for everything. We don't think that's where this ends. The reasons are mostly structural, not about model quality:

Capital is funding the long tail in a way the public-cloud era never saw.

Cloud overwhelmingly consolidated to three winners because the capex to build a hyperscaler was prohibitive and the switching costs across clouds were high. There are some similarities in the AI cycle, but also key differences, such as lower-capex alternatives and lower switching costs today. Arguably the biggest difference is the level of fundraising fueling ongoing competition. Well-funded competitors like xAI, SSI, Mistral, Cursor, and newer labs such as Flapping Airplanes and Recursive Superintelligence exist at scales that would have built credible infrastructure companies a decade ago. Leaders in specific modalities, like Runway for video, Black Forest Labs for image, and ElevenLabs for voice, add another layer of competition. Add Google’s Gemini and any future Big Tech attempts, and it’s enough for customers to feel they have somewhere else to go if Anthropic or OpenAI press their pricing leverage too hard.

The inference market needs more than two model companies to serve.

Inference providers like Fireworks, Baseten, Modal, Deep Infra, and Together built businesses on the premise that customers want to run different models in different regions at different costs and latencies. If the world ever consolidates to two labs that absorb their own inference, that whole layer disappears or weakens. Investors are betting it doesn’t: Baseten has been in talks at an $11B valuation, Together at $7.5B, Fireworks at $15B, Fal at $8B, and Modal at $4.65B, and public neoclouds like Nebius ($60B+) point the same way. This capital, along with the ecosystem’s key supplier (Nvidia), will be rooting for and fueling model alternatives.

Switching costs aren't what they were in any prior platform shift.

Migrating from AWS to GCP is a multi-year project. Switching from Anthropic to OpenAI for a specific workload is an afternoon. That asymmetry caps the pricing power the duopoly can exert, and it's already changing how teams build. Production stacks are being designed to swap models in and out, not to commit to one. The race is on for model providers to move up the stack and find ways to lock customers in, but today they primarily aspire to sell tokens and leave the harness, context, observability, and other layers to third parties (though the duopoly has been competing more aggressively at these layers of late). Memory aside, switching costs remain low today.
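One way to picture a stack designed to swap models in and out is a thin interface that every provider adapter implements, with workloads mapped to adapters in a single table. The sketch below is an assumption-laden illustration: `ChatModel`, `EchoProviderA`, `EchoProviderB`, and `ROUTES` are hypothetical names, and the stand-in classes only echo text where a real adapter would call a vendor's SDK.

```python
from typing import Protocol


class ChatModel(Protocol):
    """Minimal interface every provider adapter is assumed to implement."""

    def complete(self, prompt: str) -> str: ...


class EchoProviderA:
    """Stand-in for one vendor's client; a real adapter would call that vendor's SDK."""

    def complete(self, prompt: str) -> str:
        return f"[provider-a] {prompt}"


class EchoProviderB:
    """Stand-in for a second vendor's client."""

    def complete(self, prompt: str) -> str:
        return f"[provider-b] {prompt}"


# Workload name -> model. Application code only ever talks to this table.
ROUTES: dict[str, ChatModel] = {
    "summarize": EchoProviderA(),
    "extract": EchoProviderB(),
}


def run(workload: str, prompt: str) -> str:
    return ROUTES[workload].complete(prompt)


before = run("summarize", "Q2 report")
ROUTES["summarize"] = EchoProviderB()  # the "afternoon" migration, in miniature
after = run("summarize", "Q2 report")
```

Because callers depend only on the interface, moving a workload to another provider is a one-line change to the routing table. Prompts and evaluations would still need re-tuning in practice, which is where most of that afternoon would likely go.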

The multi-model stack: where value gets built

If the multi-model world is durable, the next question is where the value lands. Our view is that it lands at multiple layers. One internal frame we use for the architecture is what we've started calling the three-brain problem.

Brain 1: The Frontier

The polymath layer is well known: Anthropic and OpenAI are leading. Frontier models continue to improve at a steady clip, and 56.5% of teams in our infrastructure survey use closed frontier models exclusively for their most critical workloads. They're powerful, but they're incomplete on their own. A frontier model doesn't know what happened this morning, and it doesn't know anything about the company calling it until it's been given the context to do so.

Brain 2: Real-Time Knowledge

AI systems need access to the live web. Foundation labs are bundling some of this functionality in-house, but we’re also seeing increased preference for specialized retrieval and search infrastructure like Exa, Parallel, and Firecrawl. These specialized, agent-friendly models do a better job of connecting frontier models to current reality. For some use cases, we expect the foundation lab leaders to capture this area, but the specialized search providers tend to win on cost, freshness, or latency, and we expect them to stay critical components of the agent stack.

Brain 3: The Enterprise Context

Every organization carries internal knowledge that's foreign to any foundation lab: documents, workflows, customer data, the institutional context that lives in nobody's training set. It also has employees generating agentic training data every day. This is the brain we think is most underestimated by investors and most strategically important to enterprises themselves.

Solutions for this category include the RLaaS labs (Thinking Machines, Applied Compute) competing to build customer-specific models, alongside the services and forward-deployed-engineer business models that have to sit beside the software to make any of it useful. The category is harder to scale than pure-play SaaS. For the same reason, it's harder to commoditize. Enterprises will want ownership over this layer to avoid lock-in and protect their own data, and we expect value to accrue most reliably to the companies that learn to automate business-process mining and context mapping at scale. A fine-tuned, improved, or customer-specific model ultimately means more models in the overall mix.

These three brains rarely live inside a single model, and three is the minimum configuration, not the typical one. Building differentiated enterprise AI systems requires using more models, picking the right one for each call, observing the result, and controlling the cost. Making this easier requires more from the layers above them.

Orchestration & Model Routing

We've written separately about our view on model routing, so we'll keep this brief. The orchestration layer (routing, evaluation, observability, governance) is real and underbuilt, though it won't be the only place multi-model value gets captured. In our infrastructure survey, 47.4% of teams said they have agents in production as a core product feature, yet there is no standard layer for durable state, failure recovery, or workflow replay, so each team is building its own. Early entrants like OpenRouter, Braintrust, Onyx, and Runlayer are each taking pieces of the problem. None has stitched the full stack together yet.
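As a rough illustration of the failure-recovery piece that teams are building for themselves, the sketch below tries providers in order and records every attempt. The function and provider names are hypothetical, the failure is simulated, and the attempt log lives in memory only. A production layer would persist it somewhere durable so a workflow could be inspected or replayed.

```python
def call_with_fallback(prompt, providers, log):
    """Try each (name, fn) pair in order; append one record per attempt to `log`."""
    for name, fn in providers:
        try:
            result = fn(prompt)
        except Exception as exc:
            # Record the failure and move on to the next provider.
            log.append({"provider": name, "ok": False, "error": str(exc)})
            continue
        log.append({"provider": name, "ok": True, "error": None})
        return result
    raise RuntimeError("all providers failed")


def flaky(prompt):
    """Hypothetical primary provider that always times out."""
    raise TimeoutError("simulated timeout")


def steady(prompt):
    """Hypothetical backup provider that always answers."""
    return f"answer to: {prompt}"


attempts = []
answer = call_with_fallback(
    "status?", [("primary", flaky), ("backup", steady)], attempts
)
```

Here the primary call fails, the backup answers, and `attempts` holds one record for each. Even this toy version shows why a shared layer is attractive: every team running multiple models ends up writing some variant of this loop.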

Cost-optimized and edge models

It remains early for model distillation, edge models (Flower.dev), and localized inference, but these are all coming. Nvidia recently announced its RTX Spark product family, bringing AI compute performance to PCs, which adds to the tailwinds for personal and localized agents. As inference moves toward devices, edges, and private deployments, the stack of smaller and specialized models grows alongside the frontier rather than at its expense.

Applications

The applications that scaled fastest in this cycle built on top of multiple frontier models, not on one provider's API. Higgsfield's infra for video and image gen is one example.

Cursor is an even more instructive case study: it leaned heavily on third-party models early and is now selectively internalizing capabilities (Composer 2) as the economics warrant. We expect that pattern (multi-model for speed, selective internalization for margin) to become the default for the app layer.

There's a consumer arc here too. As privacy- and flexibility-conscious users start to matter, a "DuckDuckGo of chat" gets easier to imagine, and multi-model is the prerequisite for that product to exist at all.

Keep building

If you're an investor: OpenAI and Anthropic have moved past the venture window. They captured a big part of the foundational pie, but the overall pie is set to get much larger. From here, it’s about what gets built around them and what enterprises will actually need in order to operate well in a multi-model world.

We see outsized returns at every layer of the stack: inference, orchestration, search & context brains, smaller models, and the applications that empower users to harness the power of many models.

If you're a founder: the multi-model world is your strategy. Building deep integration with a single frontier model is the bet that will look most dated five years from now. Building across them is what will scale.

The value concentration at the top of the model layer has discouraged a lot of would-be builders from working at any of these layers. We hope this thesis changes their minds. The next wave of value outside of the duopoly remains up for grabs.

Authors

Tobi Coker
Partner
Feyza Haskaraman
Partner