January 6, 2026

LMArena: Benchmarking and Advancing the Progress of All AI Models

Every week, AI models get more capable. In just a few years, we’ve gone from systems that followed narrow, predefined rules to models that can read, reason, solve problems, write code, and interact with software. Multiple labs are pushing the frontier simultaneously, including OpenAI, Google, Anthropic, and xAI, alongside an accelerating open-source ecosystem.

With development at this velocity, two questions emerge: 

How do we measure and advance the progress of all models? And how do we benchmark their performance on real-world use cases?

As models proliferate and capabilities diverge, the world needs a clear and continuous way to understand which models are improving, how they compare, and how they behave on practical tasks. 

Enter LMArena: the human-driven benchmark and data source to help advance all AI models.

LMArena began as an open research project (Chatbot Arena) built by UC Berkeley researchers in 2023. It has since grown into the industry’s standard evaluation platform, powered by millions of real users. Today, more than 5 million people across 150 countries cast over 4 million model comparisons each month, generating a uniquely robust signal of how leading AI systems perform in practice.

The premise is simple and powerful: Users submit a prompt → two anonymous model responses appear → users vote for the stronger answer and give feedback.

Aggregated across millions of human judgments, LMArena produces live leaderboards and deep performance insights that labs rely on to map progress across text, code, image, video, and search. The benchmark is a real-time, human-in-the-loop measurement of intelligence as it evolves.
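
Under the hood, rankings like this are typically derived with a Bradley-Terry-style statistical model, the approach described in the original Chatbot Arena research. As an illustration only (the model names, vote counts, and helper function below are hypothetical, not LMArena’s production code), here is a minimal sketch of how pairwise votes can be turned into relative strength scores:

```python
from collections import defaultdict

def bradley_terry(votes, iters=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from pairwise votes.

    votes: list of (winner, loser) pairs, one per vote.
    Returns a dict mapping model name -> normalized strength.
    """
    # Count wins for each ordered pair and collect the set of models.
    wins = defaultdict(int)          # wins[(a, b)] = times a beat b
    models = set()
    for winner, loser in votes:
        wins[(winner, loser)] += 1
        models.update((winner, loser))

    # Start every model at equal strength.
    p = {m: 1.0 for m in models}

    # Standard fixed-point (MM) iteration for the Bradley-Terry MLE.
    for _ in range(iters):
        new_p = {}
        for i in models:
            total_wins = 0.0   # total votes won by model i
            denom = 0.0        # sum over opponents of games / (p_i + p_j)
            for j in models:
                if i == j:
                    continue
                n_ij = wins[(i, j)] + wins[(j, i)]
                if n_ij == 0:
                    continue
                total_wins += wins[(i, j)]
                denom += n_ij / (p[i] + p[j])
            new_p[i] = total_wins / denom if denom > 0 else p[i]
        # Normalize so strengths sum to 1 (the scale is arbitrary).
        total = sum(new_p.values())
        new_p = {m: v / total for m, v in new_p.items()}
        if max(abs(new_p[m] - p[m]) for m in models) < tol:
            return new_p
        p = new_p
    return p

# Toy example: three hypothetical models and a handful of votes.
votes = (
    [("model_a", "model_b")] * 7 + [("model_b", "model_a")] * 3
    + [("model_a", "model_c")] * 8 + [("model_c", "model_a")] * 2
    + [("model_b", "model_c")] * 6 + [("model_c", "model_b")] * 4
)

for model, strength in sorted(bradley_terry(votes).items(), key=lambda kv: -kv[1]):
    print(f"{model}: {strength:.3f}")
```

Intuitively, a model’s estimated strength rises when it wins against strong opponents more often than its current estimate predicts, which is why a large, diverse pool of voters and prompts matters so much for the reliability of the final leaderboard.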

While LMArena is best known for its leaderboard, its impact goes far deeper. It provides direct, actionable evaluation for model developers by surfacing where models excel, where they fail, and what real users expect.

Offline benchmarks miss this feedback loop entirely, because it emerges from people actually using the models, not from predefined tests. As models become more general-purpose, their performance must be assessed in the complexity of real interactions across languages, domains, and user goals. LMArena converts that global variability into a coherent picture of progress.

And unlike conventional data labeling, these evaluations come from organic usage (that is, people asking real questions, comparing real outputs, and revealing real differences in capability). Labs treat these signals as essential infrastructure for understanding how their models perform relative to the state of the art.

Of note, LMArena’s consumer platform is not secondary; it is foundational to the business. As CEO Anastasios Angelopoulos noted, “Real-world AI evaluation must be transparent, rigorous and shaped by both objective data and human judgment.” 

Consumers use LMArena because it offers something they cannot get anywhere else: free access to the newest and most powerful models, often before broad release, and the ability to experience and evaluate frontier AI directly. The model labs, in turn, treat this usage as essential infrastructure, gaining high-quality, real-world evaluation data. As a result, LMArena has become the backbone of how the world understands and advances AI.

This dynamic is what enables LMArena to generate millions of human judgments that fuel its insights and, increasingly, its revenue. In September, LMArena launched its first evaluation product, an offering now used by the world’s leading model developers. The company is on track to exceed $25M in revenue by year-end, with demand coming from labs and enterprises that rely on external evaluation to guide high-stakes deployments.

When we invested in LMArena’s seed round earlier this year, the company was already emerging as the reference point for how labs understand model performance. Today, enterprises, consumers, and model labs all see LMArena as the most reliable, independent, real-world signal of capability. As AI becomes foundational to business operations and daily life, this signal only grows more indispensable. 

We’re proud to lead the Series A in LMArena and to have Felicis General Partner Peter Deng join the board as an observer. We are thrilled to support the team as they scale the platform, deepen their scientific work, and build the next generation of consumer experiences that will bring millions more humans into the evaluation loop.