BREAKING: Harvey Co-Founder & Head of Applied Research on the Token Reckoning
$11B, 13 Trillion Tokens, + "LAB" Legal Agent Benchmark
The Token Reckoning is Coming...
Valued at $11 Billion, Harvey is on a mission to win the entire legal category, competing head-on against the trillion-dollar labs.
And they’re well on their way. This month, Harvey has passed $300M ARR, 960 employees, 2,000 customers, and roughly 13 trillion tokens processed. They've raised $1.2B to date from Sequoia, Kleiner Perkins, GV, Coatue, Elad Gil, the OpenAI Startup Fund, and GIC, with Sequoia and GIC co-leading the most recent $200M round at $11B.
→ Listen on X, Spotify, YouTube, Apple
In this episode, Gabe Pereyra, Harvey’s Co-Founder and President, and Niko Grupen, Harvey’s Head of Applied Research, sit down with me in Harvey’s San Francisco HQ speakeasy to cover:
What LAB measures & why Harvey gave the rubric to its biggest competitors
Early leaderboard results + what they show about long-horizon legal agents
Token economics now hitting application-layer AI
Why Harvey is the largest embeddings consumer for some of the labs
Multi-model strategy & why conflict risk forces every law firm multi-model
Architectural shift from chat-based products to cloud agents
LAB methodology
The billable hour coming back, this time for AI tokens
What’s next for the open-source legal research community
Coding agents hit Karpathy’s “agents work now” inflection in late 2025. Gabe argues legal is hitting its version of that curve right now.
Just a few weeks ago, Harvey open-sourced LAB, the Legal Agent Benchmark, the first open-source benchmark for measuring AI agent performance on real-world legal tasks. It covers 1,200+ tasks across 24 practice areas with 75,000+ rubric criteria, with initial leaderboard results from OpenAI, Anthropic, & DeepMind released today.
Before Harvey, Gabe was an AI researcher at Google Brain, DeepMind, and Meta, working on deep learning in 2016-2017 as the field was taking off. Niko was previously at Google Brain as well. Together, they led the open-sourcing of LAB.
This is the second episode in Sourcery’s Harvey mini-series, following the last conversation with co-founder and CEO Winston Weinberg.
TIMESTAMPS
(00:00) Gabe Pereyra (Co-Founder) & Niko Grupen (Head of Applied Research)
(00:50) Inside Harvey's Legal Agent Benchmark
(05:10) What Happens After Benchmarking?
(06:37) Why Harvey Open-Sourced Its Research
(09:21) Training Models Without Client Data
(10:32) Google Brain vs. DeepMind
(12:34) From Researcher to Founder
(15:15) The Rise of the Inference Layer
(18:38) The Agentic Shift
(21:16) Harvey's 13 Trillion Tokens
(23:48) AI's Biggest Cost Misconception
(28:37) How Top AI Founders Learn
(31:52) Learnings from Jensen Huang
(34:14) How Harvey Finds Talent
(35:41) Niko on Harvey's Breakthroughs
(36:38) Building a Legal Dataset from Scratch
Brought to you by:
Brex—The intelligent finance platform: cards, expenses, travel, bill pay, banking—wrapped into a high-performance stack. Built for scale. Trusted by teams that move fast. visit → brex.com/sourcery
Turing—Turing partners with frontier AI labs to improve model capabilities in coding, reasoning, tool use, & multimodality, as well as with Fortune 500 enterprises to build & deploy end-to-end agentic AI systems in mission-critical workflows Visit: turing.com/sourcery
VCX—VCX is the public ticker for private tech, allowing investors of all sizes to invest in venture capital. View The Portfolio at GetVCX.com
Deel—Deel is the global people platform that helps startups hire, manage, pay, and equip anyone, anywhere. Trusted by more than 35,000 fast-growing companies, Deel is the people platform that just works, so teams can scale without the chaos. Visit: deel.com/sourcery
Public-–Investing platform Public just launched Generated Assets, which lets you turn any idea into an investable index with AI. With Generated Assets, you can build, backtest, refine, and invest in any thesis with AI. Gone are the days of one-size-fits-all ETFs. Try it today: public.com/sourcery
Merge—The leading provider of customer-facing integrations and agentic tools for frontier LLMs, Fortune 500 organizations, and B2B SaaS companies. Visit: https://merge.dev
Inside Harvey’s Research Bet:
$11B, 13 Trillion Tokens, Legal Agent Benchmark
In this episode, I sat down with Harvey co-founder & President Gabe Pereyra + Head of Applied Research Niko Grupen at the company’s San Francisco HQ. Just a few weeks ago they open-sourced LAB, the Legal Agent Benchmark, the first open-source benchmark for measuring AI agent performance on real-world legal tasks.
Below is an 8-part breakdown of the LAB launch, the token economics now hitting application-layer AI, & the research DNA shaping Harvey’s next phase.
Harvey is Compounding Fast.
Now valued at $11B after a $200M round co-led by Sequoia and GIC in March, with more than $1.2B raised to date from Sequoia, Kleiner Perkins, GV, Coatue, Elad Gil, and the OpenAI Startup Fund
ARR passed $300M this month, up from $100M last August
Token usage will hit roughly 13 trillion this month, up from 1 trillion in January
The LAB launch is the first major signal that Harvey is pushing into a different phase: one where the company is publicly competing as a research org, not just an application layer. “I think for some of the labs, we are the largest consumer of embeddings,” Gabe told me. The token volume is now significant enough that Harvey is a meaningful customer to every major model provider, and a public participant in the open-source research community.
Both interviews took place in Harvey’s Fancy AF* San Francisco speakeasy. Gabe, Harvey’s co-founder and President, runs product and research. Niko, who joined Harvey nearly three years ago, runs the applied research team that built LAB.
*Yes, Harvey is a Brex customer :)
What LAB Actually Measures
LAB is Harvey’s second public benchmark, after BigLaw Bench in 2024. The key difference is scope. Gabe described BigLaw Bench as “more of a chat-based benchmark, which was kind of a QA dataset.” LAB, by contrast, tests long-horizon agentic work end-to-end and mirrors how legal work actually happens at law firms. The benchmark covers 1,200+ tasks across 24 practice areas, with 75,000+ expert-written rubric criteria.
The methodology is borrowed from coding benchmarks. “We tried to really mimic what the coding, um, agent benchmarks did,” Gabe said, pointing to SWE-bench and Terminal Bench as templates. A coding task starts with a GitHub repo, an issue, and a set of unit tests. LAB applies the same logic to legal work: a data room, a partner request, and a set of legal unit tests that grade the resulting work product.
Gabe ran through a concrete example. For a diligence task, the agent receives the full data room of contracts for a target company and a partner request, typically just a sentence or two: “Hey, we got the data room. Can you look at it?” The agent has to infer the task, generate a diligence memo, and pass tests on specific items.
One example: change of control provisions across vendor contracts, where most are routine but a small number can materially affect the deal.
“I think there’s a bit of a misconception that, oh, legal is subjective, so you can’t do this. And I think the thing for especially BigLaw is a lot of the work you can actually quantify.”
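To make the "legal unit tests" idea concrete, here is a minimal Python sketch of a LAB-style task: a data room, a partner request, and a rubric scored as the fraction of criteria passed. The class names, the keyword checks, and the sample memo are all hypothetical. The episode does not describe how LAB actually evaluates each criterion, so simple string checks stand in for the real grader.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric item: a description plus a pass/fail check on the work product."""
    description: str
    check: Callable[[str], bool]


@dataclass
class LegalTask:
    """Hypothetical task shape: what the agent sees, plus the rubric used to grade it."""
    data_room: List[str]      # document texts the agent can read
    partner_request: str      # short, underspecified instruction
    rubric: List[Criterion]


def grade(task: LegalTask, memo: str) -> float:
    """Return the fraction of rubric criteria the memo satisfies."""
    if not task.rubric:
        raise ValueError("task has no rubric criteria")
    passed = sum(1 for criterion in task.rubric if criterion.check(memo))
    return passed / len(task.rubric)


# Illustrative task; the keyword checks are a stand-in for a real grader.
task = LegalTask(
    data_room=["Vendor A agreement (text)", "Vendor B agreement (text)"],
    partner_request="Hey, we got the data room. Can you look at it?",
    rubric=[
        Criterion("Discusses change of control", lambda m: "change of control" in m.lower()),
        Criterion("Identifies the Vendor B contract", lambda m: "vendor b" in m.lower()),
        Criterion("Flags any termination right", lambda m: "terminat" in m.lower()),
    ],
)

memo = "Vendor B's agreement has a change of control clause requiring consent."
score = grade(task, memo)
print(f"Rubric score: {score:.2f}")
```

The sample memo passes two of the three checks, which is the same shape of signal a failing unit test gives a coding agent: a specific item that was missed, not a vague quality judgment.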
Why Harvey Open-Sourced the Benchmark
The decision to open-source the benchmark was not obvious. Harvey’s biggest competitive risk on paper is that the labs build the same product in-house, and giving them the rubric to compete on is a counterintuitive move. “There’s obviously some risk of, oh, what if some lab provider’s models are the best?” Gabe acknowledged. The reason Harvey did it anyway is structural.
Law firms cannot rely on a single model provider. The reason is conflict risk. “Imagine you’re using only Anthropic’s models as a law firm and you wanna represent OpenAI. OpenAI is not gonna let you send their sensitive legal data to Anthropic’s models,” Gabe said. The same logic runs in reverse for any firm representing Google, Microsoft, or any other major lab. Multi-model is a hard requirement, not a preference, and Harvey’s value proposition is the routing and orchestration layer that sits above it.
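The conflict constraint can be pictured as a filter that runs before any model is chosen. The sketch below is a simplified illustration, not Harvey's routing logic: the provider names, the client names, and the conflict table are all made up.

```python
from typing import Dict, List, Set

# Hypothetical providers in preference order; not an actual configuration.
PROVIDERS: List[str] = ["provider_a", "provider_b", "provider_c"]


def allowed_providers(client: str, conflicts: Dict[str, Set[str]]) -> List[str]:
    """Return the providers this client's data may be sent to, in preference order."""
    blocked = conflicts.get(client, set())
    return [p for p in PROVIDERS if p not in blocked]


def route(client: str, conflicts: Dict[str, Set[str]]) -> str:
    """Pick the most-preferred provider that is not conflicted out for this client."""
    allowed = allowed_providers(client, conflicts)
    if not allowed:
        raise RuntimeError(f"no conflict-free provider available for {client}")
    return allowed[0]


# Assumed example: client_x is adverse to (or competes with) provider_a.
conflicts = {"client_x": {"provider_a"}}

print(route("client_x", conflicts))  # falls through to the next provider
print(route("client_y", conflicts))  # no conflicts, gets the default
```

With only one provider in the list, the first call would have nowhere to go, which is the sense in which multi-model is a requirement and not a preference.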
Gabe’s analogy is cloud computing. “We think of the providers similar to cloud,” he said, citing Snowflake, Databricks, and Datadog as application companies that compete with the cloud they’re built on. Harvey’s strategy is to open-source the general legal capability layer and partner with every provider to push it forward, then build proprietary infrastructure for law firms to train models on their own client data inside Harvey’s product.
“We wanna open source the general stuff and work with all of the providers to make these models as good as possible at general legal, and then we want to build infrastructure for these law firms and enterprises that help them own their own models and build their own systems on their unique data.”
LAB is available on GitHub.
The Token Economics Inflection
The cost of running legal agents at production scale is now significant enough that Gabe is publicly flagging it as a category problem. A single assistant-type query at Harvey can cost $20. A contract review covering 100,000 documents can cost $20,000.
“We have queries that have simple assistant-type queries where you say, ‘Draft me a document,’ that a single query can cost $20. We have a review product where you can upload 100,000 contracts and ask the models to review them, and some of those can cost $20,000.”
Harvey’s aggregate token consumption tells the same story at a different scale. The company processed roughly 1 trillion tokens in January and is on pace for 13 trillion this month. Gabe said Harvey is the largest embeddings consumer for some of the labs. “If we kind of look at our usage plots right now... this is just the start,” he said, pointing to power users as the leading indicator for what average usage looks like as long-horizon agentic workflows go mainstream.
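A quick back-of-the-envelope pass over the figures quoted above helps put them on one scale. The inputs come from the episode; the derived numbers are simple ratios, and since the quoted costs are "can cost" upper ends, the per-contract figure is illustrative, not a price list.

```python
# Figures quoted in the episode.
review_cost_usd = 20_000          # upper-end cost of one large review job
contracts_reviewed = 100_000      # contracts uploaded in that job
assistant_query_usd = 20          # upper-end cost of one assistant-type query
tokens_january = 1_000_000_000_000
tokens_this_month = 13_000_000_000_000

# Derived ratios.
cost_per_contract = review_cost_usd / contracts_reviewed
growth_multiple = tokens_this_month / tokens_january
queries_per_review = review_cost_usd / assistant_query_usd

print(f"Cost per contract reviewed: ${cost_per_contract:.2f}")
print(f"Token growth since January: {growth_multiple:.0f}x")
print(f"One large review = {queries_per_review:.0f} top-end assistant queries")
```

That works out to about $0.20 per contract, a 13x increase in token volume since January, and one large review costing as much as 1,000 top-end assistant queries.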
The response is model routing and vertical model investment. Harvey is increasingly serving traffic from open-source models, particularly post-trained variants tuned for specific legal tasks.
“A lot of these very large frontier models are large because they’re good at everything. And so I think a lot of the opportunity for these specific verticals will be, ‘okay, I probably don’t need a trillion parameters if I just need the system to be good at diligence.’ ”
LAB is the measurement layer that makes the routing decision legible.
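One way to read "makes the routing decision legible": once every model has a benchmark score per task type, routing can be framed as picking the cheapest model that clears a quality bar. The sketch below assumes that framing. The model names, scores, and relative costs are invented for illustration and are not LAB results.

```python
from typing import Dict, Optional

# Hypothetical benchmark scores per task type (0 to 1) and relative cost per task.
SCORES: Dict[str, Dict[str, float]] = {
    "diligence": {"large_general": 0.85, "small_tuned": 0.82},
    "novel_research": {"large_general": 0.80, "small_tuned": 0.55},
}
COST: Dict[str, float] = {"large_general": 10.0, "small_tuned": 1.0}


def cheapest_passing(task_type: str, min_score: float) -> Optional[str]:
    """Return the cheapest model whose score on this task type meets the bar, if any."""
    candidates = [m for m, s in SCORES[task_type].items() if s >= min_score]
    if not candidates:
        return None
    return min(candidates, key=lambda m: COST[m])


print(cheapest_passing("diligence", 0.80))       # the small tuned model is good enough
print(cheapest_passing("novel_research", 0.80))  # only the large model clears the bar
print(cheapest_passing("novel_research", 0.90))  # nothing qualifies
```

Without a per-task score, the only safe default is the largest model for everything, which is exactly the cost problem described above.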
The Billable Hour for Tokens
The most original thesis Gabe offered was on the future of AI consumption pricing. The standard VC framing is that AI companies should price the work, not the tokens. Gabe thinks that framing will break in the same way that fixed-fee legal pricing broke decades ago.
“I don’t think people realize how expensive this is going to get, and I don’t think people realize how difficult it is going to be for customers to deal with that.”
The analogy he drew is direct. Law firms tried fixed fees and could not make them work at scale because every engagement is different. The billable hour, which breaks legal bills into 6-minute increments, exists because it lets the industry price complex work at scale in a way clients can audit.
“It lets you price incredibly complex work at massive scale in a way that the entire industry can agree on, right? Because all these law firms we’ve talked to about pricing changes... whenever we talk to them about fixed fee, they’re just like, ‘We have 10,000 clients, we can’t negotiate every engagement & price this, & everyone’s different.’”
The same dynamic, Gabe argues, is now coming for AI consumption pricing. He pointed to the Uber CTO publicly disclosing that the company burned through a year of coding tokens in three months. “All these customers are gonna start getting these consumption bills of like $10 million. And they’re gonna be like, ‘What did my agent do that cost me $10 million?’ ” The accountability mechanism customers will eventually demand looks like the legal six-minute increment: per-token visibility, with auditing and routing layers built on top. Vertical companies, in Gabe’s view, are positioned to provide that layer because they can show ROI per token per task in a way horizontal players cannot.
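The "billable hour for tokens" can be sketched as an itemized ledger: every model call is logged against a matter and a task, then rolled up into line items a customer can audit. This is a hypothetical illustration of that idea, not a description of Harvey's billing system. The matter IDs, model names, and per-million-token prices are all invented.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class UsageEntry:
    """One logged model call, the token analogue of a six-minute time entry."""
    matter: str
    task: str
    model: str
    tokens: int
    usd_per_million_tokens: float  # assumed price, for illustration only

    @property
    def cost_usd(self) -> float:
        return self.tokens / 1_000_000 * self.usd_per_million_tokens


def bill_by_task(entries: List[UsageEntry]) -> Dict[Tuple[str, str], float]:
    """Roll usage entries up into one auditable line item per (matter, task)."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for entry in entries:
        totals[(entry.matter, entry.task)] += entry.cost_usd
    return dict(totals)


ledger = [
    UsageEntry("matter-001", "diligence memo", "model_large", 2_000_000, 10.0),
    UsageEntry("matter-001", "diligence memo", "model_small", 4_000_000, 0.5),
    UsageEntry("matter-001", "contract review", "model_small", 10_000_000, 0.5),
]

invoice = bill_by_task(ledger)
for (matter, task), cost in sorted(invoice.items()):
    print(f"{matter} | {task}: ${cost:.2f}")
```

The point of the structure is the same as the time sheet: a customer asking "what did my agent do?" gets an answer broken down by matter and task, not a single consumption total.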
From Chat to Cloud Agents
The single biggest infrastructure shift inside Harvey over the last six months was the move from chat-based products to cloud agents. “This was like a huge transition we’ve gone through in like the past 6 months,” Gabe said. Harvey’s original product was a chat-based copilot for lawyers, built out with the workflow plumbing law firms needed to use it on client matters. The cloud agent shift required rebuilding much of the infrastructure underneath.
“Maybe 6 months ago when these coding models started getting really good & you could do kind of things like Claude Code + Codex, and that started working really well in the command line, we started building a bunch of the infrastructure of how do you get these agents that were running in your command line & could execute all these tools & work really well, and how to move that into the cloud.”
That work required new sandboxing infrastructure (Harvey’s internal version is called Spectre), new approaches to per-matter ethical walls, and a step-change in token consumption per query.
The user-side inflection happened in parallel. For two years, coding agents were good enough that most programmers adopted them organically. Legal had not hit that point.
“I think we’re just starting to see in the past 6 months that inflection for legal, where the models can now generate entire documents, and they’re starting to work in a way where most lawyers who aren’t using this technology just for fun, they’re like, ‘This needs to be so much better than the way that I’m used to doing things for me to change my routine to do it,’ & we’re starting to see that absorption.”
That absorption is what’s pushing Harvey’s token usage into double-digit trillions.
Gabe’s Research DNA
Gabe came to Harvey from Google Brain, DeepMind, and Meta, working on deep learning at Brain and DeepMind during the 2016-2017 stretch when the transformer architecture was being developed. The two labs ran fundamentally different research strategies.
“Brain, the approach was let’s get all these smart people, give them a bunch of compute, and then kind of let them do their own projects,” he said, noting that this is the environment Noam Shazeer & Ashish Vaswani were in when transformers were invented.
DeepMind, in contrast, was top-down. “Demis just had this vision of, ‘Okay, we’re gonna create AGI. Here’s all the things that I think are required,’ and so they kinda had this tech tree.”
Gabe says Harvey’s research approach is closer to DeepMind’s. “Because we’re in a vertical, the end goal is very clear. We kind of know here’s all the legal work that needs to get done.” The applied nature of the problem gives Harvey a tech tree of its own: every legal practice area, every sub-task, every grading rubric a partner would apply to an associate’s work product. LAB is a formalization of that tree.
Gabe’s role inside Harvey has shifted twice. When the company started, he was running research while the rest of the team was scrambling to build an enterprise SaaS business. “When we started the company, I was more focused on research than we should’ve been at that stage of the company.” Harvey then pulled back from research to build the enterprise product. With infrastructure now caught up, the research-led phase has returned. “Now I’m full-time. And a lot of it is data labeling and how do we create good data sets? How do we work with all these partners?”
Gabe said he and Winston always had a two-phase plan: a seat-based SaaS business in phase one, and a consumption-priced AI business in phase two as models matured.
Niko on the Methodology & What’s Next
Niko Grupen runs the applied research team that built LAB. The data generation methodology is novel and worth understanding in detail. Legal data is too sensitive to source from law firms directly, so Harvey’s team generated synthetic documents using coding agents and had Big Law attorneys review them. “Coding agents, these agentic systems are actually so good at generating synthetic data now that even for non-public documents like certain contract types, et cetera, they can generate a pretty good first draft,” Niko said. The applied legal research team, all former Big Law attorneys across practice areas, mapped the 24 areas, wrote the rubrics, and reviewed every output before it went into the benchmark.
The headline Niko is watching is not aggregate model performance. It’s cost-adjusted performance. “People aren’t really thinking about performance in terms of just, like, quality maxing anymore,” he said. The frame has shifted to quality per dollar and quality per second.
LAB makes those tradeoffs explicit: Claude Opus 4.7 currently leads the leaderboard, but Gemini Flash variants complete work roughly 7x faster than some of the frontier models, and post-trained open-weight models are starting to close the gap with closed frontier models at a fraction of the cost.
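The shift from "quality maxing" to quality per dollar and quality per second is easy to see with toy numbers. In the sketch below, every score, cost, and runtime is invented for illustration and none of it is a LAB result. The only ratio borrowed from the text is that the fast model finishes roughly 7x sooner than the frontier one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunStats:
    """Hypothetical per-task averages for one model on a benchmark."""
    model: str
    score: float     # rubric score between 0 and 1
    cost_usd: float  # average cost per task
    seconds: float   # average wall-clock time per task


def quality_per_dollar(run: RunStats) -> float:
    return run.score / run.cost_usd


def quality_per_second(run: RunStats) -> float:
    return run.score / run.seconds


# Made-up numbers; the 700s vs 100s gap mirrors the roughly 7x speed difference.
runs: List[RunStats] = [
    RunStats("frontier_model", score=0.80, cost_usd=20.0, seconds=700.0),
    RunStats("fast_model", score=0.70, cost_usd=4.0, seconds=100.0),
    RunStats("open_weight_tuned", score=0.75, cost_usd=2.0, seconds=300.0),
]

best_quality = max(runs, key=lambda r: r.score).model
best_per_dollar = max(runs, key=quality_per_dollar).model
best_per_second = max(runs, key=quality_per_second).model

print(f"Highest raw score:   {best_quality}")
print(f"Best score per $:    {best_per_dollar}")
print(f"Best score per sec:  {best_per_second}")
```

With these assumed inputs, each metric crowns a different model, which is why a single leaderboard rank says little about which model to run for a given workload.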
The agent harness, the scaffolding around the model that defines its tools and skills, is where Niko sees the most active research happening. “The agent harness is like the buzziest topic right now.” Harvey’s bet is that specialization beats raw capability at the application layer. “The thing that we’re seeing over and over again is that specialization matters and domain expertise matters.”
Niko’s own prediction: LAB gets saturated within a year, and the next layer of competition shifts from individual model performance to “intelligence at an organizational level,” the layer at which lawyers and multiple agents collaborate inside one institution.
LAB is open-sourced and available on GitHub.
Want more Harvey?
Harvey is HIRING
Listen to Ep 1 with Harvey CEO
→ X, Spotify, YouTube, Apple
The material presented on Molly O’Shea’s website are my opinions only and are provided for informational purposes and should not be construed as investment advice. It is not a recommendation of, or an offer to sell or solicitation of an offer to buy, any particular security, strategy, or investment product. Any analysis or discussion of investments, sectors or the market generally are based on current information, including from public sources, that I consider reliable, but I do not represent that any research or the information provided is accurate or complete, and it should not be relied on as such. My views and opinions expressed in any website content are current at the time of publication and are subject to change. Past performance is not indicative of future results.
Paid Endorsement. Brokerage services by Open to the Public Investing Inc, member FINRA & SIPC. Advisory services by Public Advisors LLC, SEC-registered adviser. Crypto trading provided by Zero Hash LLC, licensed by the NYSDFS. Generated Assets is an interactive analysis tool by Public Advisors. Output is for informational purposes only and is not an investment recommendation or advice. See disclosures at public.com/disclosures/ga. Matched funds must remain in your account for at least 5 years. Match rate and other terms are subject to change at any time.