An applied AI research & product lab

Agents that finish what they start.

Bombadil Labs is an independent lab working on long-horizon autonomy — systems that hold up over thousands of steps, and the instruments that prove it. We publish the math, build the products, and take a small number of serious engagements.

See the work

Read the latest field note

Trusted by teams at

AWS, Microsoft

What we do

Three practices, one thesis: long horizons are won, not waited for.

Research

Long-horizon reliability is arithmetic before it's engineering. We work out the math of agents that hold up — horizons, verifiers, failure economics — and publish the instruments next to the claims.

Agentic R&D

Warden, Loremaster, Mirrormere, Goldberry: products built end to end in the lab, run on our own harnesses, pointed at problems where autonomy compounds instead of merely impressing.

Consulting

A few embedded engagements at a time, for teams whose agentic systems have to work. Architecture, evaluation, and the judgment calls between them.

In the workshop

What we're building.

Four systems in motion, each a bet on the same thesis: the next order of magnitude in software comes from agents you can leave alone.

in development

Warden

The codebase custodian

An autonomous custodian for production repositories. Warden triages incoming issues, reproduces bugs in sandboxed environments, writes the regression test first, then opens evidence-backed pull requests — while your team sleeps.

Read the brief

research preview

Loremaster

A living review of everything you cannot afford to miss

An agentic research engine that reads the literature continuously, maintains living syntheses with claim-level provenance, and flags contradictions the moment they appear — so your team's understanding never goes stale.

Read the brief

design partners

Mirrormere

See the reliability you actually have

Every agent team is running a verifier — they just don't know its catch rate. Mirrormere replays production traces against independent checkers, measures the failures slipping through unseen, and prices what fixing that is worth in months of frontier progress.

Read the brief

research preview

Goldberry

Proof of conduct for economic agents

Two hundred thousand agents registered onchain this year; almost none of them can prove they do what they claim. Goldberry is the missing layer — policy-scoped execution, staked re-execution, and settlement that clears only when conduct does.

Read the brief

How we work

House rules, written in ink.

01
Long horizons or nothing. Demos sprint; value runs marathons. We build for the thousandth step, where the interesting failures and the interesting products both live.
02
Every claim ships with its instrument. If you can't check us, we haven't finished. The math in our field notes runs live on the page, against the same code we unit-test.
03
Buy nines where they're cheap. Verifier nines cost an engineering sprint; actor nines cost a training run. Knowing which store to shop in is half the discipline.
04
Small, sharp tools. Capability is subtraction done well. The best systems carry the fewest moving parts that still clear the bar.
05
Leave things better. Codebases, datasets, teams: we hand back the keys in better shape than we found them.

Field notes

Dispatches from the lab notebook.

What we learn, written down while it's still sharp — with the instruments embedded, so you can check the math as you read.

June 12, 2026

The Verifier's Dividend

A checker that catches 90% of an agent's mistakes buys the same task horizon as ten months of frontier model progress — for under 1% extra compute. Four instruments and one falsifiable claim.
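A back-of-envelope sketch of how a claim of this shape can arise, not the field note's own instruments. It assumes independent per-step failures, a verifier whose every catch gets repaired, and a 1% base error rate; all function names and numbers below are illustrative. Under those assumptions a 90% catch rate cuts the effective error rate tenfold, which stretches the 50%-success horizon about tenfold. Equating that gain with ten months of progress would imply horizons doubling roughly every three months, which is an inference from this toy model rather than a figure stated in the note.

```python
import math


def horizon_steps(per_step_error: float, target_success: float = 0.5) -> float:
    """Number of steps before the chance of a clean run falls to target_success.

    Assumption: each step fails independently with the same probability.
    """
    return math.log(target_success) / math.log(1.0 - per_step_error)


def error_with_verifier(per_step_error: float, catch_rate: float) -> float:
    """Effective per-step error when a checker catches a fraction of mistakes.

    Assumption: every caught mistake is repaired at no cost to the run.
    """
    return per_step_error * (1.0 - catch_rate)


def horizon_gain(per_step_error: float, catch_rate: float) -> float:
    """Ratio of the verified horizon to the unverified horizon."""
    verified = horizon_steps(error_with_verifier(per_step_error, catch_rate))
    return verified / horizon_steps(per_step_error)


def implied_doubling_months(gain: float, months_of_progress: float) -> float:
    """Doubling time that would make `gain` equal to `months_of_progress` of growth.

    Assumption: horizons grow exponentially with a fixed doubling time.
    """
    return months_of_progress / math.log2(gain)


# Hypothetical inputs: 1% base error per step, a checker that catches 90%.
gain = horizon_gain(per_step_error=0.01, catch_rate=0.90)
doubling = implied_doubling_months(gain, months_of_progress=10.0)
print(f"horizon gain: {gain:.2f}x, implied doubling time: {doubling:.2f} months")
```

The sketch leaves out the compute cost of running the checker, so it says nothing about the "under 1% extra compute" half of the claim.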

Building something with a long horizon?

Tell us what you want to leave running unattended. We read everything that isn't a template, and we reply to everything we read.