
Cost Anatomy of 1,127 Agent Runs: Where the Money Actually Goes

Everyone has opinions about AI agent costs. We wanted data.

We built 5 agent workflows, instrumented them with AgentMeter, and ran them 1,127 times across Claude, GPT-4o, and Gemini in a controlled test environment. We tracked every LLM call, every tool invocation, every retry, every token. These aren't production logs — they're repeatable benchmarks with realistic prompts, real API calls, and real tool integrations. The aggregated findings are below; details on the raw dataset of all 1,127 runs are at the bottom.

The headline numbers

| Metric | Value |
| --- | --- |
| Runs tracked | 1,127 |
| Total spend | $4,281.39 |
| Mean cost per run | $3.80 |
| Median (p50) cost | $1.22 |
| p95 cost | $22.14 |
| p99 cost | $61.87 |
| Most expensive single run | $187.43 |

That p95/p50 ratio — 18x — is the number that matters. It means your "average cost per task" is a lie. The long tail eats your budget.
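To make the "average is a lie" point concrete, here is a minimal sketch of how a long-tailed cost distribution pulls the mean far above the median. The numbers are hypothetical, not the dataset above, and `percentile` is a simple nearest-rank helper, not an AgentMeter function:

```typescript
// Nearest-rank percentile over an ascending-sorted array of per-run costs.
function percentile(sortedCosts: number[], p: number): number {
  const idx = Math.min(
    sortedCosts.length - 1,
    Math.floor((p / 100) * sortedCosts.length),
  );
  return sortedCosts[idx];
}

// Hypothetical costs: most runs are cheap, two are in the tail.
const costs = [0.4, 0.5, 0.6, 0.8, 1.0, 1.2, 1.5, 2.0, 22.0, 62.0];
const mean = costs.reduce((sum, c) => sum + c, 0) / costs.length;
const p50 = percentile(costs, 50); // 1.2 — what a "typical" run costs
const p95 = percentile(costs, 95); // 62.0 — the tail that eats the budget
```

Here the mean ($9.20) is nearly 8x the median, driven entirely by two tail runs. Budgeting per task off the mean would be wrong in both directions.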

The five workflows

We chose workflows that represent real agent use cases — tasks with high variance, unstructured inputs, and multi-step reasoning where deterministic code can't substitute for an LLM. Nobody needs an agent to parse a log with a regex. These are workflows where they're actually necessary:

| Workflow | Description | Runs | LLM calls/run (median) | Tools/run (median) |
| --- | --- | --- | --- | --- |
| Support resolution | Classify, research KB, draft reply, refine, send | 312 | 5 | 3 |
| Code review | Read PR diff, analyze, comment per file, summarize | 243 | 12 | 6 |
| Research report | Multi-source web research, synthesize, cite | 198 | 18 | 11 |
| Data pipeline debug | Read logs, hypothesize, query DB, fix, verify | 187 | 15 | 8 |
| Content generation | Outline, draft, self-critique, revise, format | 187 | 8 | 2 |

Finding 1: The median is boring. The tail is where money burns.

Here's the cost distribution for each workflow:

| Workflow | p50 | p75 | p95 | p95/p50 ratio |
| --- | --- | --- | --- | --- |
| Support resolution | $0.48 | $1.10 | $4.87 | 10x |
| Code review | $1.85 | $3.20 | $18.90 | 10x |
| Research report | $3.40 | $8.70 | $42.60 | 13x |
| Data pipeline debug | $2.10 | $5.80 | $31.40 | 15x |
| Content generation | $0.62 | $1.15 | $3.80 | 6x |

Content generation has the lowest variance (6x) because the execution path is mostly fixed — always the same number of steps. Research and pipeline debugging have the highest variance (13-15x) because the agent decides how many sources to check or how many hypotheses to test.

The takeaway: Workflows with open-ended tool loops produce the fattest cost tails. If your agent can decide "I need to search one more time," it will — and sometimes it'll search 40 times.
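One cheap defense against that tail is a hard iteration cap on the loop itself. Below is a minimal sketch; the function names (`search`, `nextQuery`) and the cap value are hypothetical, not part of any SDK:

```typescript
// Hard cap on an open-ended search loop. Without it, "one more search"
// can repeat indefinitely; with it, the run is bounded no matter what
// the model decides.
const MAX_SEARCHES = 8;

function researchLoop(
  search: (query: string) => string,
  nextQuery: (found: string[]) => string | null, // null = agent says "enough"
): string[] {
  const found: string[] = [];
  for (let i = 0; i < MAX_SEARCHES; i++) {
    const query = nextQuery(found);
    if (query === null) break; // natural stopping condition
    found.push(search(query));
  }
  return found; // bounded even if the agent never volunteers to stop
}
```

The cap does nothing on well-behaved runs (the agent stops on its own) and only bites on the runs that would otherwise land in the p95+ bucket.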

Finding 2: Context accumulation is 52% of total spend

We tagged every token to track where money goes. Across all 1,127 runs:

| Cost category | % of total spend | $ |
| --- | --- | --- |
| Context re-reads | 52.1% | $2,230 |
| New input tokens (instructions + tool results) | 21.3% | $912 |
| Output tokens | 15.8% | $677 |
| Tool call fees (MCP + APIs) | 7.4% | $317 |
| Retries (wasted spend) | 3.4% | $145 |

More than half of every dollar went to the LLM re-reading context it had already seen. This is the quadratic cost curve in action: step 1 sends 2K tokens, step 5 sends 14K tokens, step 10 sends 30K+ tokens. Each step re-processes everything before it.

Even with Anthropic's prompt caching (90% discount on cache hits), cache reads were the single largest line item. Cheaper per token, yes — but the volume is enormous. In our research workflow, the median run processed 847K total input tokens across all calls. Of those, 680K were cached re-reads.

Finding 3: The most expensive step isn't the one you'd guess

For each workflow, we ranked steps by total cost contribution:

Support resolution — most expensive step: "Refine response" (step 4), not "Generate draft" (step 3)

| Step | Avg cost | % of workflow |
| --- | --- | --- |
| Classify ticket | $0.003 | 0.6% |
| Search KB | $0.08 | 17% |
| Generate draft | $0.12 | 25% |
| Refine response | $0.18 | 38% |
| Summarize & send | $0.09 | 19% |

The refinement step is expensive because it receives the largest context (everything before it, including the full draft) and produces a substantial output. It's the step most teams would skip in optimization — "it's just a polish pass" — but it's the largest cost center.

Research report — most expensive step: "Source evaluation" (step 3 of N), not the final synthesis

The research workflow dynamically decides how many sources to check. In expensive runs (p95+), the agent evaluated 12-15 sources before synthesizing. Each evaluation was cheap ($0.40-0.80), but the loop ran enough times that this step collectively accounted for 64% of total workflow cost in the p95 bucket.

Finding 4: Tool costs are small but unpredictable

Tool and API fees were only 7.4% of total spend on average. But the distribution is bimodal:

  • In 73% of runs: Tool costs were under 5% of total — trivial.
  • In 8% of runs: Tool costs exceeded 30% of total — dominant.

What causes the spike? Retry cascades. When a tool call fails (timeout, rate limit, malformed response), the agent retries — but each retry adds context tokens (the error message, the retry prompt), making the next LLM call more expensive and triggering another tool call. A $0.02 tool call that fails 3 times doesn't cost $0.08. It costs $0.08 in tool fees plus $0.40+ in accumulated LLM context from the error-handling loop.
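The compounding is easy to model. The sketch below assumes illustrative prices (a $0.02 tool fee, $3 per million input tokens, ~1,200 tokens of error context added per attempt); none of these are the dataset's actual rates:

```typescript
// Why a failed tool call costs more than its fee: each retry appends the
// error message and retry prompt to context, so every subsequent LLM call
// re-reads a larger prompt. All constants are hypothetical.
const TOOL_FEE_USD = 0.02;
const INPUT_PRICE_USD = 3 / 1_000_000; // per input token (assumed)
const ERROR_TOKENS = 1200;             // error + retry instructions per attempt

function retryCascadeCost(attempts: number, baseContextTokens: number): number {
  let context = baseContextTokens;
  let cost = 0;
  for (let i = 0; i < attempts; i++) {
    cost += TOOL_FEE_USD;               // the tool fee itself
    cost += context * INPUT_PRICE_USD;  // the LLM call driving this attempt
    context += ERROR_TOKENS;            // the next attempt carries the error too
  }
  return cost;
}
```

With a 30K-token base context, four attempts cost about $0.46 total, of which only $0.08 is tool fees — the same shape as the $0.02-call example above.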

The 12 most expensive runs in our dataset (all over $50) shared one pattern: a tool call failure that triggered a retry loop of 5+ attempts before the agent gave up. Retry handling is the #1 predictor of cost blowups.

Finding 5: Model choice is a 12x lever — but only at the workflow level

We ran the same support resolution workflow across three models:

| Model | Median cost | p95 cost | Success rate |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | $0.48 | $4.87 | 94% |
| GPT-4o | $0.31 | $3.12 | 91% |
| Gemini 2.5 Flash | $0.04 | $0.38 | 87% |

Gemini Flash is 12x cheaper at the median and 13x cheaper at p95. But the success rate drops from 94% to 87%, and those extra failures still cost money — they often trigger retries or human escalation, which costs more than the original task.

The real optimization isn't "use the cheapest model everywhere." It's "use the right model per step." Our best-performing configuration used Gemini Flash for classification and summarization, Sonnet for analysis and generation. Total cost: $0.19 median — 60% cheaper than all-Sonnet, with a 93% success rate.
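Per-step routing reduces to a lookup from step type to model. A minimal sketch of that idea — the step taxonomy and model identifier strings are assumptions for illustration, not AgentMeter configuration:

```typescript
// Route each step type to the cheapest model that handles it well.
// Model id strings are illustrative placeholders.
type StepKind = "classify" | "summarize" | "analyze" | "generate";

const ROUTES: Record<StepKind, string> = {
  classify: "gemini-2.5-flash",  // cheap model for low-stakes steps
  summarize: "gemini-2.5-flash",
  analyze: "claude-sonnet-4-6",  // frontier model where quality matters
  generate: "claude-sonnet-4-6",
};

function modelFor(step: StepKind): string {
  return ROUTES[step];
}
```

The point of making the routing table explicit is that it becomes auditable: when per-step cost data shifts, you change one mapping rather than hunting through prompts.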

Finding 6: Three runs accounted for 8.3% of total spend

This is the long-tail problem in practice. Out of 1,127 runs, three individual runs cost $187.43, $94.21, and $72.56 — totaling $354.20, or 8.3% of our entire dataset's spend.

All three were research workflows that entered open-ended exploration loops. The $187.43 run made 67 LLM calls and 43 tool invocations over the course of evaluating 23 sources. The agent kept finding "one more relevant source" and never hit a stopping condition.

Without per-run cost tracking, these would be invisible — buried in an API dashboard showing "total spend this week: $X."

The five highest-ROI optimizations from our data

Ranked by impact based on 1,127 runs:

1. Set per-run budget caps. The single highest-impact change. A $10 cap on our research workflow would have saved $354.20 (the three runaway runs) with zero impact on the other 1,124 runs. This isn't optimization — it's insurance. Most agent frameworks support this — a hard dollar limit per run that throws an exception when exceeded.

2. Manage context aggressively. 52% of spend was context re-reads. Summarize intermediate results instead of passing raw context forward. Reset the conversation when switching phases. Use sub-agents for exploratory loops so the main context stays lean.

3. Route models per step, not per workflow. The 60% cost reduction from mixed-model routing (Finding 5) is the easiest optimization that doesn't sacrifice quality. Classification and summarization don't need frontier models.

4. Cap tool retries at 2. Every retry cascade in our data that exceeded 3 attempts eventually failed anyway. The agent rarely "figures it out" on retry 4. Fail fast, escalate, and save the compounding context costs.

5. Track per-task costs, not monthly aggregates. The $187 run was invisible in our weekly spend total. Per-task attribution is how you find these — and how you know if your optimizations are actually working.
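Optimizations 1 and 4 are small amounts of code. The sketch below shows one way to enforce a per-run dollar cap and a retry cap — an illustrative pattern, not the AgentMeter API:

```typescript
// Per-run budget guard: every cost is charged against a hard cap,
// and the run aborts the moment the cap is exceeded.
class RunBudget {
  private spentUsd = 0;
  constructor(private readonly capUsd: number) {}

  charge(usd: number): void {
    this.spentUsd += usd;
    if (this.spentUsd > this.capUsd) {
      throw new Error(
        `Run budget exceeded: $${this.spentUsd.toFixed(2)} > $${this.capUsd}`,
      );
    }
  }
}

// Retry guard: fail fast after maxRetries instead of compounding context.
function withRetries<T>(fn: () => T, maxRetries = 2): T {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastError = err; // record and retry (or give up after the last attempt)
    }
  }
  throw lastError;
}
```

Wiring `charge()` into every LLM and tool call turns the $10 cap from Finding 6 into an exception instead of a $187 surprise, and `maxRetries = 2` matches the observation that cascades past 3 attempts didn't recover.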

The raw data

We're publishing the anonymized cost data from these 1,127 runs. Every run includes: workflow name, model, step-by-step cost breakdown, token counts, tool calls, retry count, success/failure, total cost, and duration.

We're finalizing the dataset for release.


These numbers come from our own testing environment using AgentMeter. Your costs will vary based on your models, prompts, tools, and architecture. But the patterns — the quadratic context growth, the 10-18x p95/p50 ratio, the retry cascade problem — are structural. They'll show up in any agentic workflow that uses tool loops and multi-step reasoning.

If you want to see these numbers for your own agents: AgentMeter is an open-source SDK that tracks per-task costs across LLM calls, tool fees, and API charges. npm install @grislabs/agentmeter and you'll have this data in minutes.


All costs calculated from provider pricing as of March 2026. Runs executed between March 15-24, 2026. Full methodology and dataset available on request.

Track your AI agent costs

Join the waitlist for early access to the hosted dashboard.