AI Native Builders

The $2,000 Engineer: Building a Token Budget Before AI Tooling Blows Your P&L

You approved GitHub Copilot. Then Claude Code. Now the invoice is a surprise and nobody knows who spent what. Here's how to build FinOps governance for AI token spend before your CFO asks.

Governance & Adoption · Advanced · Apr 6, 2026 · 6 min read

[Illustration: a bewildered CFO holds a leaking firehose while engineers hold mismatched buckets beneath it. Everyone's drawing from the same hose. Nobody's measuring the buckets.]
$500–$2K: Monthly API costs for engineers running Claude Code as an agent, per practitioner reports[1]. The high end applies to engineers running autonomous multi-step agents against large codebases.

$150: Cost a single developer burned on a mid-size repo in 48 hours[1]. One developer, one project, two days. No budget gate. No alert.

38%: Engineering leaders spending $101–500/developer/year on AI tools; 10.5% are already over $1K[2]. These are seat-license numbers. API costs are additional and often untracked.

21%: Larger organizations with no formal AI cost-tracking system in place[2]. One in five large engineering orgs is spending without visibility.

The invoice arrives quarterly. Finance flags it. The VP of Engineering spends two days trying to reconstruct which teams, which agents, which workflows generated the spend. Nobody has the data. The CFO does not have a line item. The VP Eng does not have a dashboard.

AI token spend — the cost of LLM API calls across your engineering team — has become the engineering P&L problem that nobody was planning for when they handed out Claude Code licenses. Seat-based tools like GitHub Copilot and Cursor are the visible part: budgetable, predictable, easy to put in a spreadsheet. The API layer underneath is a different animal. It charges by token, scales non-linearly with agent autonomy, and produces no natural stopping point once an engineer discovers that running a swarm of parallel agents finishes the work in an hour.

The organizations that survived cloud bill shock in 2013 built FinOps practices: meter it, allocate it to teams, put visible budgets on business units, and build anomaly detection before the invoice arrives. That window is open again for AI token spend. The engineering teams that build the governance architecture now will have clean P&L attribution and controllable cost curves. The teams that wait will be explaining a surprise invoice every quarter.

The Window Is Open Again

AI token spend in 2026 is AWS in 2013 — the same spending patterns, the same missing governance, the same window to build FinOps practices before the bill becomes structural

In 2013, AWS billing was chaos at most engineering organizations. Individual teams spun up infrastructure without central visibility. Costs scaled with usage but the invoices arrived 30 days late. Finance had one line item: "cloud infrastructure." Nobody could tell which product, which team, or which architectural decision was responsible for a cost spike.

The response was FinOps: a discipline built around metering every resource, tagging it to a cost center, surfacing it in near-real-time, and building chargeback models so business units owned their cloud costs. By 2018, most mature engineering organizations had cost allocation tags on every AWS resource, per-team budget alerts, and anomaly detection that paged when a team's spend spiked unexpectedly.

AI token spend is in the 2013 moment. The State of FinOps 2026 report shows 98% of FinOps respondents now manage AI spend — up from 63% the previous year — and 58% have implemented showback or chargeback models for it[5]. But that's the leading edge of mature FinOps organizations. The median engineering org is still treating AI API costs as a homogeneous line item with no team-level attribution.

The FinOps Foundation's AI working group frames the problem precisely: "the unit economics of generative AI are fundamentally different from cloud infrastructure — variable by model, prompt complexity, and agent autonomy, not just by hours provisioned."[8] That variability is why the usual controls break. A monthly seat license is predictable. A per-token API cost with an autonomous agent that retries failed tasks is not.
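A back-of-envelope comparison makes that difference concrete. The sketch below uses hypothetical prices; the seat fee and per-million-token rates are placeholders for illustration, not any vendor's list prices:

```python
# Rough comparison: flat seat license vs. per-token agent usage.
# All prices are illustrative placeholders, not vendor list prices.

SEAT_LICENSE_PER_MONTH = 19.00   # hypothetical flat seat cost, USD

# Hypothetical per-million-token rates for a frontier model
INPUT_PER_MTOK = 15.00
OUTPUT_PER_MTOK = 75.00

def agent_session_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one agent session at the hypothetical rates above."""
    return (input_tokens / 1e6) * INPUT_PER_MTOK + \
           (output_tokens / 1e6) * OUTPUT_PER_MTOK

# One heavy refactoring session: the agent reads ~3M tokens of code and
# context, and emits ~400K tokens of code and reasoning
session = agent_session_cost(3_000_000, 400_000)
print(f"One agent session:        ${session:.2f}")
print(f"Seat license, full month: ${SEAT_LICENSE_PER_MONTH:.2f}")
```

One autonomous session can cost more than a month of a seat license, and there is no ceiling on how many sessions an engineer runs. That asymmetry is the whole governance problem.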

Without token governance
  • One invoice line: 'AI tooling — $XX,XXX'

  • No team-level attribution

  • No budget by use case (copilot vs. agent vs. batch job)

  • Surprises surfaced at quarterly review

  • Engineers optimize for speed, not cost

  • CFO cannot connect spend to business outcomes

  • Model selection left to individual preference

With token governance
  • Per-team, per-use-case cost attribution in real-time

  • Monthly budgets enforced at the proxy layer

  • Automated alerts when daily spend exceeds 3x baseline

  • Model routing by task complexity (Haiku for autocomplete, Opus for agents)

  • Chargeback to business unit cost centers

  • ROI dashboard correlating token spend to shipped features

  • Finance has a line item they understand and can plan

Anatomy of Runaway Token Spend

Three patterns that blow engineering AI budgets — and why seat-license thinking completely misses them

Token spend explodes in three distinct patterns, each requiring different controls.

Agentic loops are the highest-cost pattern. An engineer uses Claude Code in autonomous mode to refactor a large codebase. The agent reads hundreds of files, generates code, runs tests, interprets failures, and retries. A session that takes two hours and produces solid output might consume $50–80 in API costs — nothing alarming. But the same engineer runs three sessions per day across four simultaneous tasks, and you have $200/day or over $4,000/month from a single person[1]. Multiply by a 40-person engineering team with a new agentic workflow mandate and the monthly number becomes structural.
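The compounding in that paragraph is worth making explicit. A minimal sketch using the session figures above:

```python
# How one engineer's agentic workflow compounds into a structural line item.
# Figures mirror the illustrative numbers in the paragraph above.

SESSION_COST_USD = 65        # midpoint of the $50–80 session range
SESSIONS_PER_DAY = 3
WORKDAYS_PER_MONTH = 21
TEAM_SIZE = 40

per_engineer_day = SESSION_COST_USD * SESSIONS_PER_DAY
per_engineer_month = per_engineer_day * WORKDAYS_PER_MONTH
team_month = per_engineer_month * TEAM_SIZE

print(f"Per engineer per day:   ${per_engineer_day:,}")     # $195
print(f"Per engineer per month: ${per_engineer_month:,}")   # $4,095
print(f"40-person team, month:  ${team_month:,}")           # $163,800
```

None of the individual numbers looks alarming; the product of them does. That is why per-session review never catches this pattern and per-team metering does.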

Unoptimized model selection compounds the problem. Engineers default to the most capable model because it gives better results. Nobody is routing autocomplete calls to a lighter model when the org's API key gives access to Opus. The cost difference between Haiku and Opus for the same task can be 10–20x. When every call goes to the most expensive model regardless of the task, you are paying premium prices for work that a cheaper model handles adequately.

Tokenmaxxing is the behavioral pattern that emerges from invisible budgets. Practitioners have started using this term for engineers who maximize token consumption — running agent swarms, keeping large contexts loaded, retrying aggressively — because it's a rational career move when spend has no personal consequence[6]. If completing a task faster via heavier AI use improves your output, and there's no visible cost signal, the rational behavior is to spend as much as the task can absorb. Invisible budgets make tokenmaxxing structurally inevitable.

The Attribution Architecture

Route every LLM call through a cost-attribution proxy. This is the foundation that makes everything else possible.

The core of AI token governance is a proxy layer that sits between your engineers' tools and the model provider APIs. Every LLM call — from Claude Code, from your internal agents, from CI/CD pipelines — passes through this proxy. The proxy tags each call with team, use case, and outcome metadata, records the cost, and enforces budget limits.

LiteLLM is the most widely deployed open-source option. It implements a hierarchical multi-tenant architecture: organization → team → user → key[9]. Budgets cascade down the hierarchy, and every API call is tracked with the full attribution chain. Portkey offers a managed alternative with a governance layer built around workspaces, roles, and budget controls.

The diagram below shows the flow from engineer tooling to model provider, with cost attribution captured at the proxy layer and surfaced in a dashboard.

AI Token Cost Attribution Flow
Every LLM call passes through the attribution proxy, which tags, records, and enforces budget limits before routing to the model provider.
litellm_config.yaml
model_list:
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-opus-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-haiku
    litellm_params:
      model: anthropic/claude-haiku-4-5-20251001
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  # Track spend per team in real-time
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

general_settings:
  database_url: os.environ/DATABASE_URL
  master_key: os.environ/LITELLM_MASTER_KEY
  store_model_in_db: true

  # Default budget applied to every new team
  default_team_settings:
    max_budget: 500        # $500/month per team
    budget_duration: 30d
    tpm_limit: 2000000     # 2M tokens/minute ceiling

With the proxy running, create a virtual key per team and assign it a monthly budget. Teams embed this key in their tooling config — Claude Code, LangChain, custom agents — and every call routes through the proxy automatically[3].

The proxy records cost with the full attribution chain: which team, which user, which model, how many input and output tokens, and what metadata the caller passed. That metadata is where use case attribution lives — you tag calls with use_case: code-review or use_case: autonomous-agent to break down spend by workflow, not just by team.
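At the call site, tagging is a few extra fields on the request. A sketch of the payload shape: `build_tagged_request` is an illustrative helper, and the metadata field names follow LiteLLM's convention but should be checked against your deployed proxy version:

```python
import json

def build_tagged_request(model: str, prompt: str,
                         use_case: str, team: str) -> dict:
    """Build an OpenAI-compatible chat payload with attribution metadata."""
    return {
        "model": model,   # proxy alias, not the raw provider model ID
        "messages": [{"role": "user", "content": prompt}],
        # LiteLLM records this metadata with the spend log, so dashboards
        # can break spend down by use case as well as by team.
        "metadata": {"use_case": use_case, "team": team},
    }

body = build_tagged_request(
    "claude-sonnet",
    "Review this diff for security issues: ...",
    "code-review",
    "payments-squad",
)
# POST this to the proxy's /chat/completions endpoint with the team's
# virtual key as the Bearer token (via requests or the OpenAI SDK).
print(json.dumps(body, indent=2))
```

The discipline that matters is making the `use_case` vocabulary a shared convention, not a free-text field; attribution reports are only as clean as the tags.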

create_team_budget.sh
# Create a team with a hard monthly budget
curl -X POST 'http://your-litellm-proxy:4000/team/new' \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "team_alias": "payments-squad",
    "max_budget": 800,
    "budget_duration": "30d",
    "tpm_limit": 1000000,
    "metadata": {
      "cost_center": "ENGG-PAYMENTS",
      "team_lead": "sarah@company.com",
      "budget_owner": "vp-engineering"
    }
  }'

# Response includes the team's API key
# Engineers add this to their .env files:
# ANTHROPIC_API_KEY=sk-litellm-payments-squad-xxxx
# ANTHROPIC_BASE_URL=http://your-litellm-proxy:4000

Model Routing by Task Complexity

The biggest lever on token cost isn't budget limits — it's routing each task to the right model. A 10–20x cost difference exists between Haiku and Opus for the same request.

Budget enforcement prevents catastrophic overruns. Model routing reduces baseline spend. The two controls operate at different layers and compound.

The routing principle is straightforward: match model capability to task requirements. A code autocomplete in an IDE needs fast response and basic completion quality — Claude Haiku handles this well at a fraction of the Opus cost. A multi-step architecture review that needs deep reasoning over a complex codebase genuinely benefits from Opus. Routing every request to the same model because it's what the API key defaults to wastes money on the former and potentially shortchanges the latter.

Task Type | Recommended Model | Rationale | Approx. Relative Cost
IDE autocomplete / inline suggestion | Claude Haiku | Speed matters more than depth; context is small | 1x baseline
Code explanation / docstring generation | Claude Haiku | Well-defined, bounded task; little ambiguity | 1x baseline
Code review — single PR | Claude Sonnet | Needs judgment on patterns, security, style | ~5x baseline
Test generation for existing function | Claude Sonnet | Moderate complexity; clear success criteria | ~5x baseline
Multi-file refactor with dependencies | Claude Sonnet | Context-heavy but structured; Sonnet sufficient | ~5x baseline
Architecture review / system design | Claude Opus | Requires deep reasoning over ambiguous tradeoffs | ~20x baseline
Autonomous multi-step agent (planning loop) | Claude Opus | Agent orchestration quality significantly affects outcome | ~20x baseline
Batch summarization / classification jobs | Claude Haiku | High volume, low complexity; cost savings compound | 1x baseline
routing_config.yaml
# LiteLLM router: task-complexity routing via metadata tags
router_settings:
  routing_strategy: usage-based-routing
  
  # Fallback chain if primary model is unavailable
  fallbacks:
    - {"claude-opus": ["claude-sonnet"]}
    - {"claude-sonnet": ["claude-haiku"]}

# Engineers tag calls with use_case metadata
# The proxy reads the tag and enforces model selection:
#
# client.messages.create(
#   model="claude-sonnet",  # requested model
#   metadata={"use_case": "code-review", "team": "payments-squad"},
#   ...
# )
#
# To enforce routing rules, add a LiteLLM hook:

litellm_settings:
  # Register the hook below; the value is a module path to a CustomLogger instance
  callbacks: routing_hook.routing_hook_instance

# routing_hook.py — override model based on use_case tag
# from litellm.integrations.custom_logger import CustomLogger
# class RoutingHook(CustomLogger):
#   async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
#     use_case = data.get("metadata", {}).get("use_case", "")
#     if use_case in ("autocomplete", "inline-suggestion", "batch-classify"):
#       data["model"] = "claude-haiku"
#     elif use_case in ("code-review", "test-generation", "refactor"):
#       data["model"] = "claude-sonnet"
#     return data
#
# routing_hook_instance = RoutingHook()

Per-Team Budget Enforcement in Four Steps

The governance architecture is straightforward to stand up once the proxy is running. The organizational work of assigning budgets is harder than the technical implementation.

  1. Inventory all LLM call sources

    Before setting budgets, you need a complete map of where API calls originate. Engineering tools (Claude Code, Cursor, Copilot), internal agents, CI/CD pipelines, and application-layer features all need to route through the proxy. Any uncovered call source is a budget hole.

  2. Define your team and use-case taxonomy

    The attribution schema you choose now determines what questions you can answer in six months. Team attribution is the minimum viable structure. Use-case attribution (copilot vs. agent vs. batch) lets you have more nuanced conversations about value per dollar.

  3. Set budgets based on 60 days of baseline data

    Do not set budgets before you have baseline data. Run the proxy in observation mode for two months — log everything, enforce nothing. The baseline tells you what 'normal' spend looks like. Budget limits should be set at roughly 130–150% of the 60-day baseline, not at zero from a guess.

  4. Build the finance reporting export

    The technical governance is for the VP Engineering and platform team. The finance report is for the CFO and business unit leads. These are different audiences requiring different outputs. The proxy database has everything you need — the work is building the export and scheduling it.
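As a sketch of that export logic: LiteLLM records one spend row per request, so the finance roll-up is a small aggregation. The rows below are illustrative stand-ins for what you would query from the proxy database, and the field names are assumptions to verify against your deployed schema:

```python
from collections import defaultdict

def monthly_finance_report(spend_rows: list) -> dict:
    """Roll per-request spend logs up into per-cost-center monthly totals.

    In production, spend_rows would come from the proxy database (LiteLLM
    stores one row per request with its cost); here they are plain dicts.
    """
    report = defaultdict(float)
    for row in spend_rows:
        cost_center = row["metadata"].get("cost_center", "UNATTRIBUTED")
        report[cost_center] += row["spend"]
    return dict(report)

# Illustrative rows — in production, query the proxy DB for the billing month
rows = [
    {"spend": 12.40, "metadata": {"cost_center": "ENGG-PAYMENTS"}},
    {"spend": 3.10,  "metadata": {"cost_center": "ENGG-PAYMENTS"}},
    {"spend": 7.75,  "metadata": {"cost_center": "ENGG-PLATFORM"}},
    {"spend": 0.80,  "metadata": {}},  # untagged call: a coverage gap to chase
]
print(monthly_finance_report(rows))
```

The `UNATTRIBUTED` bucket is a feature, not a bug: its size is a direct measure of how much of your API footprint is still bypassing the tagging convention.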

Building the Anomaly Detection Layer

A budget limit stops runaway spend. Anomaly detection surfaces the cause before the limit is hit — and gives you time to investigate rather than just block.

Budget enforcement is a hard stop. Anomaly detection is an early warning system. You want both, because a hard stop at month-end budget tells you nothing about which workflow caused the spike, whereas an anomaly alert at 3x daily baseline gives you a live signal to investigate.

The anomaly detection model is simple: calculate the rolling 7-day average spend per team, compare today's spend to that average by 4 PM, and page the team lead when the ratio exceeds your threshold. A 3x spike on a Tuesday afternoon almost always has a specific cause — a new agent workflow a team shipped, a CI pipeline accidentally running agents on every push, or a developer manually running a large batch job.
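The ratio logic is small enough to state in code. A minimal sketch; in production the spend figures would come from the proxy's metrics endpoint rather than function arguments:

```python
def spend_spike(today_spend: float, trailing_7d_total: float,
                threshold: float = 3.0) -> bool:
    """True when today's spend exceeds `threshold` x the 7-day daily average."""
    daily_avg = trailing_7d_total / 7
    if daily_avg == 0:
        # New team or brand-new workflow: any spend is worth a look
        return today_spend > 0
    return today_spend / daily_avg > threshold

# Team averaged $40/day over the last week; today they're at $150 by 4 PM
assert spend_spike(150.0, 280.0)       # 150 / 40 = 3.75x → alert
assert not spend_spike(90.0, 280.0)    # 90 / 40 = 2.25x → fine
```

The alert payload matters as much as the trigger: include the model and use-case breakdown so the team lead can see at a glance whether the spike is a new agent workflow or a runaway CI job.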

LiteLLM exports spend data via its API and Prometheus metrics endpoint. Grafana with a simple Prometheus data source is sufficient for the alerting layer — no need for Datadog unless you already have it. The diagram below shows the flow from spend data to alert.

Anomaly Detection Alert Flow
Daily spend is compared to the rolling 7-day average at 4 PM. A 3x spike triggers a Slack alert to the team lead with model and use-case breakdown.
grafana_alert.yaml
# Grafana alerting rule — token spend spike detection
# Fires when a team's same-day spend exceeds 3x their 7-day daily average
apiVersion: 1
groups:
  - orgId: 1
    name: ai-token-governance
    folder: engineering-costs
    interval: 1h
    rules:
      - uid: token-spike-alert
        title: AI Token Spend Spike
        condition: D
        data:
          - refId: A
            # Today's cumulative spend per team
            queryType: range
            relativeTimeRange:
              from: 86400
              to: 0
            model:
              expr: sum(litellm_spend_usd_total) by (team_alias)
          - refId: B
            # 7-day rolling daily average per team
            queryType: range
            relativeTimeRange:
              from: 604800
              to: 86400
            model:
              expr: sum(litellm_spend_usd_total) by (team_alias) / 7
          - refId: C
            queryType: expression
            model:
              type: math
              expression: $A / $B   # ratio of today vs daily avg
          - refId: D
            # Threshold on the ratio — fires above 3x daily baseline
            queryType: expression
            model:
              type: threshold
              expression: C
              conditions:
                - evaluator:
                    type: gt
                    params: [3]
        noDataState: NoData
        for: 30m
        annotations:
          summary: >-
            Token spike: {{ $labels.team_alias }} is at
            {{ $values.C | printf "%.1f" }}x daily average
        labels:
          severity: warning

The Four Numbers That Make Finance Stop Asking Questions

Finance does not need to understand tokens. They need a cost-per-outcome number and a trend line they can put in a board deck.

Cost per merged PR
Total AI token spend ÷ merged pull requests per team per month. Connects spend to shipped work.
Spend trend vs. output trend
Is cost-per-PR going up or down as the team scales AI use? Efficiency improvement is the ROI story.
Budget utilization by team
Which teams are at 40% of budget, which are at 95%? Budget allocation accuracy is the governance story.
Model mix by use case
What percentage of spend goes to each model tier? A high Opus % on autocomplete tasks is a routing problem, not a budget problem.

The critical insight for the CFO conversation: cost per merged PR is the number that makes AI spend legible to finance. If a team spent $3,200 last month on AI APIs and shipped 85 pull requests, their cost per PR is $37.65. If a comparable team spending $1,100 shipped 20 pull requests, their cost per PR is $55. The first team is getting more value per dollar, even though their absolute spend is higher. That framing converts the conversation from "AI is expensive" to "here's the ROI on your AI investment."
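That arithmetic is trivial to automate once the proxy gives you per-team spend. A sketch using the figures from the example above:

```python
def cost_per_pr(token_spend_usd: float, merged_prs: int) -> float:
    """The one AI-spend number finance can act on: dollars per shipped PR."""
    return token_spend_usd / merged_prs

team_a = cost_per_pr(3200, 85)   # higher absolute spend...
team_b = cost_per_pr(1100, 20)

print(f"Team A: ${team_a:.2f}/PR")   # $37.65
print(f"Team B: ${team_b:.2f}/PR")   # $55.00
# ...but Team A gets more shipped work per AI dollar.
```

The spend side comes straight from the attribution proxy; the PR counts come from your source-control API. Joining the two per team per month is the entire report.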

Start building this report before finance asks for it. The engineering org that walks into the quarterly business review with a cost-per-outcome dashboard owns the narrative. The engineering org that gets asked for this data scrambles for three days and produces a number nobody trusts.

The DX 2026 survey found that 86% of engineering leaders feel uncertain about which AI tools provide the most benefit, and 40% lack sufficient data to demonstrate ROI[2]. The token governance architecture solves both problems simultaneously — it gives you the cost data and, when correlated with output metrics, the ROI data.

Questions VPs and CFOs Ask

The objections that come up in every token governance conversation

Won't hard budget limits block engineers at critical moments?

Soft limits — alerts at 80% of budget — handle this. Hard limits at 150% are a circuit breaker for runaway processes, not a daily constraint for normal work. In practice, the engineers most likely to hit a hard limit are running unbounded agent loops that a soft limit would have caught hours earlier. Design the limit tiers correctly: green (0–80%), amber alert (80–110%), hard stop (150%). Almost all legitimate engineering work stays in the green.

What if teams need to run large batch jobs that temporarily spike spend?

Budget exemptions work exactly like cloud FinOps exemptions: the team lead requests a temporary limit increase for a specific time window and use case, the VP Engineering approves it, the proxy gets updated. This is one API call to LiteLLM. The process creates an audit trail — who requested it, why, what they ran — which is itself valuable data for future budget planning. The alternative (no limits, unlimited spend) is what you have now.
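A sketch of that exemption call. `/team/update` is LiteLLM's team-management endpoint; the proxy URL, amounts, and metadata field names here are placeholders for your own process:

```shell
# Temporary budget exemption: raise a team's cap for an approved batch-job
# window. Record the who/why in metadata so the audit trail lives with the
# budget change. Remember to restore the original cap when the window closes.
PROXY="http://your-litellm-proxy:4000"

curl -X POST "$PROXY/team/update" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "team_id": "payments-squad",
    "max_budget": 2000,
    "metadata": {
      "exemption_reason": "backfill classification batch job",
      "approved_by": "vp-engineering",
      "expires": "2026-04-20"
    }
  }'
```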

Do we need all this infrastructure if we only use seat-licensed tools?

If every engineer is on a fixed-seat GitHub Copilot plan and nothing touches the API layer, the seat license is a predictable line item and you don't need token governance. The moment any team starts using Claude Code, LangChain agents, or calling Anthropic/OpenAI APIs directly, you need the attribution proxy. The trend is strongly toward API-layer access as engineers move from copilots to autonomous agents. Build the proxy before the API footprint grows, not after.

How do we handle the cultural pushback from engineers who feel monitored?

Frame it correctly from the start: individual-level spend dashboards are for engineers' benefit — they can see if they're on track for the month before a hard limit hits. Team-level dashboards are for planning, not surveillance. The goal is not to catch engineers doing something wrong; it's to give every team a predictable budget they can plan around. Engineers working at well-run organizations with clear cloud budgets don't feel surveilled by CloudWatch cost alerts. Token budgets are no different when the framing is right.

We're already using Datadog for observability — do we need Grafana too?

No. LiteLLM exports spend data via Prometheus metrics, which Datadog can ingest directly through its Prometheus integration. Build your token spend dashboards in Datadog alongside your existing observability. The Grafana reference in the architecture is illustrative — any metrics visualization layer works. The proxy is the critical component, not the dashboard tool.

Token Governance Implementation Checklist

  • Inventoried all LLM API call sources across engineering tooling, agents, and CI/CD

  • Deployed LiteLLM (or Portkey) proxy with team-key authentication

  • Ran 60-day baseline observation before setting hard budget limits

  • Created team virtual keys with monthly budgets and cost center metadata

  • Implemented use_case tagging convention across all LLM call sites

  • Configured model routing hook (Haiku for autocomplete, Sonnet for reviews, Opus for agents)

  • Built Grafana/Datadog alert for 3x daily spend spike per team

  • Built weekly cost-per-PR report for finance and business unit leads

  • Defined budget exemption request process for legitimate spike workloads

  • Shared per-engineer spend dashboard so engineers have personal visibility

Hard Rules for AI Token Governance

Never set budget limits before collecting 60 days of baseline data

A budget set from intuition will either block legitimate work or provide no real constraint. Baseline data gives you the right number. Two months of observation costs almost nothing; a misset limit that blocks a critical deployment at 4 AM costs trust and incident time.

Route every LLM call through the attribution proxy — no exceptions

One team with a hardcoded API key bypassing the proxy breaks the entire attribution model. You cannot have partial coverage; a spend spike from an untracked source is indistinguishable from a tracking failure. Treat unproxied API keys the same way you treat unapproved cloud credentials.

Cost-per-PR is the CFO metric — not total spend, not tokens consumed

Total spend increases as the team does more. That's not a problem — it's growth. The metric that matters to finance is efficiency: are you getting more output per dollar over time? Cost-per-PR captures this. Present only this metric in QBRs unless finance asks for more detail.

Model routing is a governance control, not an engineering suggestion

If model selection is left to individual preference, engineers default to the best model for every task regardless of cost. Routing rules embedded in the proxy are non-negotiable guardrails. Document the routing policy, explain the reasoning, and enforce it programmatically. A policy doc nobody reads does not change model selection behavior.

Key terms in this piece
AI token budget · LLM cost governance · token spend attribution · AI FinOps engineering · LiteLLM team budgets · AI tooling P&L
Sources
  [1] Engineers Should Spend $250K on AI Tokens — Mid-Size Repo Hit $150 in 48 Hours (Medium, 2026) (medium.com)
  [2] How Are Engineering Leaders Approaching 2026 AI Tooling Budgets? (DX, 2026) (getdx.com)
  [3] Setting Team Budgets — LiteLLM Documentation (docs.litellm.ai)
  [4] Spend Tracking — LiteLLM Documentation (docs.litellm.ai)
  [5] State of FinOps 2026 Report — FinOps Foundation (data.finops.org)
  [6] Tokenmaxxing: The Costly Mistake in AI Engineering Metrics (2026) (itsmeduncan.com)
  [7] How Token-Based AI Coding Tools Impact Engineering Budgets (Exceeds AI) (blog.exceeds.ai)
  [8] FinOps for AI Overview — FinOps Foundation (finops.org)
  [9] Multi-Tenant Architecture with LiteLLM — LiteLLM Documentation (docs.litellm.ai)