researchcostbudgetproductionpatterns

Cost protection patterns для LLM-агентов — 2026 best practices

Real-world incidents driving the patterns

$500 charge на одном call — context-exploding query × 200 cycles
$400 в одном runaway workflow — supervisor loop без circuit breaker / finish condition
$6,531 AWS bill — AI agent сканировал hobby network в 2026
Agentic AI burns ~50× больше токенов чем chat workloads — cost-runaway protection теперь table-stakes

Источники: https://www.nexgismo.com/blog/ai-agent-budget-guards-stop-runaway-api-costs, https://leanopstech.com/blog/agentic-ai-cost-runaway-token-budget-2026/

LiteLLM budget features (2026)

LiteLLM Proxy стал de facto стандартом middleware для budget enforcement:

Per-key, per-team, per-team-member, per-user budgets — через virtual keys
Multiple concurrent budget windows (e.g. $10/day AND $100/month — отдельно)
Per-model budgets на keys (Enterprise tier)
TPM/RPM limits per key, per model, per team
Max parallel requests
Automatic fallback to zero-cost models при exhaust
Temporary budget increase (new in 2026)

Agent-specific:

max_iterations — per-session iteration cap
max_budget_per_session — per-session spend cap
Session tracking через x-litellm-trace-id header или metadata.session_id
Session-level TPM/RPM limits отдельно от agent-wide

Источники: https://docs.litellm.ai/docs/proxy/users, https://docs.litellm.ai/docs/a2a_iteration_budgets

LangGraph defensive controls

recursion_limit дефолтит к 25 steps; raises GraphRecursionError при превышении
2026 guidance: combining recursion_limit с token budget + semantic completion checks + time-based circuit breakers — single guards недостаточно

Источник: https://docs.langchain.com/oss/python/langgraph/errors/GRAPH_RECURSION_LIMIT

Recommended defensive patterns (2026 consensus)

Из multiple 2026 sources — production playbook многоуровневый:

Hard max iteration caps на уровне графа/агента
Token budget enforced в коде через pre-call check — НЕ в системном промпте (LLMs не могут reliably self-enforce)
Token-velocity circuit breaker — suspend execution когда spend-rate превышает threshold
Hard wall-clock timeouts per request + per session
Progress / semantic completion detection — detect "same tool, same args, repeated" loops
Budget pressure warnings injected в контекст до iteration exhaustion
- Пример (Utah harness): two-tier "CAUTION" (10 iters out) + "WARNING" (3 iters out) system messages, appended только в API copy (not persisted)
Graceful termination + structured failure return при budget exhaustion
Per-user/per-team prompt-injection rate limits на gateway layer

Gateway-pattern (TrueFoundry 3-layer)

Per-user rate limits
Per-team budgets
Per-provider circuit breakers

Источник: https://www.truefoundry.com/blog/rate-limiting-ai-agents-preventing-llm-api-exhaustion

Что у нас сделано в intel-collector

Сверяем с recommended patterns:

Pattern	У нас
Hard max iteration caps	❌ Нет (наш граф линейный без циклов — циклов нет в принципе)
Token budget enforced в коде через pre-call check	✅ В `cost.py:check_budget()` + `llm.py:_preflight()`
Token-velocity circuit breaker	❌ Нет (дневной cap есть, но не velocity)
Hard wall-clock timeouts	❌ Нет в коде, есть в systemd unit
Progress / semantic completion detection	n/a (линейный flow)
Budget pressure warnings в context	❌ Не реализовано (можно добавить — log warning когда >80% дневного budget)
Graceful termination + structured failure return	✅ `BudgetExceededError` — pipeline catches, не падает
Per-user rate limits на gateway	n/a (single user)
Persistent cost ledger	✅ `CostLedger` таблица в Postgres
Daily + monthly cap	✅ В config: `daily_budget_usd` + `monthly_budget_usd`

Что добавить в наш код

Priority 1 — Budget pressure warnings (legitimate gap)

Добавить в cost.py:

def budget_pressure_level() -> str:
    daily_spent = spent_today()
    if daily_spent > settings.daily_budget_usd * 0.9:
        return "critical"  # >90% used
    if daily_spent > settings.daily_budget_usd * 0.5:
        return "warning"   # >50% used
    return "ok"

Используется в graph для логирования / отправки Telegram-warning'а.

Priority 2 — Wall-clock timeout в коде

Сейчас timeout есть только в litellm.acompletion дефолтом (~600s). Добавить per-stage asyncio.wait_for(..., timeout=120).

Priority 3 — Velocity check

Добавить:

def spent_last_hour() -> float:
    ...

def velocity_check(planned_cost: float, max_per_hour: float = 0.20) -> None:
    """Stop если >$0.20/час burn rate (для нашего workload это >6x normal)."""
    last_hour = spent_last_hour() + planned_cost
    if last_hour > max_per_hour:
        raise BudgetExceededError(f"velocity exceeded: ${last_hour:.4f} in last hour")

Не критично сейчас

Наш граф линейный + один запуск в день + Gemini Flash дешёвый → реальный риск runaway очень низкий. Текущей защиты (hard daily cap + ledger + dry_run) достаточно. Priority 1-3 — incremental improvements когда дойдут руки.

Metadata

title: Cost protection patterns для LLM-агентов 2026
tags: ['research', 'cost', 'budget', 'production', 'patterns']
created: 2026-06-30
sources_fetched: 2026-06-30