AI Delivery Systems

Tiered Retry Is a Failure Economy

Diagnose failure before paying to retry

Insight / Published May 13, 2026

The expensive part of agentic coding is not the model. It is paying, again and again, for failure no one has diagnosed.

When an autonomous coding agent fails, the next step reveals the system. Most systems do the same thing: catch the error and try again. Same model. Same context. Same prompt. If that fails, they try again. Then they escalate to a bigger model, or quit.

Failure economy

This is full price, paid repeatedly, for the same unknown defect.

It may survive small loops. It collapses on real work.

Abracapocus takes the opposite view. Failure is not a signal to rerun the prompt. It is evidence to classify, route, and repair.

Retry without diagnosis

The naive loop retries blindly because it has no other move. It does not know why the task failed, so it cannot change course. It drives the same attempt into the same wall, now with the failed attempt in context, anchoring the next one to what already failed.

Codex’s /goals loop is the clearest example. It retries, but it has no typed diagnosis, no clean context, no bounded cost, and no acceptance gate beyond the model’s own judgment.

Retrying is not the weakness. Blindness is.

Retry without diagnosis is not resilience. It is hope, billed by the token.

Failure is not uniform

The naive loop treats every failure alike because it cannot tell them apart. But agent failures differ. Each has its own cause, and each needs its own repair.

Intent failure means the agent misunderstood the task and built the wrong thing correctly. The fix is a clearer instruction, not another attempt at the same one.
Context failure means the agent missed an existing function, pattern, or convention. The fix is better context, not a longer conversation.
Judgment failure means the agent chose an approach that violates the design. The fix is a stronger model for that decision, or constraints that remove the decision.
Completion failure means the agent stopped before the work was done. The fix is a stricter acceptance gate, not another pass at a task the model already thinks it finished.

Retrying all four the same way guarantees waste. One uniform response can match at most one of the four causes, so at least three receive the wrong treatment.
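The four failure types above can be written down as a typed taxonomy, with each type paired to its own repair. This is a minimal sketch, not Abracapocus's actual code: the names `FailureType`, `REPAIR_FOR`, and `repair_for`, and the repair labels, are hypothetical and only restate the fixes described in the list.

```python
from enum import Enum


class FailureType(Enum):
    INTENT = "intent"          # built the wrong thing correctly
    CONTEXT = "context"        # missed an existing function, pattern, or convention
    JUDGMENT = "judgment"      # chose an approach that violates the design
    COMPLETION = "completion"  # stopped before the work was done


# Hypothetical repair labels; each one mirrors the fix named in the text.
REPAIR_FOR = {
    FailureType.INTENT: "rewrite_instruction",
    FailureType.CONTEXT: "supply_missing_context",
    FailureType.JUDGMENT: "stronger_model_or_constraints",
    FailureType.COMPLETION: "stricter_acceptance_gate",
}


def repair_for(failure: FailureType) -> str:
    """Return the repair strategy that matches the diagnosed failure."""
    return REPAIR_FOR[failure]
```

The point of the mapping is that no two failure types share a repair, which is exactly what a single uniform retry cannot express.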

Classification is the gate

Repair cannot be routed until failure has been named. Classification must come first, and it must happen outside the model’s own judgment.

The acceptance gate classifies from evidence: the diff, changed files, commands run, missing artifacts, forbidden patterns, and acceptance result. The model’s opinion of its own work does not count.
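A gate of this kind might look like the sketch below. Everything in it is an assumption made for illustration: the `Evidence` fields loosely follow the evidence listed above, and the rule order in `classify` is one plausible heuristic, not the actual classification logic. What matters is that the function reads only evidence and never asks the model what it thinks.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical evidence record collected after an attempt."""
    changed_files: list[str] = field(default_factory=list)
    missing_artifacts: list[str] = field(default_factory=list)
    forbidden_patterns: list[str] = field(default_factory=list)
    duplicate_symbols: list[str] = field(default_factory=list)
    acceptance_passed: bool = False


def classify(evidence: Evidence) -> Optional[str]:
    """Name the failure from evidence alone; None means the task passed."""
    if evidence.acceptance_passed:
        return None
    if evidence.duplicate_symbols:
        return "context"      # re-created something that already exists
    if evidence.forbidden_patterns:
        return "judgment"     # approach violates a stated design rule
    if evidence.missing_artifacts:
        return "completion"   # required outputs were never produced
    # Assumed fallback: nothing is missing or forbidden, yet acceptance
    # failed, so the agent most likely built the wrong thing.
    return "intent"
```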

Naive loop

No gate, no classification, no targeted repair. There is only another pass through the same loop.

Abracapocus

The gate decides. The diagnosis determines the repair.

This is the step the naive loop skips. Without a gate, there is no classification. Without classification, there is no targeted repair. There is only the loop.

Fresh context is not optional

The counterintuitive part is this: retrying in the same context can make the next attempt worse.

Once a model fails, the failed attempt becomes part of the record. The model anchors on what it tried. It defends its earlier reasoning. In effect, it argues with itself, and the wrong answer spoke first.

A targeted repair with fresh context usually beats a same-context retry. The repair starts clean. It carries forward only the relevant state and the diagnosis. It is free from the failed attempt’s gravity.

It is the difference between asking someone to reconsider and giving someone else the corrected brief.

Recovery unit

The unit of recovery is the failed task, not the whole goal.
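As a rough sketch of what starting clean could mean in code, the repair brief below is built from the task, the diagnosis, and the relevant state only. The failed transcript is deliberately left out. The `FailedAttempt` type, its fields, and `build_repair_prompt` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class FailedAttempt:
    """Hypothetical record of one failed task, not the whole goal."""
    task: str
    transcript: list[str]           # full history of the failed attempt
    diagnosis: str                  # named by the gate, not by the model
    relevant_state: dict[str, str]  # only the facts the repair needs


def build_repair_prompt(failed: FailedAttempt) -> list[str]:
    """Build a fresh brief: task, diagnosis, relevant state. No transcript."""
    lines = [f"Task: {failed.task}", f"Diagnosis: {failed.diagnosis}"]
    lines += [f"{key}: {value}" for key, value in sorted(failed.relevant_state.items())]
    return lines
```

Nothing from `transcript` reaches the new prompt, so the repair attempt has no earlier reasoning to anchor on or defend.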

Escalate by evidence, not exhaustion

Naive retry escalates when patience runs out. After N attempts, it either quits or throws a bigger model at the problem.

A failure economy escalates by evidence.

The path is deliberate: a cheap model attempts the task; the gate diagnoses any failure; a typed repair handles that failure with fresh context. Only if that repair fails does the system escalate to the model best suited to that kind of failure.

An expensive model should be invoked because evidence shows a cheaper one cannot finish, not because a counter reached three.
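The escalation path described above can be sketched as a small control loop. This assumes hypothetical `attempt` and `gate` callables supplied by the caller, plus a `specialists` table that maps each diagnosis to the model suited to it; none of these names come from a real library. Note that there is no retry counter anywhere in it.

```python
from typing import Callable, Optional

# Hypothetical signatures, for illustration only:
#   Attempt(model, task, repair_brief) -> result
#   Gate(result) -> diagnosis string, or None when the result is accepted
Attempt = Callable[[str, str, Optional[str]], str]
Gate = Callable[[str], Optional[str]]


def run_tiered(task: str, attempt: Attempt, gate: Gate, cheap: str,
               specialists: dict[str, str], default_strong: str) -> tuple[str, list[str]]:
    """Return (outcome, models_used). Every call is assumed to start fresh."""
    used = [cheap]
    diagnosis = gate(attempt(cheap, task, None))
    if diagnosis is None:
        return "accepted", used

    # One typed repair on the cheap model, briefed only with the diagnosis.
    used.append(cheap)
    diagnosis = gate(attempt(cheap, task, diagnosis))
    if diagnosis is None:
        return "accepted", used

    # Escalate only now, to the model suited to the diagnosed failure type.
    strong = specialists.get(diagnosis, default_strong)
    used.append(strong)
    final = gate(attempt(strong, task, diagnosis))
    return ("accepted" if final is None else "failed"), used
```

The stronger model is reached only after the gate has named the failure twice, so its invocation is backed by evidence rather than by elapsed attempts.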

Blind escalation

Pay 1x, then 1x again, then 1x again, then pay for a frontier model on a problem the system never understood.

Tiered repair

Pay 1x, classify the failure at the gate, then pay for one targeted repair on a cheap model with fresh context.

Most tasks end there. The expensive model is never touched.

That is where token savings come from at scale: not merely from cheaper models, but from refusing to pay repeatedly for the same undiagnosed mistake.
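The two payment patterns can be compared with simple arithmetic. The unit prices below are invented for illustration (a cheap call costs 1, a frontier call 10, a gate classification 0.25); they are assumptions, not measurements, and real ratios will differ.

```python
def blind_cost(retries: int, cheap: float, frontier: float) -> float:
    """First attempt plus blind retries on the cheap model, then a frontier call."""
    return (1 + retries) * cheap + frontier


def tiered_cost(cheap: float, gate: float) -> float:
    """One attempt, one gate classification, one targeted cheap repair."""
    return cheap + gate + cheap


# Illustrative units only; see the assumptions stated above.
blind = blind_cost(retries=2, cheap=1.0, frontier=10.0)  # 13.0
tiered = tiered_cost(cheap=1.0, gate=0.25)               # 2.25
```

The blind figure is, if anything, generous to the naive loop: it prices each retry at 1x even though a same-context retry also carries the failed attempt's tokens.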

In practice

A cheap model writes a function that duplicates one already in the codebase. The naive loop retries. With no new information, the model is likely to write the duplicate again.

The gate sees the duplicate symbol and classifies the failure as context failure. The system creates a repair task with fresh context: the existing function’s path, the relevant convention, and the instruction to reuse it.

Told plainly that the function exists, the cheap model uses it. The task passes on the second attempt. The expensive model is never called.
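One way a gate could see the duplicate symbol in this scenario is to compare function names in the new code against names already defined in the codebase. The sketch below uses Python's standard `ast` module; the helper names and the idea of passing the codebase as a path-to-source mapping are assumptions for the example, and matching on name alone is a crude heuristic.

```python
import ast


def defined_functions(source: str) -> set[str]:
    """Names of all functions defined anywhere in a Python source string."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def duplicate_symbols(existing: dict[str, str], new_source: str) -> dict[str, str]:
    """Map each newly defined function name to a path that already defines it."""
    new_names = defined_functions(new_source)
    found: dict[str, str] = {}
    for path, source in existing.items():
        for name in defined_functions(source) & new_names:
            found.setdefault(name, path)  # keep the first path seen
    return found
```

The returned mapping already contains what the repair task needs: the name of the duplicated function and the path where the existing one lives.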

Same failure. One system diagnosed it. The other paid twice and still got the wrong answer.

What this makes possible

Cheap models do not become dependable by never failing. They become dependable when the system around them can diagnose failure cheaply and repair only the broken slice.

That is what turns a model too weak for an open-ended goal into one you can run, by the dozen, across a large project. Every failure is caught, named, and fixed at the lowest cost that can actually fix it.

Treating every failure as identical is the most expensive assumption in agentic development.

Diagnosis is what makes the economics work.