Engineering·May 12, 2026·9 min read

Why most AI agents fail in production, and the boring fix.

Evals, retries, and the unglamorous infra that separates demos from systems your team actually trusts.

The impressive demo is rarely the hard part. The hard part is making the system behave when inputs are messy, users are rushed, APIs time out, and the model is asked to operate inside a workflow with real consequences.

The failure pattern

Most agent projects skip straight from prompt to production. There is no task boundary, no expected output contract, no failure capture, and no way to tell whether the agent improved the workflow or only sounded confident.

| Layer | What breaks | Production fix |
|---|---|---|
| Task boundary | The agent tries to plan, decide, write, and execute at once. | Give it one narrow job with explicit inputs, outputs, and stop conditions. |
| Tool calls | External APIs fail or return partial data. | Add retries, idempotency, timeouts, and typed tool responses. |
| Model output | The answer is plausible but not useful to the workflow. | Validate structured output and run evals against real examples. |
| Risky actions | The system acts before a human checks the result. | Keep human review for financial, legal, operational, or customer-facing actions. |
| Observability | Failures disappear into chat logs. | Log prompt version, retrieved context, tool calls, latency, user correction, and final outcome. |
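
To make the tool-call fix above concrete, here is a minimal sketch of a wrapper that combines retries, a timeout, an idempotency key, and a typed response. The names `ToolResult` and `call_tool`, the `(payload, idempotency_key, timeout_s)` tool signature, and the default limits are all assumptions made for illustration, not part of any particular framework. It also assumes the downstream API honors the idempotency key.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToolResult:
    """Typed tool response: callers check `ok` instead of parsing free text."""
    ok: bool
    data: Optional[dict] = None
    error: Optional[str] = None
    attempts: int = 0


def call_tool(
    tool: Callable[[dict, str, float], dict],  # hypothetical tool signature
    payload: dict,
    *,
    max_attempts: int = 3,
    timeout_s: float = 5.0,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> ToolResult:
    # One key shared by every attempt, so a retried write is not applied twice
    # (assumption: the downstream API deduplicates on this key).
    idempotency_key = str(uuid.uuid4())
    last_error = "no attempts made"
    for attempt in range(1, max_attempts + 1):
        try:
            # The tool itself is expected to enforce timeout_s and raise TimeoutError.
            data = tool(payload, idempotency_key, timeout_s)
            return ToolResult(ok=True, data=data, attempts=attempt)
        except (TimeoutError, ConnectionError) as exc:
            # Only transient failures are retried; other exceptions propagate.
            last_error = f"{type(exc).__name__}: {exc}"
            if attempt < max_attempts:
                # Exponential backoff: base, 2x base, 4x base, ...
                sleep(base_delay_s * 2 ** (attempt - 1))
    return ToolResult(ok=False, error=last_error, attempts=max_attempts)
```

Returning a `ToolResult` instead of raising on final failure means the agent loop has to handle the failed case explicitly, and the attempt count is available for logging alongside latency and outcome.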

The boring fix

  • Start with one workflow where success can be judged by a human operator.
  • Write down what a good output looks like before choosing tools or models.
  • Capture failure cases as test fixtures, not anecdotes.
  • Use evals for repeatable judgement and human review for irreversible actions.
  • Ship narrow, observe misses, then widen the workflow only after the loop is stable.
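
As a rough sketch of what "failure cases as test fixtures" and "evals for repeatable judgement" can look like, the example below validates an agent's structured output against a written contract and scores it over a list of fixtures. Everything here is hypothetical: the `category` and `needs_human_review` fields, the `Fixture` shape, and the exact-match scoring are assumptions chosen for a simple classification-style workflow, and real evals will often need fuzzier scoring.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# Assumed output contract: written down before choosing tools or models.
REQUIRED_FIELDS = {"category": str, "needs_human_review": bool}


def validate_output(raw: str) -> dict:
    """Parse model output and enforce the contract; raise ValueError on violation."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(parsed.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or not {expected_type.__name__}")
    return parsed


@dataclass(frozen=True)
class Fixture:
    """One captured case: a real input plus the output an operator judged correct."""
    name: str
    input_text: str
    expected_category: str


def run_evals(agent: Callable[[str], str], fixtures: List[Fixture]) -> dict:
    """Run every fixture through the agent and collect failures with reasons."""
    failures = []
    for fx in fixtures:
        try:
            out = validate_output(agent(fx.input_text))
        except ValueError as exc:
            failures.append((fx.name, str(exc)))
            continue
        if out["category"] != fx.expected_category:
            failures.append((fx.name, f"got {out['category']!r}"))
    return {
        "passed": len(fixtures) - len(failures),
        "total": len(fixtures),
        "failures": failures,
    }
```

Each production miss becomes a new `Fixture`, so the suite grows with the workflow, and a prompt or model change can be checked against every past failure before the scope is widened.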