Engineering · May 12, 2026 · 9 min read
Why most AI agents fail in production, and the boring fix.
Evals, retries, and the unglamorous infra that separates demos from systems your team actually trusts.
The impressive demo is rarely the hard part. The hard part is making the system behave when inputs are messy, users are rushed, APIs time out, and the model is asked to operate inside a workflow with real consequences.
The failure pattern
Most agent projects skip straight from prompt to production. There is no task boundary, no expected output contract, no failure capture, and no way to tell whether the agent improved the workflow or only sounded confident.
| Layer | What breaks | Production fix |
|---|---|---|
| Task boundary | The agent tries to plan, decide, write, and execute at once. | Give it one narrow job with explicit inputs, outputs, and stop conditions. |
| Tool calls | External APIs fail or return partial data. | Add retries, idempotency, timeouts, and typed tool responses. |
| Model output | The answer is plausible but not useful to the workflow. | Validate structured output and run evals against real examples. |
| Risky actions | The system acts before a human checks the result. | Keep human review for financial, legal, operational, or customer-facing actions. |
| Observability | Failures disappear into chat logs. | Log prompt version, retrieved context, tool calls, latency, user correction, and final outcome. |
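As an illustration of the "Tool calls" row, here is a minimal Python sketch of retries, a per-call timeout, an idempotency key, and a typed response. Every name in it (`ToolResult`, `call_tool_with_retries`, and the `tool(payload, idempotency_key, timeout_s)` calling convention) is hypothetical and not taken from any particular framework. It assumes the tool raises `TimeoutError` or `ConnectionError` on transient failures and that the remote side deduplicates on the key.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToolResult:
    """Typed tool response, so callers never have to guess at the shape."""
    ok: bool
    data: Optional[dict]
    error: Optional[str]
    attempts: int
    idempotency_key: str


def call_tool_with_retries(
    tool: Callable[[dict, str, float], dict],  # hypothetical tool signature
    payload: dict,
    *,
    max_attempts: int = 3,
    timeout_s: float = 5.0,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> ToolResult:
    # One key shared by every attempt, so a retried write can be
    # deduplicated by the remote side (assumed to honor the key).
    key = str(uuid.uuid4())
    last_error = "no attempts made"
    for attempt in range(1, max_attempts + 1):
        try:
            data = tool(payload, key, timeout_s)
            return ToolResult(True, data, None, attempt, key)
        except (TimeoutError, ConnectionError) as exc:
            # Only transient failures are retried; anything else propagates.
            last_error = f"{type(exc).__name__}: {exc}"
            if attempt < max_attempts:
                # Exponential backoff: base, 2x base, 4x base, ...
                sleep(base_delay_s * 2 ** (attempt - 1))
    return ToolResult(False, None, last_error, max_attempts, key)
```

The caller always gets a `ToolResult` back, whether the call succeeded or exhausted its retries, so the failure path is explicit and can be logged alongside the attempt count instead of surfacing as an unhandled exception mid-workflow.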
The boring fix
- Start with one workflow where success can be judged by a human operator.
- Write down what a good output looks like before choosing tools or models.
- Capture failure cases as test fixtures, not anecdotes.
- Use evals for repeatable judgement and human review for irreversible actions.
- Ship narrow, observe misses, then widen the workflow only after the loop is stable.
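The steps above can be sketched as a small eval loop: failure cases live as fixtures, the output contract is checked before anything else, and every run produces a pass count you can compare across prompt versions. This is a minimal sketch under stated assumptions, not a definitive harness. The `Fixture` type, the `REQUIRED_FIELDS` contract, and the `agent(text) -> str` interface are all hypothetical, and it assumes the agent returns a JSON object as a string.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Fixture:
    """One captured case: a real input plus the fields a good output must have."""
    name: str
    input_text: str
    expected: dict


# Hypothetical output contract, written down before choosing tools or models.
REQUIRED_FIELDS = {"category": str, "needs_human_review": bool}


def validate_output(raw: str) -> dict:
    """Parse and check the output contract; raise ValueError on any violation."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("output must be a JSON object")
    for field, field_type in REQUIRED_FIELDS.items():
        if not isinstance(parsed.get(field), field_type):
            raise ValueError(f"field {field!r} missing or not {field_type.__name__}")
    return parsed


def run_evals(agent: Callable[[str], str], fixtures: list[Fixture]) -> dict:
    """Run every fixture through the agent and collect named failures."""
    failures = []
    for fx in fixtures:
        try:
            output = validate_output(agent(fx.input_text))
        except ValueError as exc:
            failures.append((fx.name, str(exc)))
            continue
        # Compare only the fields the fixture pins down.
        mismatched = {
            k: output.get(k) for k, v in fx.expected.items() if output.get(k) != v
        }
        if mismatched:
            failures.append((fx.name, f"mismatch: {mismatched}"))
    passed = len(fixtures) - len(failures)
    return {"passed": passed, "total": len(fixtures), "failures": failures}
```

Each miss observed in production becomes one more `Fixture`, so the suite grows from real failures, and the workflow is widened only once the pass count holds steady across runs.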