AI agents are ready for production. The accountability layer isn't.

For two years, the headline question about AI agents was "can they do the work?" By the end of 2025, the answer is yes for most production-shaped tasks. The question teams actually spend their time on now is different: "can we let them?"

Orchestration is solved by Temporal, LangGraph, Inngest. Observability has rich options in LangSmith, Arize, Helicone. Cost is metered by every cloud provider. The fourth piece, the human checkpoint, is where most production agent projects still stall.

This post is about that gap: what it actually is, why it's harder than it looks, and what a production-grade primitive for it has to do.

What "production" means for an agent

A demo agent does one thing in front of one person. A production agent does many things, autonomously, against real systems, on behalf of users who never see the model itself.

That shift introduces a class of problems that doesn't exist in demos. The agent will eventually:

Spend money on a real card
Send an email to a real person
Make a refund decision on a real customer
Delete a row in a real database

Each of those is irreversible enough that "the agent might get it wrong sometimes" is not an acceptable failure mode. Not because the model is bad. Modern models are pretty good. The issue is that "pretty good" still produces enough wrong actions at production volume that you need a human to gate the consequential ones.

The same OpenAI that ships Operator (a fully autonomous browser agent) ships it with explicit instructions to its own model to ask for human approval before submitting an order or sending an email. They wrote that into the system prompt deliberately. The product reads as a bet that human approval is a permanent part of any consequential agent flow.

The three walls

Three categories of decision genuinely need a human, and they don't get smaller as models improve. We've taken to calling them the three walls.

Wall 1: authorization. The agent has all the information it needs to act, but the action requires a human to be on the hook for what happens next. Approve a wire. Sign a contract. Place an order. The model isn't unsure. The accountability is what's missing, and a smarter model doesn't fix that. If anything, this wall gets higher as agents get more capable, because more is delegated to them.

Wall 2: reality. The agent has reached the edge of its context. Something exists in the world that the agent hasn't been told. Did the customer reach out via a different channel? Did the product team just change the refund policy? The model is making a confident guess about something it can't actually see, and the only honest answer is to ask.

Wall 3: presence. The agent needs something to happen in the physical world. A document signed. A package received. A piece of hardware reset. No software primitive substitutes for a human body in a place.

Most production agent failures are one of these three. None go away by switching to a better model. The work isn't to eliminate the human checkpoint; it's to make pausing for a human a clean, reliable, first-class primitive in your agent runtime.

What goes wrong when teams build this themselves

Every team that ships a production agent eventually builds some flavor of human approval into it. The first version usually looks like this:

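def approve_step(payload: dict) -> bool:
    """Naive first version: block on stdin until someone answers."""
    print(payload)
    response = input("Approve? [y/n]: ")
    # Accept "Y" or " y " too, not only an exact lowercase "y".
    return response.strip().lower() == "y"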

That works exactly until you deploy behind a worker pool, an API gateway, or a Slack webhook. Then the rebuild begins:

1. Swap input() for a Slack DM. The agent now polls a database table for the response.
2. Split the agent into a "trigger" endpoint and a "resume" endpoint, because the worker can't hold a thread open waiting on a human.
3. Add a checkpointer to serialize agent state to Postgres, because workers restart and in-flight state gets lost.
4. Add schema validation, because the human responds with freeform text and the agent needs typed input. Now also handle the same human responding via email reply and the web dashboard.
5. Bolt on an audit table when compliance asks who approved what last quarter.

This is the work that should be a library, not a project. Most teams reach step 4 or 5 before they realize they're rebuilding the same primitive that every other team is also rebuilding.
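
To make the trigger/resume split and the checkpointer from the list above concrete, here is a minimal sketch. The names trigger, resume, and CHECKPOINTS are hypothetical, and the in-memory dict is only a stand-in for the Postgres table a real deployment would use.

```python
import json
import uuid

# Stand-in for a Postgres checkpoint table (assumption for illustration).
CHECKPOINTS: dict[str, str] = {}


def trigger(agent_state: dict, question: str) -> str:
    """Pause the agent: serialize its state and hand back a task id."""
    task_id = str(uuid.uuid4())
    CHECKPOINTS[task_id] = json.dumps({"state": agent_state, "question": question})
    # A real system would notify the human here (Slack DM, email, ...).
    return task_id


def resume(task_id: str, approved: bool) -> dict:
    """Resume in any worker: reload the saved state and attach the decision."""
    record = json.loads(CHECKPOINTS.pop(task_id))
    state = record["state"]
    state["approved"] = approved
    return state
```

Because the state travels through the store rather than a live thread, the worker that calls resume does not have to be the one that called trigger.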

What a production-grade human checkpoint actually needs

The right primitive is one function call. The function pauses the agent, sends the payload to a human, waits for a typed response, and returns the decision to the agent. Easy to describe. The hard part is what has to be true underneath that single call for it to work in production:

Durable. When the worker holding the agent crashes between pause and response, the agent's state survives. The pause is backed by a real database, not in-memory state. The resume is idempotent against duplicate deliveries.
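
One way to get an idempotent resume is to let the database decide which response wins. The sketch below uses SQLite from the standard library as a stand-in for a production database; the table layout and function names are assumptions for illustration, and you would pass a file path instead of ":memory:" to actually survive a crash.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id TEXT PRIMARY KEY,"
        " state TEXT NOT NULL,"
        " response TEXT)"  # NULL until a human answers
    )
    return conn


def pause(conn: sqlite3.Connection, task_id: str, state_json: str) -> None:
    # INSERT OR IGNORE makes a retried pause a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO tasks (id, state) VALUES (?, ?)",
        (task_id, state_json),
    )
    conn.commit()


def record_response(conn: sqlite3.Connection, task_id: str, response: str) -> bool:
    """Store the first response; return False for duplicate deliveries."""
    cur = conn.execute(
        "UPDATE tasks SET response = ? WHERE id = ? AND response IS NULL",
        (response, task_id),
    )
    conn.commit()
    return cur.rowcount == 1
```

The conditional UPDATE is the whole trick: a second webhook delivery for the same task changes zero rows, so the agent is resumed at most once.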

Typed end to end. The payload the human sees is a typed schema (Pydantic, Zod, or equivalent). The response is also typed, validated server-side before the agent ever sees it. No parsing "yes" or "approve, but only up to $40" from a free-text string.
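
As a rough illustration of a typed response, the sketch below validates an untrusted dict by hand using only the standard library. In practice a Pydantic or Zod model would do this work; RefundDecision and parse_decision are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class RefundDecision:
    approved: bool
    max_amount_usd: Optional[float] = None  # optional cap set by the approver


def parse_decision(raw: dict[str, Any]) -> RefundDecision:
    """Validate an untrusted response before the agent ever sees it."""
    approved = raw.get("approved")
    if not isinstance(approved, bool):
        raise ValueError("'approved' must be a boolean")
    cap = raw.get("max_amount_usd")
    if cap is not None:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(cap, bool) or not isinstance(cap, (int, float)) or cap < 0:
            raise ValueError("'max_amount_usd' must be a non-negative number")
        cap = float(cap)
    return RefundDecision(approved=approved, max_amount_usd=cap)
```

An "approve, but only up to $40" answer becomes a structured field the agent can act on, and a free-text "yes" is rejected at the boundary.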

Multi-channel. The same task can deliver via Slack DM, email magic link, web dashboard, mobile push, or whatever channel gets adopted next. The choice is configuration, not code. Operators who live in Slack respond there. Operators who live in email respond there. The agent doesn't know or care which channel was used.
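
"Configuration, not code" can be sketched as a registry of channel functions selected by name. Everything here is hypothetical: the two channel functions return placeholder strings where a real implementation would call the Slack or email API.

```python
from typing import Callable, Dict, List

# A channel is any callable that delivers a task to a recipient.
Channel = Callable[[str, str], str]


def via_slack(recipient: str, task_id: str) -> str:
    return f"slack:{recipient}:{task_id}"  # placeholder for a real Slack call


def via_email(recipient: str, task_id: str) -> str:
    return f"email:{recipient}:{task_id}"  # placeholder for a real email send


CHANNELS: Dict[str, Channel] = {"slack": via_slack, "email": via_email}


def deliver(config: Dict[str, List[str]], recipient: str, task_id: str) -> List[str]:
    """Deliver one task over every channel named in the config."""
    return [CHANNELS[name](recipient, task_id) for name in config["channels"]]
```

Adding a channel means registering one more function; the agent code that created the task does not change.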

Auditable. Every approval event (delivery, claim, response) writes a row recording who, when, from which channel, and what they saw. Compliance teams can answer the "who approved the wire to vendor X" question without grepping through Slack history.
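
A minimal sketch of such an audit trail, again using SQLite as a stand-in. The column names and helper functions are assumptions, not a prescribed schema.

```python
import sqlite3
from datetime import datetime, timezone


def open_audit_log(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit ("
        " task_id TEXT, event TEXT, actor TEXT,"
        " channel TEXT, payload TEXT, occurred_at TEXT)"
    )
    return conn


def log_event(conn, task_id, event, actor, channel, payload) -> None:
    """Append one row per delivery, claim, or response."""
    occurred_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO audit VALUES (?, ?, ?, ?, ?, ?)",
        (task_id, event, actor, channel, payload, occurred_at),
    )
    conn.commit()


def who_approved(conn, task_id):
    """Answer the compliance question with a query instead of a search."""
    return conn.execute(
        "SELECT actor, channel, occurred_at FROM audit"
        " WHERE task_id = ? AND event = 'response'",
        (task_id,),
    ).fetchall()
```

The payload column stores what the approver was shown, so the record covers "what they saw" as well as who and when.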

Routable. The task knows who to notify (everyone in this Slack channel) and who can actually act (only the on-call engineer). Multiple notifications, single owner.
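
The split between "who hears about it" and "who can act" can be sketched as two sets on the task plus a claim step. Task and claim are hypothetical names, and a real system would make the claim atomic in the database rather than in memory.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class Task:
    task_id: str
    notify: FrozenSet[str]       # everyone who gets pinged
    can_act: FrozenSet[str]      # the subset allowed to respond
    owner: Optional[str] = None  # set once someone claims the task


def claim(task: Task, user: str) -> bool:
    """Let one authorized user take ownership; refuse everyone else."""
    if user not in task.can_act:
        return False
    if task.owner is not None and task.owner != user:
        return False
    task.owner = user
    return True
```

A whole channel can be notified, but only the first authorized person to claim the task becomes its single owner.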

Optionally verified. Before the agent trusts the human's response, an LLM can pre-check it. Did the approver actually read the form? Does the response match the policy? Junior approvers clicking yes without reading is a real failure mode, and a verifier catches it.
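
The verifier is easiest to picture as a pluggable check that runs before the response reaches the agent. The sketch below uses a deterministic policy check as a stand-in for the LLM call; the Verifier signature and the payload field names are assumptions for illustration.

```python
from typing import Callable, Tuple

# A verifier returns (ok, reason). In production this might wrap an LLM call.
Verifier = Callable[[dict, dict], Tuple[bool, str]]


def policy_cap_verifier(payload: dict, response: dict) -> Tuple[bool, str]:
    """Deterministic stand-in: approvals may not exceed the policy limit."""
    if response.get("approved") and payload["amount_usd"] > payload["policy_limit_usd"]:
        return False, "approved amount exceeds policy limit"
    return True, "ok"


def accept_response(payload: dict, response: dict, verifier: Verifier) -> dict:
    """Run the verifier before the agent is allowed to see the response."""
    ok, reason = verifier(payload, response)
    if not ok:
        raise ValueError(f"response rejected by verifier: {reason}")
    return response
```

Keeping the verifier behind a plain callable means the same hook can hold a rule-based check today and a model-backed one later.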

Six properties. Most agent frameworks today ship one or two. We surveyed twelve of them and the highest score was 15 out of 30, which we wrote up in detail in our 12-framework audit. The primitive doesn't replace your orchestrator. It plugs into it: a signal-backed activity on Temporal, a wrapper around interrupt() on LangGraph, a @function_tool that serializes RunState on OpenAI Agents SDK. Same pattern every time. Your agent code stays clean. The pause-and-notify-and-resume machinery lives in one place that can be hardened over time.

The work ahead

If you're shipping an agent into production this quarter, the questions worth asking before you do are: where does my agent need a human checkpoint, how will the human be reached, what's the typed shape of their response, and what happens if the worker waiting for them crashes? The answers shouldn't be "I'll figure it out." They should be three lines of code and a config decision.

We've been building one called awaithumans (Apache 2.0, Python and TypeScript). It plugs into Temporal and LangGraph today via dedicated adapters, and it ships Slack, email, and a built-in web dashboard out of the box, along with an optional Claude/OpenAI verifier. Telegram is in the next release.

The infrastructure to make the human checkpoint a solved primitive is finally arriving. The next year of production AI agents will be defined less by what the models can do, and more by the layer of accountability built around them.

Want to dig deeper? Read the 12-framework audit on how popular agent frameworks handle the human checkpoint, or check out the awaithumans library. Apache 2.0. Issues and PRs welcome.