explainx.ainewsletter3.4k
trending๐Ÿ”ฅloopsskills
pricing
workshops โ†—
explainx.ai

Learn to lead teams that combine humans and agents. Platform access, live workshops, bootcamps, and 50+ courses โ€” plus skills, tools, and MCP to practice what you learn.

follow us

custom AI agents

[email protected]

get started

Join ยท $29/mo

learn

start for freepathwaysworkshopsbootcampscoursescertificationscertification testsexplainx universitycorporate trainingfacilitatorshackathonslearn skills & mcp

discover

skillstoolsagentsmcp serversdesignsllmsagiranks

content

releasesvisionmissionaboutcommunityteamcareersresourcespromptsgenerators hubgenerator SEO hubprompt templatesprompt guidesblogfor LLMsdemo

Sister Products

Infloq

Infloq

Influencer marketing

BgBlur

BgBlur

Privacy-first blur

Olly Social

Olly Social

Social AI copilot

Ceptory

Ceptory

Video intelligence

BgRemover

BgRemover

Background removal

newsletter ยท weekly

Get AI news, tools, and insights in your inbox.

contactsupportprivacytermsdata rightssubmission guidelines

ยฉ 2026 AISOLO Technologies Pvt Ltd

โ† Back to blog

explainx / blog

How to Build an AI Agent Loop: Triggers, Retries, Checkpoints, and Human Handoffs

Learn how to design a production-grade ai agent loop architecture with proper triggers, state management, retry logic, checkpoints, and human handoff patterns that don't break silently under real workloads.

Jun 28, 2026ยท9 min readยทYash Thakker
AI agentsAgent architectureAgentic workflowAutonomous agentsLLM engineeringAgent design patternsClaude CodeProduction AI
How to Build an AI Agent Loop: Triggers, Retries, Checkpoints, and Human Handoffs

Last month a team shipped an agent that scraped competitor pricing, formatted a report, and emailed it to the sales team every morning. It worked perfectly in testing. In production, it ran for three weeks before anyone noticed it had been sending the same report repeatedly because the pricing page had changed its HTML structure. The agent never errored. It just quietly looped and delivered stale data.

That is the failure mode that kills production agents. Not dramatic crashes โ€” silent drift. The loop kept running, the termination condition was never triggered, and there was nothing in the output that looked wrong at a glance.

Building an agent loop that works in a demo is easy. Building one that holds up in production requires thinking through four primitives, three checkpoint rules, a retry strategy, and a clear policy on when to stop and ask a human. This guide covers all of it.


What an agent loop actually is in production

A demo agent loop is usually: call LLM, get a response, maybe call a tool, repeat. That works for a five-minute walkthrough.

A production agent loop has more moving parts:

Trigger โ†’ Task Queue โ†’ [State Read โ†’ Executor โ†’ State Write โ†’ Termination Check] โ†’ Output
                              ^__________________________|
                              (repeat until termination)

The trigger starts the loop โ€” a schedule, a webhook, a user action, or a prior agent completing its own loop. The task queue holds work units if multiple items need processing. Inside the loop body, the agent reads its current state, runs the executor (the LLM call plus any tool calls), writes updated state, and evaluates whether to continue or stop. Checkpoints and human gates live inside the loop body, not outside it.


The 4 primitives every agent loop needs

1. Trigger

The trigger answers: what starts this loop, and what data does it carry?

A trigger can be scheduled (cron), event-driven (webhook, message queue), or chained (prior agent output). The critical design question is what the trigger payload contains and whether the loop can validate it before starting execution.

@dataclass
class TriggerPayload:
    task_id: str          # stable, used as idempotency key root
    task_type: str        # determines which executor branch runs
    input_data: dict      # the actual work input
    triggered_by: str     # "schedule" | "webhook" | "agent:research-runner"
    triggered_at: str     # ISO timestamp
    max_iterations: int = 25

Validate the payload immediately. If task_id is missing, the idempotency system breaks downstream and you'll get duplicate side effects on retries.

2. State

State is what carries information across iterations. It is not the LLM's context window โ€” it is an external store that persists even if the loop crashes.

@dataclass
class AgentState:
    task_id: str
    iteration: int
    status: str           # "running" | "waiting_for_human" | "done" | "failed"
    working_memory: dict  # structured data the agent has collected
    action_log: list      # every action taken, for idempotency checks
    last_checkpoint: str  # ISO timestamp of last successful checkpoint
    error_count: int
    last_error: str | None

State should be serializable to JSON and written to a durable store (database, Redis with persistence, object storage) after every iteration. If the agent process crashes, the loop can be resumed by reading state back in.

3. Executor

The executor is the LLM call plus tool execution. It takes the current state and the task context as input, runs a step, and returns a result plus any state mutations.

def execute_step(state: AgentState, task_context: dict, llm_client) -> StepResult:
    messages = build_messages(state, task_context)
    
    response = llm_client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4096,
        system=AGENT_SYSTEM_PROMPT,
        messages=messages,
        tools=AVAILABLE_TOOLS,
    )
    
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = dispatch_tool(block.name, block.input, state)
            tool_results.append(result)
    
    return StepResult(
        response=response,
        tool_results=tool_results,
        state_mutations=extract_state_mutations(response, tool_results),
    )

The executor should not make decisions about whether to continue. That belongs to the terminator.

4. Terminator

The terminator evaluates whether the loop should continue, stop successfully, or stop with an error. Keep termination logic explicit and separate from the executor.

def evaluate_termination(state: AgentState, step_result: StepResult) -> TerminationDecision:
    # Hard caps
    if state.iteration >= state.max_iterations:
        return TerminationDecision(should_stop=True, reason="max_iterations_reached", status="failed")
    
    if state.error_count >= 3:
        return TerminationDecision(should_stop=True, reason="too_many_errors", status="failed")
    
    # No-progress detection
    if is_stuck(state):
        return TerminationDecision(should_stop=True, reason="no_progress_detected", status="failed")
    
    # Success signals from the executor
    if step_result.signals_completion():
        return TerminationDecision(should_stop=True, reason="task_complete", status="done")
    
    # Human gate
    if step_result.requires_approval():
        return TerminationDecision(should_stop=True, reason="awaiting_human_approval", status="waiting_for_human")
    
    return TerminationDecision(should_stop=False)

The is_stuck check deserves special attention. Maintain a rolling window of the last N tool calls and their arguments. If the same call appears twice with identical arguments, the agent is not making progress. Stop it.


Where to put checkpoints

A checkpoint is a persisted snapshot of agent state at a known-good moment. Checkpoints make loops restartable. Without them, a crash at step 18 of a 20-step research task means starting from zero.

Three rules for checkpoint placement:

Before irreversible actions. If the next step will send an email, write to a database, call a payment API, or post to an external system โ€” checkpoint first. If the action fails mid-way, you can retry from the pre-action snapshot with the same state.

After expensive operations. If a step burned 10,000 tokens on a complex analysis or made several API calls to aggregate data, checkpoint immediately after. Losing that work to a downstream failure is expensive.

At confidence thresholds. If your executor returns a confidence score or uncertainty signal, checkpoint when the agent transitions from high-confidence to uncertain territory. That boundary is where you most often need to resume with human input.

def should_checkpoint(state: AgentState, step_result: StepResult) -> bool:
    # Always checkpoint before irreversible action
    if step_result.next_action_is_irreversible:
        return True
    # Checkpoint after expensive step (>1000 tokens of tool output)
    if step_result.total_tool_output_tokens > 1000:
        return True
    # Checkpoint at confidence threshold crossing
    if step_result.confidence < CONFIDENCE_THRESHOLD:
        return True
    return False

def save_checkpoint(state: AgentState, db):
    state.last_checkpoint = datetime.utcnow().isoformat()
    db.upsert("agent_states", state.task_id, state.to_dict())

Retry logic

Most agent failures are transient: rate limits, brief network issues, a tool API returning a 503. A well-designed retry strategy handles these without human involvement.

Exponential backoff with jitter. Start with a short base delay and double it on each retry, adding a random jitter to prevent thundering herd if multiple agents fail simultaneously.

import random
import time

def retry_with_backoff(fn, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except RetryableError as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

Idempotency keys. Every write operation should carry an idempotency key before it's retried. Derive the key from a stable hash of the task ID and step index โ€” not from a timestamp, which changes on retry.

import hashlib

def make_idempotency_key(task_id: str, step_index: int, action_type: str) -> str:
    raw = f"{task_id}:{step_index}:{action_type}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

What NOT to retry. Do not retry: LLM calls that returned a valid response (even an unexpected one), actions that succeeded but returned an error in the response body, or any operation you can't verify was idempotent. The safest rule is to only retry operations that raised a clear infrastructure exception (timeout, rate limit, network error) and that you can prove are idempotent.


Human handoffs: the decision framework

The hardest design question in any agent loop is where to put human gates. Gate too aggressively and you've built an expensive email drafting assistant. Gate too loosely and you've given an agent unsupervised authority over consequential actions.

The decision axis is: reversibility times blast radius.

ActionReversible?Blast radiusGate it?
Read a fileYesNoneNo
Search the webYesNoneNo
Write a draftYesLowNo
Send an emailNoMediumYes
Post to social mediaNoHighYes
Delete database recordsNoHighYes
Approve a financial transactionNoVery highYes
Update a config in productionNoVery highYes

When you do gate, the gate must be non-blocking for the agent. The loop pauses, writes a human-approval task to a queue, and the agent state is persisted in "waiting_for_human" status. When the human acts, the queue triggers the loop to resume from that state.

def request_human_approval(state: AgentState, action: dict, approval_queue) -> None:
    approval_task = {
        "task_id": state.task_id,
        "action": action,
        "context_summary": summarize_state_for_human(state),
        "approval_url": f"https://app.explainx.ai/agent-approvals/{state.task_id}",
        "expires_at": (datetime.utcnow() + timedelta(hours=24)).isoformat(),
    }
    approval_queue.enqueue(approval_task)
    state.status = "waiting_for_human"
    save_checkpoint(state, db)

The approval URL pattern โ€” where a human can review context and click approve or reject โ€” is far better than email-based approval. Email gets buried. A dedicated approval UI surfaces the full agent context.


Three loop patterns with code

Pattern 1: Simple task-runner loop

For single-task, single-pass work with clear completion criteria.

def run_task_loop(trigger: TriggerPayload, db, llm_client) -> AgentState:
    state = load_or_init_state(trigger, db)
    
    while True:
        # Execute one step
        step_result = execute_step(state, trigger.input_data, llm_client)
        
        # Update state
        state.iteration += 1
        state.working_memory.update(step_result.state_mutations)
        state.action_log.append(step_result.summary())
        
        # Checkpoint if needed
        if should_checkpoint(state, step_result):
            save_checkpoint(state, db)
        
        # Evaluate termination
        decision = evaluate_termination(state, step_result)
        if decision.should_stop:
            state.status = decision.status
            save_checkpoint(state, db)
            return state
    
    return state

Pattern 2: Multi-step research loop with state persistence

For tasks that span multiple research passes, where each pass informs the next.

def run_research_loop(trigger: TriggerPayload, db, llm_client) -> AgentState:
    state = load_or_init_state(trigger, db)
    
    # Research loop: gather, synthesize, verify
    phases = ["gather", "synthesize", "verify"]
    
    for phase in phases:
        state.working_memory["current_phase"] = phase
        phase_iterations = 0
        
        while phase_iterations < 5:
            step_result = execute_step(state, {**trigger.input_data, "phase": phase}, llm_client)
            
            state.iteration += 1
            phase_iterations += 1
            state.working_memory.update(step_result.state_mutations)
            state.action_log.append(step_result.summary())
            
            # Checkpoint after every gather step (expensive)
            if phase == "gather":
                save_checkpoint(state, db)
            
            if step_result.phase_complete():
                break
            
            decision = evaluate_termination(state, step_result)
            if decision.should_stop:
                state.status = decision.status
                save_checkpoint(state, db)
                return state
    
    state.status = "done"
    save_checkpoint(state, db)
    return state

The key difference: state carries the current phase, so if the process crashes mid-gather, it resumes in the gather phase rather than starting from scratch.

Pattern 3: Human-gated approval loop

For loops where certain actions require explicit human sign-off before proceeding.

def run_approval_loop(trigger: TriggerPayload, db, llm_client, approval_queue) -> AgentState:
    state = load_or_init_state(trigger, db)
    
    while True:
        step_result = execute_step(state, trigger.input_data, llm_client)
        
        state.iteration += 1
        state.working_memory.update(step_result.state_mutations)
        state.action_log.append(step_result.summary())
        
        # Check if this step wants to take an irreversible action
        if step_result.proposed_action and is_irreversible(step_result.proposed_action):
            save_checkpoint(state, db)  # checkpoint before gate
            request_human_approval(state, step_result.proposed_action, approval_queue)
            return state  # loop suspends here
        
        if should_checkpoint(state, step_result):
            save_checkpoint(state, db)
        
        decision = evaluate_termination(state, step_result)
        if decision.should_stop:
            state.status = decision.status
            save_checkpoint(state, db)
            return state
    
    return state

def resume_after_approval(task_id: str, approved: bool, db, llm_client, approval_queue) -> AgentState:
    state = load_state(task_id, db)
    
    if not approved:
        state.status = "failed"
        state.working_memory["rejection_reason"] = "human_rejected_action"
        save_checkpoint(state, db)
        return state
    
    # Inject approval into state so the agent knows it was cleared
    state.working_memory["last_approved_action"] = state.action_log[-1]
    state.status = "running"
    
    # Resume the loop
    return run_approval_loop(
        TriggerPayload(task_id=task_id, **state.working_memory["original_trigger"]),
        db, llm_client, approval_queue
    )

This pattern is central to building agents that can handle consequential work without requiring a human to babysit every step. The agent runs autonomously until it hits a gate, suspends cleanly, and picks up exactly where it left off after approval.

This is one of the three working templates built end-to-end in the Loop Engineering workshop at explainx.ai.


Common failure modes and fixes

Silent failures. The agent returns a result that looks valid but is wrong. Fix: add output validation after every executor call. Define a schema for what a valid step result looks like and reject results that don't match before updating state.

Infinite loops. The terminator never fires because the success signal is ambiguous. Fix: always set max_iterations at the trigger level, enforce it in the terminator as a hard cap, and log a warning (not a silent exit) when the cap is hit.

State corruption on concurrent runs. Two instances of the same loop run simultaneously and write conflicting state. Fix: use optimistic locking on the state store โ€” include a version field in the state and reject writes where the version doesn't match the current database version.

Retry storms. A transient failure causes the loop to retry aggressively, which overloads the downstream service, which causes more failures. Fix: cap total retries at the task level (not per-step), and implement circuit breakers on high-failure-rate operations.

Context window drift. In long loops, the message history passed to the executor grows until it degrades generation quality or hits the token limit. Fix: summarize old turns rather than passing the raw history โ€” keep the last N turns verbatim and compress earlier turns into a structured summary injected at the top of the message history.


Building agent loops as a career skill

Knowing how to design a loop that doesn't break silently in production is not a trivial skill. It requires understanding LLM behavior across multi-turn execution, distributed systems patterns (idempotency, circuit breakers, state machines), and human-computer interaction design for approval workflows.

This is also increasingly a skill that forward-deployed engineers are expected to bring. If you're interviewing for FDE or AI engineering roles, expect questions about agent loop design and failure handling โ€” the ability to reason through what happens when an agent hits an unexpected state is now a standard interview topic.

The patterns in this guide are practical starting points. What takes more time is implementing them against real infrastructure, debugging the edge cases that only appear under production load, and learning to tune termination conditions for your specific task type.


Loop Engineering workshop: July 20, 2026

If you want to build all three loop patterns from this guide as working systems โ€” not pseudocode, but actual running loops with state persistence, checkpoint saves, and a human approval UI โ€” the Loop Engineering workshop at explainx.ai covers exactly that.

What the session covers:

  • Wiring up triggers: scheduled, webhook, and chained agent triggers
  • State design and persistence with a real database backend
  • Retry logic with idempotency keys that actually work
  • Checkpoint placement and resume-from-checkpoint logic
  • Human-gated approval UI with a real queue-and-resume pattern
  • Three working loop templates you leave with: multi-step research runner, human-gated content pipeline, failure-resilient task runner

Details: July 20, 2026 โ€” one four-hour live session.

You can sign up at explainx.ai/workshops/loop-engineering. Seats are limited for the live session.


The core insight from every production agent loop that has failed: the problem is almost never the LLM. It's the infrastructure around the LLM โ€” the missing checkpoint that forces a full restart, the retry logic that doesn't account for idempotency, the terminator that relies on the model to decide it's done rather than enforcing a hard cap. Get the loop architecture right and the LLM part becomes much easier to reason about.

Related posts

Jun 28, 2026

Human-in-the-Loop AI: When to Let the Agent Run and When to Stop It (2026)

Most AI agent failures aren't model failures โ€” they're gate failures. Someone gave an agent write access, delete access, or send access without deciding upfront which of those actions required a human checkpoint. This guide gives you the framework to fix that.

Jun 28, 2026

Agentic context design: how to engineer the context window for multi-turn AI systems in 2026

In agentic systems, context engineering errors compound across every turn. This guide covers how to design the context window for multi-turn AI agents: from initial setup through tool output injection, context evolution, and recovery from failure states.

Jun 28, 2026

Conversation history management for AI agents: what to keep, compress, and drop in 2026

Conversation history fills up context windows faster than anything else in agentic systems. This guide covers the four strategies for managing it โ€” full retention, sliding window, summarization, and selective pruning โ€” and when to use each.