Webhook-Driven Agent Architecture

The communication problem

When AI agents run on external infrastructure (BYOVM runners, cloud containers, edge workers), you need a way for them to send results back to your application. Three options:

Polling — Your app periodically asks "is there a result yet?" Simple but wasteful. With 9 agents, you'd be polling constantly.

WebSockets — Persistent bidirectional connection. Real-time, but complex to manage across container restarts, network interruptions, and load balancers.

Webhooks — Agents POST results to a URL when they're done. Stateless, retry-friendly, works with any infrastructure.

We chose webhooks. Here's why — and what we learned implementing a single endpoint that handles 15 different event types.

The single endpoint pattern

All agent results go to one URL: POST /api/webhooks/agent. The payload includes a type field that determines how the result is processed:

{
  "type": "research_brief",
  "agent": "blog-researcher",
  "payload": {
    "topic": "AI agent governance trends",
    "summary": "Three key developments this week...",
    "sources": ["https://..."]
  }
}

A single endpoint (instead of per-agent or per-event-type endpoints) has advantages:

One place to add authentication — The webhook secret check happens once
One place to add logging — Every event gets logged to agent_logs
One place to handle errors — Consistent error responses and retries
One place to add fleet memory writes — Learnings are captured after every event

The tradeoff is a large handler function. Ours is ~1,000 lines. But it's a router — each type case is independent and easy to reason about in isolation.

Event types we handle

| Type | Agent | What happens | |------|-------|-------------| | research_brief | blog-researcher | Write to knowledge base + Slack notification | | draft_post | blog-writer | Save to drafts table + trigger refinement loop | | draft_revision | blog-writer | Update draft + continue refinement | | daily_digest | daily-digest | Full pipeline: refine → approve → publish → email | | digest_approved | (Slack action) | Publish post + send newsletter | | moderation_action | blog-moderator | Apply to comments (approve/flag/remove) | | feedback_triage | blog-analyst | Update feedback status + extract patterns | | analytics_report | blog-analyst | Write insights to KB + fleet memory | | social_content | blog-promoter | Draft social posts for HITL approval | | social_scan | social-scanner | Write trends to KB | | openclaw_update | openclaw-tracker | Write release notes to KB | | fleet_health | fleet-monitor | Log health status | | preview_ready | blog-dev | Configure Caddy route for preview URL | | dev_change | blog-dev | Code review + Slack notification | | code_review | (Opus reviewer) | Score + deploy decision |

Payload flexibility

Early on, we hit a problem: different OpenClaw versions structured payloads differently. Some sent nested payloads ({ type, agent, payload: {...} }), others sent flat bodies ({ type, agent, content, ... }).

Our handler gracefully handles both:

const { type, agent } = body;
const payload = body.payload ?? (() => {
  const { type: _t, agent: _a, payload: _p, ...rest } = body;
  return rest;
})();

This defensive parsing saved us from breaking changes in the runner infrastructure.

What gets logged

Every webhook event produces three artifacts:

Agent log entry — Stored in the agent_logs table with agent name, webhook type, full payload, and status
Fleet memory write — Key learnings extracted and written to shared memory for other agents to reference
Slack notification — Human-readable summary with interactive buttons where applicable

The logging is non-blocking — fleet memory writes and Slack notifications happen asynchronously with .catch() handlers. If Slack is down, the webhook still processes successfully.

Authentication

Webhooks use a shared secret:

function verifyWebhook(req: NextRequest): boolean {
  const secret = req.headers.get("x-webhook-secret");
  if (!WEBHOOK_SECRET) return true; // Allow in dev
  return secret === WEBHOOK_SECRET;
}

In dev, the check is skipped. In production, every request must include the correct x-webhook-secret header. The secret is shared between the blog and the curate-me.ai gateway, which includes it in all runner-to-blog webhook calls.

Retry handling

Webhooks can fail. Networks are unreliable. Our handler returns:

200 on success
401 on auth failure (don't retry)
400 on bad payload (don't retry)
500 on processing error (retry)

The curate-me.ai gateway automatically retries 5xx responses with exponential backoff. This means our handler needs to be idempotent — processing the same research brief twice should write the same knowledge base entry, not duplicate it.

Why not events/queues?

For a reference app, webhooks are the right choice. They're universally understood, easy to debug (just look at the logs), and don't require additional infrastructure.

For a higher-scale production system, you might want a message queue (SQS, Redis Streams, NATS) between the runners and the blog. But that adds complexity and infrastructure cost that doesn't serve the reference app's purpose of demonstrating curate-me.ai capabilities clearly and simply.

The full webhook handler is visible on GitHub, and the event flow is visualized on the how it works page.