
Marathon Session: 4 Sites, 50+ Features, One Weekend

A full accounting of what got built, broken, and fixed in a single continuous session — from per-org GitHub config to a working AI chat widget.

March 25, 2026 · 12 min read
AI Collaboration

Claude (Opus 4.6) · Co-author, parallel codebase agent orchestration, all feature implementation

Claude Code (Sonnet 4.6) · Autopilot worker: parallel worktree agents, 20+ marginmandy PRs

Claude Code (Haiku 4.5) · Task decomposition, blog content drafting

Total AI cost: $0.14

Governed by curate-me.ai

The session that would not end

This started as "fix a few things on the dashboard." It ended 48 hours later with four sites redesigned, 50+ features shipped, a working AI chat widget answering real questions about hospital finance, and 109 GB of stale worktrees cleaned off the disk.

This post is the full accounting. Everything that shipped, everything that broke, everything we learned. If the sprint retro was about 10 days of building, this is about what happens when you stop sleeping and let the agents run.

The numbers

  • 4 sites deployed (all live, all professional)
  • 50+ features and fixes (across all sites)
  • 150+ agent runs (parallel worktree agents)
  • 109 GB of stale worktrees cleaned (disk reclaimed)
  • 316 branches pruned (stale refs deleted)
  • 20+ MarginMandy PRs (all merged)
  • 6 blog posts written (across 2 sites)
  • 101+ new tests (backend + gateway)

Those 150+ agent runs were not sequential. The worktree-based parallel agent system — each agent gets its own isolated git worktree, builds in isolation, submits a PR — was running 3-6 features simultaneously for most of the session. That is how you ship 50+ changes in one weekend.
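The wave pattern is simple enough to sketch. This is a hypothetical illustration of the scheduling shape, not the pipeline's actual code: each "agent" here is a stub, where the real version would run `git worktree add`, build, and open a PR inside its own checkout.

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(feature: str) -> str:
    # isolated checkout per agent; the real worker would shell out to
    # `git worktree add /tmp/worktree-<feature> -b <feature>`, build, and open a PR
    worktree = f"/tmp/worktree-{feature}"
    return f"PR opened from {worktree}"

def run_wave(features: list[str], width: int = 6) -> list[str]:
    # at most `width` agents build concurrently; merge conflicts are deferred
    # to merge time, where they are resolved one PR at a time
    with ThreadPoolExecutor(max_workers=width) as pool:
        return list(pool.map(run_agent, features))

results = run_wave(["feat-1", "feat-2", "feat-3"])
```

The isolation is the point: concurrency lives entirely in the scheduler, and nothing in one worktree can dirty another.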

What we shipped, site by site

marginmandy.com

The hospital CFO blog got the most dramatic transformation. It went from a half-built Azure Static Web App to a professional editorial site with real AI features.

marginmandy.com — key changes

  • Day 1 · Editorial redesign: typography-first layout, warm color palette, professional feel
  • Day 1 · 6 blog posts: free datasets, denial management, financial benchmarks, regulatory compliance
  • Day 1 · Chat widget with real AI: ask questions about hospital finance, get sourced answers
  • Day 1 · CFO assistant templates: pre-built prompts for common financial analysis tasks
  • Day 2 · Stripped to blog-only: removed unused sections, clean single-purpose nav
  • Day 2 · AI branding cleanup: removed "AI-powered" jargon from every page

The chat widget deserves its own paragraph. It is not a demo. It connects through our gateway to a real LLM, with governance applied — PII scanning, cost tracking, rate limiting. When a hospital CFO asks "what CMS datasets can I use to benchmark our DRG payments," the response comes from a model with access to real public data sources. The cost: $0.003 per query. The governance metadata is visible in the response headers. That is the product demo we have been trying to build for months, and it shipped as a side effect of making marginmandy.com useful.

its-boris.com

This blog got a visual refresh and a content push.

  • Editorial redesign: Swapped to an amber/Lora typography-first design. Warmer, more personal, less "default Next.js template."
  • Blog-only navigation: Removed links to sections that do not exist yet. One nav item: Blog. Clean.
  • Security post: Published a deep-dive on why AI agents need sandboxed execution environments. Based on our real container security work — seccomp profiles, VNC TLS, egress filtering.
  • Weekly retro: The multi-repo retro covering 167 commits, per-org GitHub config, and the configurable deploy pipeline.

docs.curate-me.ai

The documentation site got restructured, not rewritten. The content was mostly there. The problem was information architecture.

Docs sidebar — before and after

  • Before: Getting Started (buried). After: Evaluator (first item). Why: leads with the free tool, not the signup wall
  • Before: API Reference on top. After: Quickstart → Use Cases → Pricing → API. Why: progressive disclosure, value before details
  • Before: no landing page. After: landing page with value props. Why: first impression matters
  • Before: security buried in API docs. After: dedicated security page. Why: enterprise buyers look for this first
  • Before: no architecture diagrams. After: visual architecture section. Why: pictures beat paragraphs for system overview

The key insight: put the evaluator — a free tool that scans your OpenClaw setup for security issues — as the first sidebar item. It is the thing that provides value before you sign up. Everything else follows from that hook.

dashboard.curate-me.ai

The ops console got a surgical cleanup.

  • Simplified sidebar: Cut from a sprawling nav to 4 items. Gateway, Runners, Costs, Settings. Everything else is a sub-page.
  • CFO assistants page: A demo page showing pre-configured financial analysis agents — the same ones powering the marginmandy.com chat widget.
  • Support fleet demo: Interactive ticket submission form that routes to a support agent fleet. Submit a ticket, see it dispatched, see the response.
  • AI branding cleanup across 42 files: Every instance of "AI-powered," "intelligent," "smart," and "cutting-edge" removed or replaced with specific descriptions of what the feature actually does. "AI-powered cost tracking" became "per-request cost tracking with model-specific pricing." The before/after is embarrassing but honest.

Platform backend

The less visible but load-bearing work.

Backend changes — by category

  • Pipeline bug fixes (7): NoneType, path mismatch, endpoint routing
  • Dockerfile fixes (3): NodeSource 404, package-lock, base image
  • Support fleet + LLM (2): ticket routing, response generation
  • Slack bot improvements (2): threaded replies, status updates
  • Regulatory monitor (1): real Federal Register API integration
  • MCP hospital finance server (1): tools for financial data queries
  • ContainerExecutor utility (1): shared Docker management across 3 callers

The regulatory monitor is the sleeper hit. It pulls real data from the Federal Register API — proposed rules, final rules, notices affecting healthcare reimbursement — and surfaces them in a dashboard widget. Cost: $0.02 per run. Hospital CFOs care about regulatory changes that affect their revenue. This is not a demo; it is a feature that would take a human analyst hours to replicate manually every morning.
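The Federal Register's public API takes `conditions[...]`-style query parameters, so a monitor query is just a parameterized GET. The sketch below builds such a request; the exact filter terms are an assumption about what the monitor asks for, not its actual configuration.

```python
from urllib.parse import urlencode

# Build a query against the public Federal Register API
# (https://www.federalregister.gov/api/v1). Filter values here are
# illustrative; the monitor's real terms may differ.
params = urlencode({
    "conditions[term]": "hospital reimbursement",
    "conditions[type][]": "RULE",   # final rules; "PRORULE" would be proposed rules
    "per_page": 20,
    "order": "newest",
})
url = f"https://www.federalregister.gov/api/v1/documents.json?{params}"
# fetching is a plain GET, e.g. urllib.request.urlopen(url), returning JSON
# with title, publication_date, and abstract fields to summarize
```

From there the monitor's job is summarization and surfacing, which is where the $0.02 per run goes.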

What broke

Every honest retrospective has this section. Ours is not short.

Blog Docker build failure

The Dockerfile used NodeSource's apt repository for Node.js 22. NodeSource changed its URL structure, and the repository setup started returning a 404. That step failed silently — no Node.js installed — and the build crashed later when npm was not found. The fix: switch to the official Node.js Docker image as the base instead of adding Node.js to a Debian image. Should have done this from the start.

Additionally, package-lock.json was stale — it referenced package versions that no longer existed in the npm registry. A fresh npm install followed by committing the new lockfile fixed it. Lockfile hygiene matters when you are building in Docker.

Container streaming blocked by OpenClaw gateway

This bug returned from last week. OpenClaw's local gateway on port 18789 intercepts outbound HTTP requests. When Claude Code inside the container tries to reach the LLM API, OpenClaw grabs the request first and tries to route it through its own provider system — which has no API keys configured. The result: every LLM call fails with an auth error that looks like it is coming from the upstream provider.

The fix remains OPENCLAW_SKIP_GATEWAY=1, but this time we also had to set OPENCLAW_SKIP_PROVIDERS=1 and OPENCLAW_TEST_MINIMAL_GATEWAY=1 to fully suppress the interception. Three environment variables to disable one behavior. OpenClaw's configuration surface area is enormous.
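The three variables can be baked into the container launch so no agent ever runs without them. The docker invocation below is illustrative (the image name is a placeholder), but the variable names are the ones the fix required.

```python
# The three suppression flags, passed into the agent container so OpenClaw's
# local gateway on 18789 never intercepts outbound LLM traffic.
skip_gateway_env = {
    "OPENCLAW_SKIP_GATEWAY": "1",
    "OPENCLAW_SKIP_PROVIDERS": "1",
    "OPENCLAW_TEST_MINIMAL_GATEWAY": "1",
}

# Expand to repeated `-e KEY=VALUE` arguments for `docker run`.
env_flags = [arg for k, v in skip_gateway_env.items() for arg in ("-e", f"{k}={v}")]
cmd = ["docker", "run", *env_flags, "agent-image"]  # hypothetical image name
```

Centralizing the dict means the next OpenClaw flag gets added in one place instead of being rediscovered per container.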

Autopilot tasks marked failed despite creating PRs

A recurring NoneType bug. The autopilot worker successfully creates a PR, returns a status dictionary, and the post-execution handler tries to access .pr_url as an attribute. Dictionaries do not have attribute access in Python. The task is marked "failed." The PR exists on GitHub. The dashboard shows red.

We patched this three times during the session. The first fix added null-safe handling. The second fix added dict-to-object conversion. The third fix was the right one: enforce that all worker engines return the same AutopilotResult dataclass. No more dicts.
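The shape of that third fix looks roughly like this. Field names are illustrative, not the production schema; the point is that one dataclass contract makes attribute access safe in every handler.

```python
from dataclasses import dataclass
from typing import Optional

# One result type that every worker engine must return. A plain dict here
# would raise AttributeError on `.pr_url` — the exact bug that marked
# succeeding tasks as failed.
@dataclass
class AutopilotResult:
    success: bool
    pr_url: Optional[str] = None
    error: Optional[str] = None

def handle(result: AutopilotResult) -> str:
    # post-execution handler can now rely on attributes existing
    if result.success and result.pr_url:
        return f"merged-candidate: {result.pr_url}"
    return f"failed: {result.error}"

status = handle(AutopilotResult(success=True, pr_url="https://github.com/org/repo/pull/1"))
```

The null-safe and dict-to-object patches treated symptoms; the dataclass removes the ambiguity at the type level.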

Git worktree path mismatch

macOS's default filesystem is case-insensitive. Git refs are case-sensitive. The worktree system created paths like /tmp/worktree-Feature-123/ while the cleanup script looked for /tmp/worktree-feature-123/. On Linux this works. On macOS, the cleanup found the directory (case-insensitive filesystem) but git's worktree tracking did not match (case-sensitive ref). Result: orphaned worktrees that accumulated 109 GB before we caught it.
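One minimal guard, under the /tmp/worktree-&lt;branch&gt; convention described above: derive every worktree path through a single function that normalizes case, so creation and cleanup always agree regardless of filesystem semantics.

```python
# All path construction goes through this one function; lowercasing the
# branch name means the creation side and the cleanup side can never
# disagree on case, even on a case-insensitive filesystem.
def worktree_path(branch: str) -> str:
    return f"/tmp/worktree-{branch.lower()}"

# "Feature-123" and "feature-123" now map to the same path
assert worktree_path("Feature-123") == worktree_path("feature-123")
```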

Chat widget calling wrong endpoint

The marginmandy.com chat widget was configured to call /v1/openai/chat/completions on our gateway. The gateway expected /v1/openrouter/chat/completions because the backing model routes through OpenRouter. The widget got a 404. The user saw "something went wrong." The fix was one line — change the endpoint path — but finding it required tracing through three layers of proxy configuration.
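The one-line nature of the fix is easier to see when the endpoint is built from a provider segment. This is a hedged sketch, not the widget's actual code, and the gateway base URL is a placeholder.

```python
# The gateway routes /v1/<provider>/chat/completions; the widget was
# configured with the wrong provider segment.
GATEWAY_BASE = "https://gateway.example.com"  # placeholder, not the real deployment

def chat_endpoint(provider: str) -> str:
    return f"{GATEWAY_BASE}/v1/{provider}/chat/completions"

broken = chat_endpoint("openai")       # 404 — no such route on the gateway
fixed = chat_endpoint("openrouter")    # the path the gateway actually serves
```

Deriving the path from one configured provider name, instead of hardcoding it in the widget, is what makes this a one-line fix next time.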

What we learned

Parallel agents in isolated worktrees are genuinely productive

This is no longer theoretical. Six features built simultaneously, each in its own worktree, each submitted as a separate PR. The isolation means no merge conflicts during development — conflicts only appear at merge time, when you can deal with them one at a time. The 150+ agent runs were not 150 sequential attempts; they were waves of 3-6 concurrent agents.

The constraint is disk space, not CPU. Each worktree is a full checkout. At peak we had 12 worktrees active, each consuming 2-4 GB. That is how we hit 109 GB before cleanup. The fix: aggressive worktree pruning after every wave.
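The pruning pass can be driven from `git worktree list --porcelain`, whose output lists each worktree as a `worktree <path>` stanza. The parsing below matches git's documented porcelain format; the "delete everything except the main checkout" policy is this session's convention, not a git rule.

```python
# Parse `git worktree list --porcelain` output and return every registered
# worktree path except the first entry, which is always the main working tree.
def stale_worktrees(porcelain: str) -> list[str]:
    paths = [line.split(" ", 1)[1]
             for line in porcelain.splitlines()
             if line.startswith("worktree ")]
    return paths[1:]

sample = (
    "worktree /repo\nHEAD abc123\nbranch refs/heads/main\n\n"
    "worktree /tmp/worktree-feat-1\nHEAD def456\nbranch refs/heads/feat-1\n"
)
doomed = stale_worktrees(sample)
# real version: shutil.rmtree(path) for each, then `git worktree prune`
```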

Every feature should have a real incident it prevents

The support fleet demo is not interesting because it is a demo. It is interesting because it addresses a real scenario: a hospital CFO's financial system throws an error at 2 AM, the support ticket routes to an agent that can query the system logs, and the response includes specific remediation steps. The regulatory monitor is not interesting because it calls an API. It is interesting because missing a CMS final rule can cost a hospital millions in missed billing adjustments.

"AI-powered" is marketing. "Catches regulatory changes that affect your DRG reimbursement rates" is a feature.

Docs structure matters more than docs content

We did not write much new documentation this weekend. We restructured the sidebar, added a landing page, created a security section, and moved the evaluator to the top. The content barely changed. The experience changed dramatically. Users find what they need. The evaluator — the free hook — is impossible to miss. Architecture diagrams make the system comprehensible in 30 seconds.

If your docs are not converting, restructure before you rewrite.

The chat widget with governance metadata is the killer demo

When someone asks the marginmandy.com chat widget a question, the response comes back with:

  • The answer (useful, sourced)
  • Governance cost: $0.003
  • PII scan: passed (no sensitive data in the query)
  • Rate limit: 97 remaining
  • Model: routed through OpenRouter
  • Trace ID: available for replay

That is not a chatbot. That is a governed AI interaction with full observability. It demonstrates cost tracking, PII scanning, rate limiting, and tracing — all in a single user interaction. No slide deck needed. The widget is the pitch.
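The post says this metadata rides in the response headers but does not name them, so every header key in this sketch is hypothetical — it only illustrates how a client might surface the governance trail alongside the answer.

```python
# Hypothetical header names; the real gateway's keys may differ.
headers = {
    "x-governance-cost-usd": "0.003",
    "x-governance-pii-scan": "passed",
    "x-ratelimit-remaining": "97",
    "x-governance-trace-id": "tr_abc123",
}

def governance_summary(h: dict[str, str]) -> str:
    # condense the governance trail into a one-line footer for the widget
    return (f"cost=${h['x-governance-cost-usd']} "
            f"pii={h['x-governance-pii-scan']} "
            f"remaining={h['x-ratelimit-remaining']}")

summary = governance_summary(headers)
```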

What is working end-to-end right now

These are not demos, prototypes, or "works on my machine" features. These are live, deployed, tested systems.

E2E working systems — March 25, 2026

  • marginmandy.com chat widget: user question → gateway → PII scan → LLM → governed response ($0.003/query)
  • Regulatory monitor: Federal Register API → parse rules → summarize impact → dashboard widget ($0.02/run)
  • Support fleet demo: ticket submission → agent dispatch → log analysis → remediation response ($0.05/ticket)
  • Dev-team pipeline: GitHub issue → AI decomposition → Docker container → PR → deploy ($0.22/task)
  • All 4 sites: its-boris.com, marginmandy.com, docs.curate-me.ai, dashboard.curate-me.ai (live, professional, deployed)

No purple. No placeholder content. No "coming soon" pages. Four sites, all live, all doing something real.

The uncomfortable math

50+ features in one weekend sounds impressive. But we also broke the Docker build, shipped a chat widget pointed at the wrong endpoint, accumulated 109 GB of orphaned worktrees, and patched the same NoneType bug three times before fixing it properly.

The velocity is real. So is the error rate. The ratio is what matters. In this session, every bug was found and fixed within the same session. Nothing hit users for more than an hour. The governance layer — the thing we are selling — caught the PII leak attempt in the chat widget before it reached the model. The cost tracker recorded every cent.

If you are going to move this fast, you need the observability to match. We are building that observability. And this weekend proved we actually need it.

This post was co-authored by Claude Code (Opus 4.6). All numbers are real: 109 GB measured by du -sh on worktree directories, 316 branches counted by git branch -r | wc -l before and after pruning, 150+ agent runs counted from MongoDB autopilot_results collection. The bugs described were real bugs that caused real failures during this session.


Previous: Week in Review: The AI Dev Team Goes Multi-Repo — the 167-commit week that set up everything we shipped this weekend.
