Marathon Session: 4 Sites, 50+ Features, One Weekend
A full accounting of what got built, broken, and fixed in a single continuous session — from per-org GitHub config to a working AI chat widget.
Claude (Opus 4.6) — Co-author, parallel codebase agent orchestration, all feature implementation
Claude Code (Sonnet 4.6) — Autopilot worker — parallel worktree agents, 20+ marginmandy PRs
Claude Code (Haiku 4.5) — Task decomposition, blog content drafting
Governed by curate-me.ai
The session that would not end
This started as "fix a few things on the dashboard." It ended 48 hours later with four sites redesigned, 50+ features shipped, a working AI chat widget answering real questions about hospital finance, and 109 GB of stale worktrees cleaned off the disk.
This post is the full accounting. Everything that shipped, everything that broke, everything we learned. If the sprint retro was about 10 days of building, this is about what happens when you stop sleeping and let the agents run.
The numbers
The headline number is 150+ agent runs, and they were not sequential. The worktree-based parallel agent system — each agent gets its own isolated git worktree, builds in isolation, and submits a PR — was running 3-6 features simultaneously for most of the session. That is how you ship 50+ changes in one weekend.
What we shipped, site by site
marginmandy.com
The hospital CFO blog got the most dramatic transformation. It went from a half-built Azure Static Web App to a professional editorial site with real AI features.
marginmandy.com — key changes
The chat widget deserves its own paragraph. It is not a demo. It connects through our gateway to a real LLM, with governance applied — PII scanning, cost tracking, rate limiting. When a hospital CFO asks "what CMS datasets can I use to benchmark our DRG payments," the response comes from a model with access to real public data sources. The cost: $0.003 per query. The governance metadata is visible in the response headers. That is the product demo we have been trying to build for months, and it shipped as a side effect of making marginmandy.com useful.
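A minimal sketch of what "governance metadata in the response headers" means on the client side. The header names here are illustrative assumptions, not the actual gateway contract; the point is that every governed response carries machine-readable cost, PII, and rate-limit data alongside the answer.

```python
# Hypothetical header names; the real gateway contract may differ.
def parse_governance_headers(headers: dict) -> dict:
    """Pull governance metadata out of a gateway response's headers."""
    return {
        "cost_usd": float(headers.get("x-governance-cost-usd", "0")),
        "pii_scan": headers.get("x-governance-pii-scan", "unknown"),
        "rate_limit_remaining": int(headers.get("x-ratelimit-remaining", "0")),
        "trace_id": headers.get("x-trace-id"),
    }

meta = parse_governance_headers({
    "x-governance-cost-usd": "0.003",
    "x-governance-pii-scan": "passed",
    "x-ratelimit-remaining": "97",
    "x-trace-id": "abc123",
})
# meta now holds the full governance record for this one query
```

A widget or dashboard can render this record next to the answer, which is exactly the demo described above.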
its-boris.com
This blog got a visual refresh and a content push.
- Editorial redesign: Swapped to an amber/Lora typography-first design. Warmer, more personal, less "default Next.js template."
- Blog-only navigation: Removed links to sections that do not exist yet. One nav item: Blog. Clean.
- Security post: Published a deep-dive on why AI agents need sandboxed execution environments. Based on our real container security work — seccomp profiles, VNC TLS, egress filtering.
- Weekly retro: The multi-repo retro covering 167 commits, per-org GitHub config, and the configurable deploy pipeline.
docs.curate-me.ai
The documentation site got restructured, not rewritten. The content was mostly there. The problem was information architecture.
Docs sidebar — before and after
| Before | After | Why |
|---|---|---|
| Getting Started (buried) | Evaluator (first item) | Leads with the free tool, not the signup wall |
| API Reference (top) | Quickstart → Use Cases → Pricing → API | Progressive disclosure: value before details |
| No landing page | Landing page with value props | First impression matters |
| Security buried in API docs | Dedicated security page | Enterprise buyers look for this first |
| No architecture diagrams | Visual architecture section | Pictures beat paragraphs for system overview |
The key insight: put the evaluator — a free tool that scans your OpenClaw setup for security issues — as the first sidebar item. It is the thing that provides value before you sign up. Everything else follows from that hook.
dashboard.curate-me.ai
The ops console got a surgical cleanup.
- Simplified sidebar: Cut from a sprawling nav to 4 items. Gateway, Runners, Costs, Settings. Everything else is a sub-page.
- CFO assistants page: A demo page showing pre-configured financial analysis agents — the same ones powering the marginmandy.com chat widget.
- Support fleet demo: Interactive ticket submission form that routes to a support agent fleet. Submit a ticket, see it dispatched, see the response.
- AI branding cleanup across 42 files: Every instance of "AI-powered," "intelligent," "smart," and "cutting-edge" removed or replaced with specific descriptions of what the feature actually does. "AI-powered cost tracking" became "per-request cost tracking with model-specific pricing." The before/after is embarrassing but honest.
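For a sweep like the branding cleanup, a simple scanner beats manual grepping across 42 files. This is a sketch of the flagging step, not the actual script used; the banned-phrase list comes from the bullet above, and the replace-with-specifics step stays manual because the rewrites need human judgment.

```python
import re

# Phrases named in the cleanup; extend as needed.
BANNED = ["AI-powered", "intelligent", "smart", "cutting-edge"]
PATTERN = re.compile("|".join(re.escape(p) for p in BANNED), re.IGNORECASE)

def find_marketing_speak(text: str) -> list[str]:
    """Return every banned-phrase hit so a human can replace it with specifics."""
    return PATTERN.findall(text)

hits = find_marketing_speak("AI-powered cost tracking with smart defaults")
# flags both phrases for manual rewrite
```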
Platform backend
The less visible but load-bearing work.
Backend changes — by category
The regulatory monitor is the sleeper hit. It pulls real data from the Federal Register API — proposed rules, final rules, notices affecting healthcare reimbursement — and surfaces them in a dashboard widget. Cost: $0.02 per run. Hospital CFOs care about regulatory changes that affect their revenue. This is not a demo; it is a feature that would take a human analyst hours to replicate manually every morning.
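The fetch step of such a monitor can be sketched against the public Federal Register API. Parameter names follow the documented API at federalregister.gov; the agency slug, search term, and page size here are assumptions about how this particular monitor is configured.

```python
# Builds the query only; the actual HTTP call is left to the caller,
# e.g. requests.get(url, params=params).json()["results"].
def build_reg_monitor_query(term: str = "reimbursement") -> tuple[str, dict]:
    url = "https://www.federalregister.gov/api/v1/documents.json"
    params = {
        "conditions[term]": term,
        # Proposed rules, final rules, and notices, the three document
        # types the monitor surfaces.
        "conditions[type][]": ["PRORULE", "RULE", "NOTICE"],
        "conditions[agencies][]": "centers-for-medicare-medicaid-services",
        "order": "newest",
        "per_page": 20,
    }
    return url, params

url, params = build_reg_monitor_query()
```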
What broke
Every honest retrospective has this section. Ours is not short.
Blog Docker build failure
The Dockerfile used NodeSource's apt repository for Node.js 22. NodeSource changed their URL structure, returning a 404. The build failed silently: no Node.js installed, npm not found, the build crashed. The fix: switch to the official Node.js Docker image as the base instead of adding Node.js to a Debian image. We should have done this from the start.
Additionally, package-lock.json was stale — it referenced package versions that no longer existed in the npm registry. A fresh npm install followed by committing the new lockfile fixed it. Lockfile hygiene matters when you are building in Docker.
Container streaming blocked by OpenClaw gateway
This bug returned from last week. OpenClaw's local gateway on port 18789 intercepts outbound HTTP requests. When Claude Code inside the container tries to reach the LLM API, OpenClaw grabs the request first and tries to route it through its own provider system — which has no API keys configured. The result: every LLM call fails with an auth error that looks like it is coming from the upstream provider.
The fix remains OPENCLAW_SKIP_GATEWAY=1, but this time we also had to set OPENCLAW_SKIP_PROVIDERS=1 and OPENCLAW_TEST_MINIMAL_GATEWAY=1 to fully suppress the interception. Three environment variables to disable one behavior. OpenClaw's configuration surface area is enormous.
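Wiring all three flags in at container-launch time looks roughly like this. The launch command and the function name are placeholders; the three variable names come from the incident above.

```python
import os

def gateway_free_env() -> dict:
    """Build a child-process environment with OpenClaw interception disabled."""
    env = dict(os.environ)
    env.update({
        "OPENCLAW_SKIP_GATEWAY": "1",         # stop outbound request interception
        "OPENCLAW_SKIP_PROVIDERS": "1",       # don't route via its provider system
        "OPENCLAW_TEST_MINIMAL_GATEWAY": "1", # fully suppress the local gateway
    })
    return env

# e.g. subprocess.run(["docker", "compose", "up", "worker"], env=gateway_free_env())
```

Baking the trio into one helper means no future container launch can forget one of the three and reintroduce the silent auth failures.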
Autopilot tasks marked failed despite creating PRs
A recurring NoneType bug. The autopilot worker successfully creates a PR, returns a status dictionary, and the post-execution handler tries to access .pr_url as an attribute. Dictionaries do not have attribute access in Python. The task is marked "failed." The PR exists on GitHub. The dashboard shows red.
We patched this three times during the session. The first fix added null-safe handling. The second fix added dict-to-object conversion. The third fix was the right one: enforce that all worker engines return the same AutopilotResult dataclass. No more dicts.
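The third fix can be sketched as a single return type plus a normalization shim for engines that still return dicts. Field names beyond pr_url are assumptions; the shape of the real AutopilotResult may differ.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AutopilotResult:
    success: bool
    pr_url: Optional[str] = None
    error: Optional[str] = None

def normalize(result) -> AutopilotResult:
    """Accept legacy dict returns while worker engines migrate to the dataclass."""
    if isinstance(result, AutopilotResult):
        return result
    if isinstance(result, dict):
        return AutopilotResult(
            success=bool(result.get("pr_url")),
            pr_url=result.get("pr_url"),
            error=result.get("error"),
        )
    # None or anything else: a real failure, reported as one
    return AutopilotResult(success=False, error=f"bad result type: {type(result)!r}")

r = normalize({"pr_url": "https://github.com/org/repo/pull/42"})
# r.pr_url is always attribute access, regardless of what the engine returned
```

With this in place, a created PR can no longer be marked "failed" just because the handler guessed the wrong return shape.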
Git worktree path mismatch
macOS is case-insensitive. Git is case-sensitive. The worktree system created paths like /tmp/worktree-Feature-123/ while the cleanup script looked for /tmp/worktree-feature-123/. On Linux this works. On macOS, the cleanup found the directory (case-insensitive filesystem) but git's worktree tracking did not match (case-sensitive ref). Result: orphaned worktrees that accumulated 109 GB before we caught it.
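The durable fix is to derive the worktree path from a lowercased slug so creation and cleanup agree on every filesystem. The function name and slug rule here are ours, not the orchestrator's actual code.

```python
import re

def worktree_path(branch: str, root: str = "/tmp") -> str:
    """Canonical worktree path for a branch: lowercase, hyphen-separated."""
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    return f"{root}/worktree-{slug}"

# Both casings resolve to the same path, so git's case-sensitive
# worktree refs and the cleanup script can never disagree again.
assert worktree_path("Feature-123") == worktree_path("feature-123")
```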
Chat widget calling wrong endpoint
The marginmandy.com chat widget was configured to call /v1/openai/chat/completions on our gateway. The gateway expected /v1/openrouter/chat/completions because the backing model routes through OpenRouter. The widget got a 404. The user saw "something went wrong." The fix was one line — change the endpoint path — but finding it required tracing through three layers of proxy configuration.
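One way to keep this class of mismatch from recurring is to resolve the path from the backing provider instead of hard-coding it. The two paths come from the incident above; treating them as a lookup table is our suggestion, not the gateway's actual routing code.

```python
PROVIDER_PATHS = {
    "openai": "/v1/openai/chat/completions",
    "openrouter": "/v1/openrouter/chat/completions",
}

def chat_endpoint(provider: str) -> str:
    """Fail loudly at config time instead of 404ing at request time."""
    try:
        return PROVIDER_PATHS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}")
```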
What we learned
Parallel agents in isolated worktrees are genuinely productive
This is no longer theoretical. Six features built simultaneously, each in its own worktree, each submitted as a separate PR. The isolation means no merge conflicts during development — conflicts only appear at merge time, when you can deal with them one at a time. The 150+ agent runs were not 150 sequential attempts; they were waves of 3-6 concurrent agents.
The constraint is disk space, not CPU. Each worktree is a full checkout. At peak we had 12 worktrees active, each consuming 2-4 GB. That is how we hit 109 GB before cleanup. The fix: aggressive worktree pruning after every wave.
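The post-wave pruning policy can be sketched as: keep worktrees for agents still running, delete the rest, then let git drop the stale refs. The active-set bookkeeping is an assumption about the orchestrator; `git worktree prune` is the standard git command for cleaning up administrative data after worktree directories are deleted.

```python
import shutil
import subprocess
from pathlib import Path

def stale_worktrees(root: Path, active: set[str]) -> list[Path]:
    """Worktree directories under root not owned by a currently running agent."""
    return [p for p in root.glob("worktree-*") if p.name not in active]

def prune_wave(root: Path, active: set[str]) -> None:
    for path in stale_worktrees(root, active):
        shutil.rmtree(path, ignore_errors=True)  # reclaim the 2-4 GB checkout
    # Drop git's bookkeeping for the now-missing directories.
    subprocess.run(["git", "worktree", "prune"], check=False)
```

Run after every wave, this caps disk usage at roughly (concurrent agents × checkout size) instead of letting it grow to 109 GB.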
Every feature should have a real incident it prevents
The support fleet demo is not interesting because it is a demo. It is interesting because it addresses a real scenario: a hospital CFO's financial system throws an error at 2 AM, the support ticket routes to an agent that can query the system logs, and the response includes specific remediation steps. The regulatory monitor is not interesting because it calls an API. It is interesting because missing a CMS final rule can cost a hospital millions in missed billing adjustments.
"AI-powered" is marketing. "Catches regulatory changes that affect your DRG reimbursement rates" is a feature.
Docs structure matters more than docs content
We did not write much new documentation this weekend. We restructured the sidebar, added a landing page, created a security section, and moved the evaluator to the top. The content barely changed. The experience changed dramatically. Users find what they need. The evaluator — the free hook — is impossible to miss. Architecture diagrams make the system comprehensible in 30 seconds.
If your docs are not converting, restructure before you rewrite.
The chat widget with governance metadata is the killer demo
When someone asks the marginmandy.com chat widget a question, the response comes back with:
- The answer (useful, sourced)
- Governance cost: $0.003
- PII scan: passed (no sensitive data in the query)
- Rate limit: 97 remaining
- Model: routed through OpenRouter
- Trace ID: available for replay
That is not a chatbot. That is a governed AI interaction with full observability. It demonstrates cost tracking, PII scanning, rate limiting, and tracing — all in a single user interaction. No slide deck needed. The widget is the pitch.
What is working end-to-end right now
These are not demos, prototypes, or "works on my machine" features. These are live, deployed, tested systems.
E2E working systems — March 25, 2026
| System | Flow | Cost |
|---|---|---|
| marginmandy.com chat widget | User question → gateway → PII scan → LLM → governed response | $0.003/query |
| Regulatory monitor | Federal Register API → parse rules → summarize impact → dashboard widget | $0.02/run |
| Support fleet demo | Ticket submission → agent dispatch → log analysis → remediation response | $0.05/ticket |
| Dev-team pipeline | GitHub issue → AI decomposition → Docker container → PR → deploy | $0.22/task |
| All 4 sites | its-boris.com, marginmandy.com, docs.curate-me.ai, dashboard.curate-me.ai | Live, professional, deployed |
No purple. No placeholder content. No "coming soon" pages. Four sites, all live, all doing something real.
The uncomfortable math
50+ features in one weekend sounds impressive. But we also broke the Docker build, shipped a chat widget pointed at the wrong endpoint, accumulated 109 GB of orphaned worktrees, and patched the same NoneType bug three times before fixing it properly.
The velocity is real. So is the error rate. The ratio is what matters. Every bug was found and fixed within the same session; nothing hit users for more than an hour. The governance layer — the thing we are selling — caught the PII leak attempt in the chat widget before it reached the model. The cost tracker recorded every cent.
If you are going to move this fast, you need the observability to match. We are building that observability. And this weekend proved we actually need it.
This post was co-authored by Claude Code (Opus 4.6). All numbers are real: 109 GB measured by du -sh on worktree directories, 316 branches counted by git branch -r | wc -l before and after pruning, 150+ agent runs counted from MongoDB autopilot_results collection. The bugs described were real bugs that caused real failures during this session.
Previous: Week in Review: The AI Dev Team Goes Multi-Repo — the 167-commit week that set up everything we shipped this weekend.