Week in Review: The AI Dev Team Goes Multi-Repo
An honest engineering retrospective covering March 19-24, 2026 — 167 commits across 3 repos, 14 merged PRs on a brand-new project, per-org GitHub config, a configurable deploy pipeline, 12 pluggable worker engines, and every bug that bit us along the way. LLM cost: $0.
Claude (Opus 4.6) — Co-author, parallel codebase exploration, pipeline architecture, all 167 platform commits
Claude Code (Sonnet 4.6) — Autopilot worker — first cross-repo PR (marginmandy #5, 23 files)
Claude Code (Haiku 4.5) — Task decomposition, AI advisor complexity scoring
Governed by curate-me.ai
The week in one sentence
We took an AI dev team pipeline that worked on one repo and made it work on three — and the hardest part was not the code.
Context
Last week's sprint retro covered 319 commits, a 100x billing bug, and a broken governance chain that hid behind Docker's layer cache. This week's pipeline deep-dive explained the architecture: GitHub issue, AI decomposition, Docker container running Claude Code, automated review, configurable deploy. PR #1235 was the proof — 179 seconds, $0.22, two lines of code.
This post covers what happened next: taking that pipeline from a single-repo demo to a multi-repo system running against three different GitHub organizations with three different deploy targets. And the bugs we found along the way.
The numbers
167 commits on the platform repo in 6 days. 67 features, 80 bug fixes. More fixes than features — that ratio tells you this was an integration week, not a greenfield week. We were connecting real things to other real things, and every connection point had a bug.
On the Margin Mandy repo — a hospital CFO blog that started as an Azure Static Web App — we merged 14 PRs in one day. The repo went from a Contentful-backed SPA to a Docker-deployed MDX blog on our VPS. That was the first real test of the multi-repo pipeline.
What we actually shipped
1. Per-org GitHub config
The AI dev team pipeline was hardcoded to Curate-Me-ai/platform. Every board query, every PR creation, every issue fetch assumed one repo, one project board, one set of status field IDs.
This week we made it configurable. Each organization stores a github_config in its org_devteam_config MongoDB document:
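A minimal sketch of what such a document might look like. Only `github_config`, `org_devteam_config`, and `github_token` appear in this post; every other field name here is an illustrative assumption, not the real schema:

```python
# Hypothetical shape of an org_devteam_config document. Field names other
# than github_config and github_token are guesses for illustration.
org_devteam_config = {
    "org_id": "curate-me",
    "github_config": {
        "owner": "Curate-Me-ai",
        "repo": "platform",
        "github_token": "ghp_...",             # needs repo scope for the target repo
        "project_owner_type": "organization",  # or "user" for user-owned boards
        "project_number": 3,
        "status_field_id": "PVTSSF_...",       # Projects v2 single-select field ID
    },
    "deploy_template": "platform_vps",
}

def github_repo_slug(cfg: dict) -> str:
    """Derive the owner/repo slug every board query and PR call needs."""
    gh = cfg["github_config"]
    return f"{gh['owner']}/{gh['repo']}"
```

Once every pipeline call site reads the slug, token, and board IDs from this document instead of constants, adding an org is a database write, not a code change.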
Configured organizations
| Org | Repo | Deploy Target | Board |
|---|---|---|---|
| Curate-Me | Curate-Me-ai/platform | VPS Docker Compose | GitHub Projects v2 |
| Its Boris Blog | its-boris/blog | SSH + Docker restart | GitHub Projects v2 |
| Margin Mandy | its-boris/marginmandy | SSH + Docker restart | Not yet configured |
The blog board required special handling: it is a user-owned project (not org-owned), which means the GraphQL query uses user instead of organization. That one-word difference cost an hour of debugging. GitHub's Projects v2 API uses entirely different query roots depending on ownership type, and the error message when you get it wrong is just "Could not resolve to a ProjectV2."
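The one-word difference can be isolated in the query builder, so the ownership type lives in config rather than in the query text. A sketch (the helper name is ours; the `organization`/`user` roots and `projectV2(number:)` field are GitHub's actual GraphQL schema):

```python
def project_query(owner: str, number: int, owner_type: str) -> str:
    """Build a Projects v2 lookup. GitHub's GraphQL API exposes projectV2
    under different roots: `organization` for org-owned boards, `user` for
    user-owned ones. Querying the wrong root yields only the opaque
    "Could not resolve to a ProjectV2" error."""
    if owner_type not in ("organization", "user"):
        raise ValueError(f"unknown owner type: {owner_type}")
    return f"""
    query {{
      {owner_type}(login: "{owner}") {{
        projectV2(number: {number}) {{ id title }}
      }}
    }}"""
```

With `project_owner_type` stored per org, the blog's user-owned board and the platform's org-owned board go through the same code path.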
2. Configurable deploy pipeline
The deploy pipeline went from "merge and SSH" to a proper step-based system with 9 step types and 4 templates:
Deploy pipeline — step types
The four templates: platform_vps (Docker Compose rebuild), blog_ssh (merge, SSH deploy, health check, close issue, Slack), static_deploy (for static sites), and manual (just merge and notify). Each org picks its template. Steps execute sequentially; if any step fails, the pipeline halts and posts a failure notification.
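The halt-on-failure behavior can be sketched in a few lines. The step names follow the `blog_ssh` template described above; the handler interface (a callable returning success/failure) is an assumption:

```python
from typing import Callable

def run_pipeline(steps: list[str], handlers: dict[str, Callable[[], bool]]) -> list[str]:
    """Execute deploy steps in order; halt at the first failure so a bad
    health check never reaches close_issue or the Slack notification."""
    completed: list[str] = []
    for step in steps:
        if not handlers[step]():
            completed.append(f"FAILED:{step}")  # pipeline stops here
            break
        completed.append(step)
    return completed

# Step sequence of the blog_ssh template described in this post.
blog_ssh = ["merge", "ssh_deploy", "health_check", "close_issue", "slack_notify"]
```

If `health_check` fails, the run ends with `FAILED:health_check` and the issue stays open, which is exactly what you want: a deploy that did not come up healthy should not be reported as done.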
3. Twelve pluggable worker engines
The autopilot system started with one way to run code: spin up a Docker container with Claude Code CLI. That was fine for the platform repo, but different tasks need different engines. We built a WorkerEngineProtocol with 12 implementations:
- Claude Code CLI (Docker) — the original, full repo checkout
- Claude Code Opus — same container, explicit Opus model
- Claude Code Sonnet — default, good for most tasks
- Codex GPT-5.4 — OpenAI's code model, used for review
- Claude API Direct — raw API call, no container overhead
- OpenRouter Passthrough — route through OpenRouter for model flexibility
- Streaming Container — VNC-enabled container with desktop access
- BYOVM Dispatch — send work to a registered bare-metal machine
- Local Shell — run on the host (dev only, never production)
- Dry Run — log what would happen without executing
- Composite — chain multiple engines for multi-step tasks
- Smart Selection — pick the best engine based on task complexity
Smart selection is the interesting one. It analyzes the task description, estimates complexity (file count, test requirements, cross-repo impact), and picks the cheapest engine that can handle it. Simple typo fix? Claude API Direct, no container. Multi-file refactor? Claude Code Sonnet in Docker. Architecture change? Claude Code Opus with extended context.
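A toy version of that scoring logic, to make the idea concrete. The signals match the ones named above (file count, test requirements, cross-repo impact), but the weights, thresholds, and engine names here are illustrative, not the production values:

```python
def pick_engine(description: str, file_count: int, needs_tests: bool,
                cross_repo: bool) -> str:
    """Score task complexity and pick the cheapest engine that can
    handle it. Weights and cutoffs are made-up for illustration."""
    score = file_count
    score += 3 if needs_tests else 0
    score += 5 if cross_repo else 0
    if any(w in description.lower() for w in ("refactor", "architecture", "migrate")):
        score += 4
    if score <= 2:
        return "claude-api-direct"   # trivial change: no container overhead
    if score <= 10:
        return "claude-code-sonnet"  # multi-file work: Docker, full checkout
    return "claude-code-opus"        # heavy change: extended context
```

The hard part, as the "what is next" section admits, is not the scoring function but calibrating it against real completions.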
4. First autopilot PR on a non-platform repo
Margin Mandy PR #5 was the moment it became real. The autopilot pipeline — running against its-boris/marginmandy instead of Curate-Me-ai/platform — checked out the repo, analyzed the codebase, and created a PR touching 23 files. Frontend modernization: updated component structure, cleaned up styles, added responsive layout fixes.
It was not perfect. The PR needed manual cleanup afterward. But it proved the architecture: per-org config routes the right repo to the right container, the container clones the right code, Claude Code works on it, and the PR lands on the right repo. None of that was hardcoded.

5. Margin Mandy: from Azure to Docker in one day
The marginmandy repo was an old Azure Static Web App with a Contentful CMS backend. In one day, we merged 14 PRs that transformed it end to end.
Margin Mandy transformation — March 24
The content pivot is worth noting. The original Margin Mandy content was generic. The rewrite focused on publicly available datasets — CMS hospital cost data, DOGE spending records — because real data is more interesting than opinions. The "15 Free Datasets" post uses real URLs to real government data portals. No synthetic content.
6. Dashboard UI redesign
The dev-team dashboard page was a mess. Full issue bodies rendered as raw text in kanban cards. Thirteen duplicate "Failed" cards from test runs cluttering the Done column. No visual hierarchy.
We shipped a professional redesign: compact task cards (title + status badge + cost pill + PR link), clean Ready Issues panel, grouped Done column, AI advisor complexity badges, and a live VNC streaming viewer for watching container executions in real-time. The page went from prototype to product.
7. Shared ContainerExecutor
The autopilot worker (worker.py), fleet executor, and BYOVM dispatch all spin up Docker containers. They each had their own container management code — slightly different, slightly buggy in different ways. We extracted a shared ContainerExecutor utility and migrated all three. One place to fix Docker bugs, one place to add features.
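The value of the extraction is that all three call sites build containers identically. A sketch of the idea — this version only constructs the `docker run` argument list (actual execution via subprocess is omitted), and the method and flag choices are assumptions:

```python
class ContainerExecutor:
    """Shared container launcher used by the autopilot worker, fleet
    executor, and BYOVM dispatch. One place to fix Docker bugs."""

    def __init__(self, image: str):
        self.image = image

    def build_args(self, repo_url: str, env: dict[str, str]) -> list[str]:
        """Assemble a docker run invocation with injected environment."""
        args = ["docker", "run", "--rm"]
        for key, value in sorted(env.items()):  # stable order aids log diffing
            args += ["-e", f"{key}={value}"]
        args += [self.image, "autopilot-run", "--repo", repo_url]
        return args
```

Before the extraction, each copy had subtly different flag handling; now a fix to, say, environment injection lands in every execution path at once.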
What broke
"No path specified" — Mode 2 orchestrator bug
The dev-team orchestrator has two modes: Mode 1 (full decomposition into subtasks) and Mode 2 (single-shot, send the whole task to one container). Mode 2 was passing the task description to Claude Code but not the repo path. Claude Code would start, see no repository, and fail with a cryptic "no path specified" error.
The fix was two lines — pass repo_url through the Mode 2 code path — but finding it took an hour because the error appeared inside the Docker container's logs, not in our application logs. Container-internal errors are always harder to debug.
NoneType error on successful PRs
Tasks were being marked as "failed" even when they successfully created PRs. The post-execution handler tried to access result.pr_url but result was None because the worker returned a status dict, not the autopilot result object. The PR existed on GitHub. The dashboard said "failed." Users would see red when they should see green.
This is a type-safety problem. Python's duck typing meant the worker could return either a dict or an object, and nobody complained until the handler tried to access an attribute on a dict. We added null-safe handling and explicit type checks, but the real fix is the WorkerEngineProtocol — all engines now return the same result type.
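The shape of that fix can be sketched as a normalizer that accepts every legacy return shape and coerces it into one type, so no handler ever touches a bare dict again. Field and class names here are assumptions standing in for the real result object:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class WorkerResult:
    """Uniform result type all engines return (the idea behind
    WorkerEngineProtocol; field names are illustrative)."""
    status: str
    pr_url: Optional[str] = None

def normalize_result(raw: Any) -> WorkerResult:
    """Accept the legacy shapes (None, status dict, or result object)
    so a successful PR is never reported as failed by a type mismatch."""
    if raw is None:
        return WorkerResult(status="failed")
    if isinstance(raw, WorkerResult):
        return raw
    if isinstance(raw, dict):
        return WorkerResult(status=raw.get("status", "unknown"),
                            pr_url=raw.get("pr_url"))
    return WorkerResult(status=getattr(raw, "status", "unknown"),
                        pr_url=getattr(raw, "pr_url", None))
```

The normalizer is a bridge; the protocol is the cure. Once every engine returns `WorkerResult` directly, the dict branch becomes dead code you can delete.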
OpenClaw gateway blocking Claude Code
Our runner containers include OpenClaw, which starts a local gateway on port 18789. Claude Code also needs network access to reach the LLM API. Problem: the OpenClaw gateway was intercepting outbound requests and trying to route them through its own provider system, which doesn't have our API keys configured.
The fix: set OPENCLAW_SKIP_GATEWAY=1 in containers that run Claude Code CLI tasks. OpenClaw's gateway is useful for interactive agent sessions but counterproductive for headless code generation. This took three attempts to get right because there are multiple environment variables that control different parts of the OpenClaw startup sequence.
Container streaming: 30+ seconds to start
Streaming containers (the ones with VNC for live viewing) were taking 30+ seconds to become ready. For a two-line code fix, that is an absurd overhead. The delay: Xvfb startup, x11vnc negotiation, websockify bridge, then OpenClaw gateway initialization, then Claude Code.
We did not fully solve this. The VNC stack is inherently heavy. What we did: made streaming optional per-task (most tasks do not need it), and for non-streaming tasks, skipped the entire VNC stack. Non-streaming containers start in under 5 seconds.
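The per-task switch is simple in principle: the VNC stack is either present in the startup sequence or entirely absent. A sketch, with command strings mirroring the services named above (the exact flags and the function itself are illustrative):

```python
def startup_commands(streaming: bool) -> list[str]:
    """Build the container startup sequence. Headless tasks skip the
    whole VNC stack, which is where most of the 30+ seconds went."""
    vnc = [
        "Xvfb :1",                         # virtual framebuffer
        "x11vnc -display :1",              # VNC server on the framebuffer
        "websockify 6080 localhost:5900",  # bridge VNC to the browser viewer
    ]
    return (vnc if streaming else []) + ["claude-code run"]
```

Making streaming opt-in inverts the default: you pay the startup cost only when a human actually wants to watch.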
Git auth for private repos on VPS
When the autopilot container on the VPS tried to clone its-boris/marginmandy (a private repo), it failed. The GitHub token was configured for Curate-Me-ai/platform but not for the user's personal repos. Per-org config means per-org credentials, and we had not wired the GitHub token through the container environment for non-platform repos.
The fix: each org's github_config now includes a github_token field, and the container executor injects it as GITHUB_TOKEN in the container environment. The token needs repo scope for the target repo. This seems obvious in retrospect but was not part of the original design because the original design assumed one repo.
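A sketch of that injection step. The `github_config`/`github_token` fields and the `GITHUB_TOKEN` variable come from the fix described above; the function itself and its fail-loud behavior are our illustration:

```python
def container_env(org_config: dict) -> dict[str, str]:
    """Inject the per-org GitHub token into the container environment.
    Fail loudly rather than let the clone die inside the container with
    a confusing auth error."""
    token = org_config.get("github_config", {}).get("github_token")
    if not token:
        raise ValueError("org has no github_token configured; "
                         "a private-repo clone would fail")
    return {"GITHUB_TOKEN": token}
```

Failing before the container starts turns a buried in-container clone error into an immediate, attributable configuration error.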
What we learned
Per-org config is essential from day one
We built the entire pipeline assuming one repo, one org, one deploy target. Making it multi-org required touching 15+ files: routes, services, pipeline steps, board queries, deploy templates. If we had started with org_id as a first-class parameter everywhere, the multi-repo work would have been a configuration change instead of a refactor.
This is a known pattern in B2B SaaS — multi-tenancy is always harder to add later than to build in — but knowing it and feeling it are different things.
Do not mount auth files into Docker containers
Our first approach to giving containers access to the Anthropic API was mounting ~/.claude/auth.json into the container. This broke in multiple ways: the file path differs across machines, Docker-in-Docker has different mount semantics, and the OAuth token in the file expires. The working approach: pass ANTHROPIC_API_KEY as an environment variable set to the OAuth token, refreshed at container start time.
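The working hand-off can be sketched as: read the token on the host at container-start time, pass it as an environment variable. The `access_token` key inside auth.json is an assumption about the file's layout:

```python
import json
import pathlib

def anthropic_env(auth_path: str) -> dict[str, str]:
    """Read the OAuth token from the host-side auth file at container
    start and hand it over as an env var instead of mounting the file.
    Re-reading at each start picks up refreshed tokens automatically."""
    data = json.loads(pathlib.Path(auth_path).read_text())
    return {"ANTHROPIC_API_KEY": data["access_token"]}
```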
Environment variables are less elegant than file mounts. They are also more portable, more debuggable, and do not break when you move from local Docker to VPS Docker to Kubernetes.
Public datasets beat proprietary data for blog content
The Margin Mandy content rewrite was revealing. The original posts were generic advice pieces — "5 Tips for Hospital CFOs." The rewrite focused on specific, publicly available datasets: CMS Hospital Cost Report data, DOGE federal spending records, Medicare Provider Utilization files. Posts backed by real, free, downloadable data got better engagement in our testing and were faster to write because the data provides the structure.
For anyone building content as a reference app: find the public datasets in your niche first, then write about them. The data does the heavy lifting.
Reference apps sell better than feature pages
We now have three reference apps: the curate-me.ai dashboard (the product itself), its-boris.com (this blog, deployed via the pipeline), and marginmandy.com (hospital CFO blog, also deployed via the pipeline). Each one demonstrates a different aspect of the platform:
- Dashboard: governance, cost tracking, runner management
- its-boris.com: AI-authored content pipeline, deploy automation
- marginmandy.com: multi-repo support, template-based deploys, domain-specific content generation
Nobody reads feature comparison tables. But "here is a live site that was built and deployed by the AI dev team pipeline" is a demo that sells itself.
Cost: $0.00
Every LLM operation this week ran through either the Anthropic Max subscription (Claude Opus/Sonnet/Haiku at $0 marginal cost) or ChatGPT Plus (GPT-5.4 for review, also at $0 marginal cost). The gateway tracked 2,340 governed requests and 695K tokens processed, all at zero incremental cost.
This is the advantage of routing through your own gateway with BYOK (bring-your-own-key) support. The subscription covers the tokens. The gateway provides the governance. The cost dashboard shows exactly what would have been spent at API rates — useful for understanding the economics even when you are not paying per-token.
What is next
The pipeline works across three repos. The deploy automation works for two of them (platform and blog). The remaining gaps:
- Margin Mandy deploy template — the blog_ssh template needs to be configured for marginmandy.com
- Auto-merge on review approval — the pipeline creates PRs and posts reviews, but merge is still manual. The webhook handler exists but is not wired to the merge step.
- Smart engine selection tuning — the complexity analyzer overestimates small tasks, sending them to Opus when Haiku would suffice. It needs calibration data from real task completions.
- Board integration for Margin Mandy — per-org board config exists but is not yet seeded for the marginmandy org.
The goal for next week: close the loop completely. Issue created on any configured repo, automatically decomposed, coded, reviewed, merged, deployed, and closed. Zero human intervention as the default, with human-in-the-loop as a configurable gate at any step.
This post was co-authored by Claude Code (Opus 4.6). All numbers come from real git history (git log --since="2026-03-19"), real GitHub PR data (gh pr list --state merged), and real MongoDB records. The bugs described were real bugs that caused real failures in real pipeline runs.
Previous: From GitHub Issue to Production Deploy in 3 Minutes — the full architecture deep-dive on the pipeline itself.