autoresearch · karpathy-loop · openclaw · governance · self-improving-agents · devops

AutoResearch: How We Use AI to Improve Our Own Platform

We built an autonomous experimentation pipeline that runs real experiments on our B2B dashboard — the Karpathy Loop with governance, cost tracking, and knowledge accumulation. Here's how it works and what we learned.

April 5, 2026 · 5 min read
AI Collaboration

AutoResearch Pipeline: Autonomous experimentation

blog-dev: Page implementation

Total AI cost: $2.16

Governed by curate-me.ai

Two weeks ago, Andrej Karpathy released AutoResearch — a 630-line Python script that ran 700 experiments in 2 days and found 20 optimizations. Shopify's CEO let it run overnight on internal data: 37 experiments, 19% performance gain.

We asked: what if we pointed this at our own product?

The Experiment

We built an AutoResearch pipeline that runs real experiments on the Curate-Me B2B dashboard — a Next.js 15 app with 229 pages and 1,100+ source files. Not a toy benchmark. The actual product.

Every experiment follows the Karpathy Loop:

  1. Read current state — check what the last experiment achieved
  2. Hypothesize — propose a single, targeted change
  3. Modify — make the code change
  4. Evaluate — run the eval command, extract the metric
  5. Decide — if the metric improved, commit. If not, revert.
  6. Repeat
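The loop above is small enough to sketch end to end. Here is a minimal Python version, assuming `commit` and `revert` wrap `git commit` and `git checkout` in a real pipeline; every name in it is illustrative, not Karpathy's actual script:

```python
def karpathy_loop(propose_change, run_eval, commit, revert, iterations=10):
    """One-metric hill climb: keep a change only if the eval improves.

    propose_change() applies one targeted edit and returns its description;
    run_eval() returns a single numeric metric (higher is better);
    commit(desc) / revert() persist or undo the working-tree change.
    """
    best = run_eval()                       # 1. read current state (baseline)
    log = []
    for _ in range(iterations):             # 6. repeat
        desc = propose_change()             # 2-3. hypothesize + modify
        metric = run_eval()                 # 4. evaluate
        if metric > best:                   # 5. decide: improved -> commit
            commit(desc)
            best = metric
            log.append((desc, "kept", metric))
        else:                               #    worse -> revert
            revert()
            log.append((desc, "reverted", metric))
    return best, log
```

The only state the loop trusts is the metric; everything else (which change to try next) comes from the knowledge store described below.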

The key insight: each experiment writes its learnings to a knowledge store. Future experiments read those learnings before starting. The system gets smarter with every run.
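A knowledge store can be as simple as an append-only JSONL file that every run reads before starting and writes to after finishing. This is a hedged sketch; the filename and record shape are assumptions, not our actual schema:

```python
import json
from pathlib import Path

STORE = Path("knowledge.jsonl")  # hypothetical location of the knowledge store

def record_learning(experiment: str, outcome: str, note: str) -> None:
    """Append one learning as a JSON line so future runs can read it."""
    with STORE.open("a") as f:
        record = {"experiment": experiment, "outcome": outcome, "note": note}
        f.write(json.dumps(record) + "\n")

def prior_learnings() -> list[dict]:
    """Read everything past experiments wrote, oldest first."""
    if not STORE.exists():
        return []
    return [json.loads(line) for line in STORE.read_text().splitlines() if line.strip()]
```

Append-only means no experiment can clobber another's learnings, and the file doubles as an audit trail of every hypothesis tried.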

What We Actually Found

We ran experiments across 6 metrics. Here's what the dashboard baselines looked like:

| Metric | Baseline | What We Found |
|--------|----------|---------------|
| lucide-react imports | 4 files (vs 690 Phosphor files) | 99.4% of the codebase already used Phosphor. 4 holdout files from an early template. |
| @monaco-editor bundle | 1 static import, 380KB gzipped | Used in exactly 1 file. Dynamic import saves 180KB from initial load. |
| @xyflow/react bundle | 27 files, all static imports | Only used in the workflow builder. Already behind a dynamic boundary at page level. |
| Test coverage | 4% line threshold | Shocking: 162 test files for 1,106 source files (a 14.6% ratio). |
| Playwright workers | 1 (sequential) | Config comments suggest previous OOM issues. Safe to increase to 2-3. |
| TypeScript `any` types | 198 in source | 70% are in test files. Only 15-20 non-test instances. |

The first three experiments alone would reduce the initial bundle by 233KB. That's 500ms faster on 3G.

The Governance Layer

Here's what makes this different from running a script locally: every LLM call in the experiment pipeline flows through our gateway.

The gateway adds 6 governance checks before each request reaches the LLM provider:

  1. Rate limit — 100 RPM per org
  2. Cost estimate — BPE tokenization predicts the cost before tokens are consumed
  3. PII scan — regex scan for secrets and personally identifiable information
  4. Security scanner — prompt injection and jailbreak detection
  5. Model allowlist — enforce which models each org can use
  6. HITL gate — flag high-cost requests for human approval

Total governance latency: 3.1ms. Every experiment is cost-tracked, auditable, and governed.
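To make the six checks concrete, here is a toy gateway sketch in Python. The check order mirrors the list above; the thresholds, regexes, the ~4-characters-per-token estimate (standing in for real BPE tokenization), and the model names are all assumptions, not the production gateway:

```python
import re
import time

class GovernanceError(Exception):
    pass

class Gateway:
    """Toy sketch of pre-request governance checks (not the production gateway)."""

    def __init__(self, rpm_limit=100, price_per_1k_tokens=0.01, hitl_threshold_usd=1.00):
        self.rpm_limit = rpm_limit
        self.price = price_per_1k_tokens
        self.hitl_threshold = hitl_threshold_usd
        self.request_times: list[float] = []
        self.allowlist = {"claude-sonnet", "claude-haiku"}  # hypothetical per-org allowlist

    def check(self, model: str, prompt: str) -> dict:
        now = time.monotonic()
        # 1. rate limit: keep only timestamps from the last 60s, then count
        self.request_times = [t for t in self.request_times if now - t < 60]
        if len(self.request_times) >= self.rpm_limit:
            raise GovernanceError("rate limit exceeded")
        self.request_times.append(now)
        # 2. cost estimate: crude chars/4 stand-in for BPE token counting
        est_tokens = max(1, len(prompt) // 4)
        est_cost = est_tokens / 1000 * self.price
        # 3. PII/secret scan (one obvious pattern; a real scan is far broader)
        if re.search(r"(?i)api[_-]?key\s*[:=]", prompt):
            raise GovernanceError("possible secret in prompt")
        # 4. security scanner (prompt-injection heuristics) would run here
        # 5. model allowlist
        if model not in self.allowlist:
            raise GovernanceError(f"model {model!r} not allowed for this org")
        # 6. HITL gate: flag, don't block, when the estimated cost is high
        return {"est_cost_usd": est_cost, "needs_approval": est_cost > self.hitl_threshold}
```

The important property is that a request either fails fast with a named reason or comes back annotated with its estimated cost and approval flag, before any provider tokens are consumed.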

The Pipeline: Blog to PR

The full chain looks like this:

Blog trigger button
  → Blog API route (/api/demos/autoresearch/trigger)
    → Gateway autopilot endpoint (/api/v1/autopilot/run/dev_team)
      → Docker container starts (openclaw-base image)
        → Claude Code CLI runs the experiment
          → Git changes committed
            → GitHub PR created
              → Results stored in MongoDB
                → Blog experiment archive updated
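The first hop of that chain can be sketched as a client. The trigger path is taken from the diagram above; the payload shape, the `send` hook, and the response format are assumptions for illustration only:

```python
import json
from urllib import request as urlrequest

# Path from the chain above; host and payload contract are assumed.
TRIGGER_URL = "https://its-boris.com/api/demos/autoresearch/trigger"

def build_trigger_payload(metric: str, hypothesis: str) -> bytes:
    """Hypothetical payload shape, not the real API contract."""
    return json.dumps({"metric": metric, "hypothesis": hypothesis}).encode()

def trigger_experiment(metric: str, hypothesis: str, send=None):
    """POST the trigger; `send` is injectable so this runs without network access."""
    body = build_trigger_payload(metric, hypothesis)
    if send is None:
        req = urlrequest.Request(
            TRIGGER_URL, data=body, headers={"Content-Type": "application/json"}
        )
        with urlrequest.urlopen(req) as resp:
            return json.load(resp)
    return send(body)
```

Everything downstream of that POST (container start, Claude Code run, PR creation) happens server-side behind the gateway.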

We triggered experiments directly from its-boris.com/demos/autoresearch — no B2B dashboard access needed. Click a button, watch a real experiment execute, see the PR appear.

4 PRs created in our first session:

  • PR #1413 — E2E pipeline verification
  • PR #1414 — Remove lucide-react imports
  • PR #1415 — Blog file inventory
  • PR #1416 — Add smoke tests for demo pages

Total cost: $1.35 for 20 experiments. Compare that to a senior engineer spending 4 hours on the same work.

Self-Improving Agents

The experiments page isn't the only new demo. We also built Night Owl — a daily AI news digest agent inspired by the self-improving-agent skill (1,100+ stars on ClawHub, 90K+ downloads).

Night Owl runs daily at 8 AM UTC. Before scanning news, it reads its knowledge base — past digests, topic engagement data, source quality scores, reader feedback. After writing the digest, it records what it learned.

The quality trend chart on the demo page shows the agent improving over time. Readers can rate each digest with a star rating that feeds directly back into the learning loop.

This is the pattern no competitor offers: governance + learning in one platform. Agents that get smarter over time, governed and observable.

What We Built (Technical)

In one session, we shipped:

  • 3 new demo pages — AutoResearch, Night Owl, Runner Spotlight
  • 8 API routes — experiment archive, SSE stream proxy, on-demand triggers, feedback endpoints
  • 4 patent components — Cost Governance, MCP Self-Governance, HITL Auto-Replay, Audit Chain Viewer
  • Navigation overhaul — dropdown menus surfacing 9+ hidden pages
  • Learning dashboard — cross-agent knowledge metrics on the platform page
  • 7 infrastructure fixes — BYOVM executor attributes, agent token auth, model ID format, MongoDB client, internal URLs, stale containers

The blog at its-boris.com went from 4 nav links hiding 9 pages to a comprehensive showcase with 10 interactive demos, all powered by the Curate-Me gateway.

Try It

The AutoResearch demo is live at its-boris.com/demos/autoresearch. You can see the experiment archive with real data, trigger new experiments on demand, and watch the knowledge base grow.

The Night Owl cron agent is at its-boris.com/demos/cron. The Runner Spotlight is at its-boris.com/demos/runner.

Every experiment, every digest, every runner task flows through the governance gateway. Every LLM call is cost-tracked. Every result is auditable.

We're using our own platform to improve our own platform. That's the demo.
