
Iterative Refinement: How AI Reviews AI

How the blog uses a quality loop where one model writes and another reviews — with configurable thresholds, max iterations, and score tracking.

March 12, 2026 · 4 min read
AI Collaboration

blog-writerDrafted initial version of this post

blog-orchestratorReviewed and refined through 2 iterations

Claude (Opus 4.6)Final editorial review and scoring

Total AI cost: $0.22

Governed by curate-me.ai

The quality problem

AI-generated content has a quality ceiling. A single-shot draft from even the best model often needs revision — the structure might be off, examples might be weak, or the tone might not match what you want.

The standard fix is human editing. But if you're running a content pipeline with daily output, you need something more scalable.

The refinement loop

This blog uses an iterative refinement loop where one model writes and another reviews:

1. Step 3.5 Flash writes the initial draft
2. Claude Sonnet reviews it → scores 1-10 with specific feedback
3. If score < 7.5 (threshold): send feedback to Step 3.5 Flash for revision
4. Step 3.5 Flash revises with the reviewer's notes
5. Claude Sonnet reviews again
6. Repeat until score ≥ 7.5 or max iterations (3) reached
7. Final draft goes to Slack for human approval

The key insight: the writer and reviewer are different models. This avoids the echo chamber problem where a model reviews its own work and always thinks it's great.
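The steps above can be sketched as a simple loop. This is a minimal illustration, not the actual pipeline code: `writeDraft` and `reviewDraft` are hypothetical stand-ins for the real model calls (Step 3.5 Flash and Claude Sonnet), mocked here so the control flow is visible. The min-iterations option is omitted for brevity.

```typescript
interface Review {
  score: number;          // 1-10, compared against the quality threshold
  improvements: string[];
  revisedPrompt: string;  // instructions fed back to the writer
}

const THRESHOLD = 7.5;
const MAX_ITERATIONS = 3;

// Mock writer: in the real pipeline this calls Step 3.5 Flash.
function writeDraft(prompt: string): string {
  return `Draft based on: ${prompt}`;
}

// Mock reviewer: returns improving scores to simulate convergence.
const mockScores = [5.5, 7.1, 8.2];
let round = 0;
function reviewDraft(draft: string): Review {
  return {
    score: mockScores[Math.min(round++, mockScores.length - 1)],
    improvements: ["Tighten the opening"],
    revisedPrompt: "Rewrite the opening paragraph to start with a scenario",
  };
}

function refine(initialPrompt: string): { draft: string; scores: number[] } {
  let draft = writeDraft(initialPrompt);
  const scores: number[] = [];

  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const review = reviewDraft(draft);
    scores.push(review.score);
    if (review.score >= THRESHOLD) break; // quality bar met, stop refining
    draft = writeDraft(review.revisedPrompt); // revise with reviewer notes
  }
  return { draft, scores };
}

const result = refine("Write a post about the gateway architecture");
console.log(result.scores); // [5.5, 7.1, 8.2] with the mocks above
```

The loop exits either when the reviewer's score clears the threshold or when the iteration cap is hit, so a stubborn draft can never spin forever.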

What the reviewer scores

The reviewer model evaluates each draft on specific criteria and returns structured feedback:

{
  "score": 7.2,
  "strengths": [
    "Clear explanation of the gateway architecture",
    "Good use of concrete examples"
  ],
  "improvements": [
    "Opening is too generic — start with a specific scenario",
    "Missing comparison to alternative approaches",
    "Code example on line 45 has a syntax error"
  ],
  "revisedPrompt": "Rewrite the opening paragraph to start with..."
}

The revisedPrompt is especially important — it gives the writer model specific, actionable instructions for the next iteration, not just vague feedback.
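Because the reviewer's output drives the next iteration, it pays to validate it before trusting it. A hypothetical TypeScript shape for the feedback payload, with a small guard that rejects malformed reviewer output (the type and function names are illustrative, not the gateway's actual API):

```typescript
// Hypothetical shape of the reviewer's structured feedback.
interface ReviewFeedback {
  score: number;          // 1-10, compared against the quality threshold
  strengths: string[];
  improvements: string[];
  revisedPrompt: string;  // actionable instructions for the next round
}

// Reviewer output arrives as raw JSON from a model; validate before use.
function parseFeedback(raw: string): ReviewFeedback {
  const data = JSON.parse(raw);
  if (typeof data.score !== "number" || data.score < 1 || data.score > 10) {
    throw new Error(`invalid score: ${data.score}`);
  }
  if (typeof data.revisedPrompt !== "string" || data.revisedPrompt.length === 0) {
    throw new Error("missing revisedPrompt");
  }
  return data as ReviewFeedback;
}

const feedback = parseFeedback(
  '{"score": 7.2, "strengths": [], "improvements": [], "revisedPrompt": "Rewrite the opening..."}'
);
console.log(feedback.score < 7.5); // true, so another refinement round runs
```

Guarding the payload matters because a model can occasionally emit malformed JSON or an out-of-range score, and a bad revisedPrompt would silently derail the next draft.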

Score progression

Here's what a typical refinement session looks like:

| Iteration | Score | Key feedback |
|-----------|-------|--------------|
| 1 | 5.5 | "Structure is unclear, missing concrete examples" |
| 2 | 7.1 | "Much better structure, but opening is still weak" |
| 3 | 8.2 | "Converged — strong opening, good examples, clear flow" |

The agents page shows these score progressions in real time for active refinement sessions.

Configuration

The refinement loop is fully configurable through the curate-me.ai gateway:

  • Reviewer model: Which model scores the drafts (default: Claude Sonnet 4.6)
  • Quality threshold: Minimum score to accept (default: 7.5)
  • Max iterations: Safety limit on refinement rounds (default: 3)
  • Min iterations: Force at least N rounds even if score is high (default: 1)
  • Enabled/disabled: Toggle the entire loop on or off

You can change all of these from the fleet config panel on the agents page.
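Putting those knobs together, the acceptance check has to respect both the threshold and the minimum-iteration floor. A sketch of what that logic might look like (the field names are illustrative, not the gateway's actual config schema):

```typescript
// Hypothetical config object mirroring the options listed above.
interface RefinementConfig {
  enabled: boolean;
  reviewerModel: string;
  qualityThreshold: number; // minimum score to accept
  maxIterations: number;    // safety limit on refinement rounds
  minIterations: number;    // force at least N rounds even if score is high
}

const defaults: RefinementConfig = {
  enabled: true,
  reviewerModel: "claude-sonnet-4.6",
  qualityThreshold: 7.5,
  maxIterations: 3,
  minIterations: 1,
};

// A draft is accepted only when the score clears the threshold AND
// the minimum number of rounds has already run.
function shouldAccept(score: number, iterationsDone: number, cfg: RefinementConfig): boolean {
  return score >= cfg.qualityThreshold && iterationsDone >= cfg.minIterations;
}

console.log(shouldAccept(8.2, 1, defaults)); // true: above threshold, min rounds done
console.log(shouldAccept(8.2, 0, defaults)); // false: min iterations not yet reached
```

The min-iterations floor is useful when you want at least one review on record even for drafts that score well on the first pass.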

Why not just use a better model?

You could skip refinement and use the best available model for writing. But there are tradeoffs:

  1. Cost — Step 3.5 Flash is significantly cheaper than Claude Opus. Three iterations of Step 3.5 Flash + one Sonnet review costs less than a single Opus draft.
  2. Diversity — Different models have different strengths. Step 3.5 Flash writes fluently; Claude catches logical gaps. Combined, they produce better output than either alone.
  3. Auditability — The score progression creates a paper trail. You can see exactly how a draft evolved, what feedback was given, and whether the quality threshold was met.

The human still decides

Refinement doesn't replace human judgment. It raises the floor. By the time a draft reaches Slack for Boris's review, it's already passed a minimum quality bar. This means less time editing and more time making editorial decisions about what to publish.

The loop is one piece of the larger content pipeline — from research to writing to refinement to approval to publishing. Each stage adds a layer of quality control, and each stage is visible on the agents page.
