Meli AI Mentoring

4.4

They build evaluation workflows with rubrics, examples, and judges to scale quality standards

Chapter Progress: Early Draft

A judge lets one trusted reviewer check work you could never read by hand

There is a moment in working with AI where the work outgrows your eyes. Early on you read every output, catch what is off, and fix it. Then you start delegating more, and the volume climbs past what one person can read. Once you stop reading each output, quality tends to drift back toward whatever the model produces on its own. The careful standard you built lives in your head, and your head is no longer in the loop on most of the work.

A judge is how you put your standard back in the loop without reading everything yourself. The idea is plain: you set up a second AI session whose only job is to look at one output and check it against what you decided good looks like. You teach the standard once, and the judge applies it to every output you point it at. It never gets tired on the fortieth draft, and it never quietly lowers the bar because the day got long. This chapter is about building that judge so it checks the work the way you would, and then folding it into the tasks you run again and again.

Figure: hand-drawn judge calibration loop showing gold and rejected examples feeding a rubric, a judge producing a verdict, and calibration looping back. A judge scales your standard only after gold examples, rejected examples, and a rubric teach it where the boundary falls.

You have met a smaller version of this move already. The chapter on building evaluation loops before scaling delegation taught you to check one AI output against your standard before you trust it. That check was you, reading one result. A judge is that same check, written down clearly enough that a second AI session can run it for you, again and again. The rest of this chapter is how to write the check that well: with a rubric, with examples of good and bad work, and with the close cases that show exactly where your line sits.

One concrete task runs through the sections that follow: getting AI to write you a useful morning brief, a short rundown of your day so you can plan it. As of 2026 that looks like typing into a chat model, and the interface will keep changing under it, to voice, then to glasses, then to a brain-computer interface, then to whatever comes after. The judging skill survives every one of those shifts. Suppose your brief now runs every morning on its own, and on most days it is good. The problem is the days you do not catch. You are not reading each brief against your standard anymore, so a brief that buries your one hard meeting under three trivial reminders slides through. A judge is what reads each brief for you and flags the ones that miss.

Make your standard concrete enough for a second session to apply

A judge can only apply a standard you have made concrete, which is the step most people skip. A standard you can feel but cannot state will not survive being handed to another session. When you read a weak brief and think 'this one is off,' you are running a rich, mostly wordless judgment. The judge has none of that. It has only what you wrote down. So the real task of building a judge is turning your felt sense of good into something explicit enough that a separate session reaches the same verdict you would.

Concretely, that explicit standard has four parts, and they reinforce one another. You name the dimensions that matter, you show what passing looks like, you show what failing looks like, and you mark where the line between them sits. A rubric carries the dimensions. Gold examples carry passing. Rejected examples carry failing. Boundary cases carry the line. The sections that follow explain why examples do so much of the work and then build each part on the morning brief. You will see why all four are needed: drop any one and the judge starts guessing in a place where you could have told it the answer.

Showing the judge graded examples calibrates it more than describing your standard

There is a reason examples calibrate a judge more than careful descriptions do. A description like 'lead with what matters most' sounds precise, but the words leave huge room. The model fills that room with its own prior of what a good brief looks like, which is an average over everything it has read, not your specific taste. An example collapses the room: it pins one concrete output to one verdict you made, so the judge can reason from a real case rather than from a generic prior.

This is also where a quiet failure hides. A polished, confident output can read as good to a judge that only has verbal criteria, because surface fluency is exactly what a generic prior rewards. Showing the judge a rejected example that is fluent and wrong is how you teach it to look past the surface. If you only ever describe your standard, the judge tends to inherit the same blind spots you were trying to escape.

Write a rubric that names the dimensions quality turns on

Start by naming what you are even measuring. A rubric is the short list of dimensions that decide whether an output is good for this specific task. For the morning brief, the dimensions might be: it reads your actual calendar instead of inventing a schedule, it leads with the day's hardest or highest-stakes item, it stays short enough to read in under a minute, and it flags anything you said you were dreading. Each line names one dimension and says, in a sentence, what strong looks like there and what weak looks like. That paired contrast is what gives the judge a place to put each output on a scale and produce a verdict.
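If you prefer to keep the rubric as data rather than as loose prose, a minimal sketch might look like the following. The four dimensions come from the morning-brief example above; the `Dimension` class, the `render_rubric` helper, and the exact strong and weak wording are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    strong: str  # what a strong output looks like on this dimension
    weak: str    # what a weak output looks like on this dimension


# Dimensions taken from the morning-brief example; the strong/weak wording
# is illustrative and should be replaced with your own.
MORNING_BRIEF_RUBRIC = [
    Dimension(
        name="Grounded in the real calendar",
        strong="Every meeting named appears on the actual calendar.",
        weak="Invents or misstates meetings.",
    ),
    Dimension(
        name="Leads with the hardest item",
        strong="Opens with the day's hardest or highest-stakes item.",
        weak="Buries the hard item under trivial reminders.",
    ),
    Dimension(
        name="Short enough to read in under a minute",
        strong="Can be read in under a minute.",
        weak="Runs long enough that it gets skimmed or skipped.",
    ),
    Dimension(
        name="Flags what you said you were dreading",
        strong="Calls out anything you said you were dreading.",
        weak="Omits or glosses over the dreaded item.",
    ),
]


def render_rubric(rubric):
    """Turn the rubric into plain text a judge session can read."""
    lines = []
    for number, dim in enumerate(rubric, start=1):
        lines.append(f"{number}. {dim.name}")
        lines.append(f"   Strong: {dim.strong}")
        lines.append(f"   Weak: {dim.weak}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_rubric(MORNING_BRIEF_RUBRIC))
```

Keeping strong and weak side by side for each dimension preserves the paired contrast, and the rendered text can be pasted into whatever judge session you use.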

Keep the rubric tied to this one task. A rubric for morning briefs measures different things than a rubric for research summaries or client emails, so a borrowed rubric quietly drags in the wrong dimensions. If you already built a rubric for this task in an earlier chapter, for example while raising your standards, reuse it here; that is the same standard, now doing a second job. Earlier chapters, including the one on rejecting AI, produce rubrics worth keeping. If you have none yet, ask AI to draft one from a few examples, then correct it until it names the dimensions you care about.

Show passing with gold examples so the judge sees the target, not a description of it

Words tell the judge what you want; an example shows it. A gold example is an output you would accept without changes, handed to the judge as a reference point for passing. Include at least two, because one example can be read as a fluke while two start to mark a pattern. When the judge evaluates a new brief, it has something concrete to reason from: does this brief achieve what the gold ones achieved? Two strong briefs of yours, both labeled as passing, teach more than a paragraph describing the ideal brief, because they demonstrate the standard instead of gesturing at it.

Show failing with rejected examples so polish alone cannot pass

A judge also needs to see what you turn down, and why. A rejected example is an output you consider weak, paired with a short note naming what made it weak. Include at least two. The judge reads these as anti-patterns: if a new brief shares the qualities that sank a rejected one, it should fail. The note is what carries the lesson. A brief that reads smoothly but quietly invented two meetings is exactly the case a description would miss and an annotated rejected example catches, because it teaches the judge that polish is not the standard and accuracy is.

Mark the line with a boundary case so close calls do not coin-flip

The hardest outputs to judge are the ones that sit right on the line, and those are the ones that reveal whether your judge is calibrated. A boundary case is an output close to the pass or fail line, with your own call written down and the reason for it. Include at least one. A brief that names your real meetings but buries the hard one in the middle might be a pass on a light day and a fail on a heavy one; whichever way you call it, saying why teaches the judge where the threshold really sits. Boundary cases are how you check that the judge agrees with you on the close calls, not only the obvious ones.
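With a rubric, gold examples, rejected examples, and a boundary case in hand, the pieces can be assembled into one judge prompt. The sketch below is one possible shape, not a required one: `LabeledExample`, `missing_examples`, `build_judge_prompt`, and the prompt wording are all hypothetical names and choices. The minimum counts (two gold, two rejected, one boundary) are the ones suggested in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledExample:
    kind: str     # "gold", "rejected", or "boundary"
    output: str   # the brief itself
    verdict: str  # your own call: "pass" or "fail"
    note: str     # why you made that call


# Minimum counts suggested in the text.
MINIMUMS = {"gold": 2, "rejected": 2, "boundary": 1}


def missing_examples(examples):
    """Return how many more examples of each kind are still needed."""
    counts = {kind: 0 for kind in MINIMUMS}
    for example in examples:
        if example.kind in counts:
            counts[example.kind] += 1
    return {
        kind: MINIMUMS[kind] - counts[kind]
        for kind in MINIMUMS
        if counts[kind] < MINIMUMS[kind]
    }


def build_judge_prompt(rubric_text, examples, candidate):
    """Assemble the rubric, graded examples, and the output to judge."""
    gaps = missing_examples(examples)
    if gaps:
        raise ValueError(f"Add more examples before judging: {gaps}")
    parts = ["Judge the candidate against this rubric.", rubric_text, ""]
    for example in examples:
        parts.append(f"[{example.kind.upper()} example, verdict: {example.verdict}]")
        parts.append(example.output)
        parts.append(f"Why: {example.note}")
        parts.append("")
    parts.append("[CANDIDATE]")
    parts.append(candidate)
    parts.append("Reply with PASS or FAIL and name the dimension that decided it.")
    return "\n".join(parts)


if __name__ == "__main__":
    # With only one gold example, the check reports what is still missing.
    one_gold = [LabeledExample("gold", "Hard item first...", "pass", "Leads well.")]
    print(missing_examples(one_gold))
```

Refusing to build the prompt until the minimums are met is a design choice: it keeps you from running a judge that is still guessing at one of the four parts.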

Calibrate the judge against your own calls before you trust it

A fresh judge is a guess about your standard, not yet proof. Treat the judge as a hypothesis you test against cases you have already decided. Run it across outputs you have personally scored, gold and rejected and boundary together, and compare its verdict to yours on each one. Where it agrees, good. Where it disagrees, the disagreement is the information: it points at a criterion that was too vague, a missing example, or a line you never drew clearly. You can hand the disagreement back to the judge and ask it to name which criterion or missing example produced the miss, then adjust the rubric, add the example, or sharpen the wording and run it again.
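The comparison step can be as simple as a loop over the cases you have already scored. In this sketch, `find_disagreements` and `naive_judge` are hypothetical: the judge is any callable that returns "pass" or "fail", and the stand-in here deliberately rewards brevity alone so that it misses a fluent but inaccurate brief.

```python
def find_disagreements(scored_cases, judge):
    """Compare the judge's verdict to yours on cases you already scored.

    scored_cases: list of (case_id, output_text, your_verdict) tuples.
    judge: any callable that takes output_text and returns "pass" or "fail".
    """
    misses = []
    for case_id, output_text, your_verdict in scored_cases:
        judge_verdict = judge(output_text)
        if judge_verdict != your_verdict:
            misses.append(
                {"case": case_id, "yours": your_verdict, "judge": judge_verdict}
            )
    return misses


# Stand-in judge for illustration only: it passes anything short, which
# mimics a judge that has learned brevity but not accuracy.
def naive_judge(output_text):
    return "pass" if len(output_text.split()) <= 12 else "fail"


scored = [
    ("gold-1", "Hard item first: 10am budget review. Then two short check-ins.", "pass"),
    # Reads smoothly, but both meetings were invented, so you rejected it.
    ("rejected-1", "Smooth day ahead: 9am sync, 2pm offsite.", "fail"),
]

for miss in find_disagreements(scored, naive_judge):
    print(miss)
```

Each miss names the case to revisit. Here the only miss is the rejected brief, which suggests the judge needs an annotated example showing that polish without accuracy fails.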

Use a concrete agreement bar so you know when to stop tuning and start trusting. A reasonable starting rule of thumb is to keep adjusting until the judge agrees with your own calls on at least eighty percent of cases before you let it review new work. Treat that figure as a place to begin, not a law, and raise it for higher-stakes tasks where a missed flag costs more. Below your chosen bar, the judge is still applying a standard close to yours but not yet yours, and the gaps will show up as work it waves through that you would have flagged. Once the morning-brief judge clears the bar on your scored briefs, you can let it watch tomorrow's brief instead of reading it yourself.
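The agreement bar is simple arithmetic: matching verdicts divided by total scored cases. A minimal sketch follows, with hypothetical function names and made-up verdict lists; the 0.80 default is the rule-of-thumb starting point from the text, not a fixed requirement.

```python
def agreement_rate(your_verdicts, judge_verdicts):
    """Fraction of cases where the judge's verdict matches yours."""
    if len(your_verdicts) != len(judge_verdicts):
        raise ValueError("Both lists must cover the same cases.")
    if not your_verdicts:
        raise ValueError("Score at least one case before calibrating.")
    matches = sum(1 for y, j in zip(your_verdicts, judge_verdicts) if y == j)
    return matches / len(your_verdicts)


def clears_bar(your_verdicts, judge_verdicts, bar=0.80):
    """True once agreement reaches the chosen bar (0.80 is a starting point)."""
    return agreement_rate(your_verdicts, judge_verdicts) >= bar


# Illustrative verdicts on ten scored briefs; the judge misses two of them.
yours = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass", "pass"]

print(agreement_rate(yours, judge))    # 8 of 10 cases match
print(clears_bar(yours, judge))        # checks the 0.80 starting bar
print(clears_bar(yours, judge, 0.90))  # checks a stricter, higher-stakes bar
```

Note that both misses in this made-up data are briefs you failed and the judge passed. An overall rate can hide that kind of one-sided error, so it is worth reading the individual disagreements as well as the number.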

A judge and its generator can share the same blind spot

When you can, have a different model judge the output than the one that wrote it. The reason is correlated blind spots: a model tends to approve patterns that look like its own work, so a model grading itself can wave through the very habits you were trying to catch. A second, independent model is less likely to share that exact bias, so its disagreements are more informative.

As of 2026 this looks like generating with one chat model and judging with another from a different maker, and the specific names will keep changing while the move stays the same. If you generate with Claude, you might judge with ChatGPT, or the reverse. When you are stuck inside one provider, reach for a different model from that provider rather than the identical one. The durable point survives every rename: independence in the judge is worth seeking, because a grader that thinks exactly like the writer cannot see what the writer missed.
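The preference order can be written down as a small selection rule. This is a sketch under assumptions: `pick_judge` is a hypothetical helper, models are represented as (maker, model) pairs, and the names are placeholders rather than real product identifiers. It prefers a different maker, then falls back to a different model from the same maker.

```python
def pick_judge(generator, available):
    """Choose a judge model that is as independent of the generator as possible.

    generator and each entry in available are (maker, model) pairs.
    Preference order: a different maker, then a different model from the
    same maker, then None if only the identical model is available.
    """
    different_maker = [m for m in available if m[0] != generator[0]]
    if different_maker:
        return different_maker[0]
    different_model = [m for m in available if m != generator]
    if different_model:
        return different_model[0]
    return None


# Placeholder names, not real product identifiers.
generator = ("maker-a", "model-large")

# A different maker is available, so it is preferred.
print(pick_judge(generator, [("maker-a", "model-small"), ("maker-b", "model-x")]))

# Stuck inside one provider: fall back to a different model from it.
print(pick_judge(generator, [("maker-a", "model-large"), ("maker-a", "model-small")]))
```

Returning `None` when only the identical model is available makes the loss of independence visible instead of silently letting a model grade itself.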

Fold the calibrated judge into the workflow so future runs inherit it

A calibrated judge is too valuable to leave as a one-time test. Once the judge agrees with you, fold it into the instructions for the recurring task so every future run is checked against the same rubric and examples without you wiring it up again. The judge stops being a thing you ran once and becomes a standing reviewer that travels with the work. For the morning brief, that means the brief now generates and gets judged in one workflow, and you see the verdict instead of re-reading the brief from scratch each day. You have just made an abstraction jump: a standard you held by feel is now an automated check riding inside the workflow, one level higher than where you started, and your attention is free for the next thing.
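Folding the judge into the workflow can be pictured as one function that generates and judges in the same step. In this sketch, `run_with_judge`, `fake_generate`, and `fake_judge` are hypothetical stand-ins; a real workflow would call whatever model sessions you use in place of the stubs.

```python
def run_with_judge(generate, judge):
    """Generate one output and judge it in the same step.

    generate: callable returning the output text (for example, today's brief).
    judge: callable taking that text and returning (verdict, reason).
    Both are hypothetical stand-ins for whatever sessions you actually use.
    """
    output = generate()
    verdict, reason = judge(output)
    return {
        "output": output,
        "verdict": verdict,
        "reason": reason,
        "needs_review": verdict != "pass",
    }


# Stubs so the sketch runs on its own; a real workflow would call a model.
def fake_generate():
    return "Reminders: water plants, renew pass, book dentist. Also: 3pm board review."


def fake_judge(text):
    if text.startswith("Reminders"):
        return "fail", "Hard meeting is buried under trivial reminders."
    return "pass", "Leads with the hardest item."


result = run_with_judge(fake_generate, fake_judge)
print(result["verdict"], "-", result["reason"])
```

The `needs_review` flag is the part you would actually look at each morning: you read the verdict and open the brief only when it is flagged.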

From there you evolve the system rather than rerun it. When a new kind of bad output slips past the judge, the system can surface that escape as a fresh example: you can ask the judge to analyze its own misses, add the case, update the rubric, and recalibrate. This is also a place to play: chase a 'wouldn't it be cool if the judge could catch this' and try teaching it a subtler quality you used to spot only by feel, then see whether it holds on your scored cases. The workflow that produces your work and the judge that guards it both get sharper over time, each catching what the other learns, and the standard itself keeps climbing. A judge built this way is never finished, and that is the point: it grows with your standard rather than freezing it. The next chapter, on keeping current as models change, takes up the question this raises, which is how to keep a folded-in judge from drifting out of agreement with you as the ground keeps moving.

Build and calibrate one judge on your own examples

Build a calibrated judge for a recurring workflow (paid book). Claude reads your own example outputs, drafts a rubric and a self-contained judge prompt, and runs it against your examples so you can see where it agrees with you and refine until it matches.