They build evaluation workflows with rubrics, examples, and judges to scale quality standards
Chapter Progress: Early Draft
You have met a smaller version of this move already. The chapter on building evaluation loops before scaling delegation taught you to check one AI output against your standard before you trust it. That check was you, reading one result. A judge is that same check, written down clearly enough that a second AI session can run it for you, again and again. The rest of this chapter is how to write the check that well: with a rubric, with examples of good and bad work, and with the close cases that show exactly where your line sits.
One concrete task runs through the sections that follow: getting AI to write you a useful morning brief, a short rundown of your day so you can plan it. As of 2026 that looks like typing into a chat model, and the interface will keep changing under it, to voice, then to glasses, then to a brain-computer interface, then to whatever comes after. The judging skill survives every one of those shifts. Suppose your brief now runs every morning on its own, and on most days it is good. The problem is the days you do not catch. You are not reading each brief against your standard anymore, so a brief that buries your one hard meeting under three trivial reminders slides through. A judge is what reads each brief for you and flags the ones that miss.
Make your standard concrete enough for a second session to apply
A judge can only apply a standard you have made concrete, which is the step most people skip. A standard you can feel but cannot state will not survive being handed to another session. When you read a weak brief and think 'this one is off,' you are running a rich, mostly wordless judgment. The judge has none of that. It has only what you wrote down. So the real task of building a judge is turning your felt sense of good into something explicit enough that a separate session reaches the same verdict you would.
Concretely, that explicit standard has four parts, and they reinforce one another. You name the dimensions that matter, you show what passing looks like, you show what failing looks like, and you mark where the line between them sits. A rubric carries the dimensions. Gold examples carry passing. Rejected examples carry failing. Boundary cases carry the line. The next four sections build each part on the morning brief, and you will see why all four are needed: drop any one and the judge starts guessing in a place you could have told it the answer.
Write a rubric that names the dimensions quality turns on
Start by naming what you are even measuring. A rubric is the short list of dimensions that decide whether an output is good for this specific task. For the morning brief, the dimensions might be: it reads your actual calendar instead of inventing a schedule, it leads with the day's hardest or highest-stakes item, it stays short enough to read in under a minute, and it flags anything you said you were dreading. Each line names one dimension and says, in a sentence, what strong looks like there and what weak looks like. That paired contrast is what gives the judge a place to put each output on a scale and produce a verdict.
Keep the rubric tied to this one task. A rubric for morning briefs measures different things than a rubric for research summaries or client emails, so a borrowed rubric quietly drags in the wrong dimensions. If you built a rubric while raising your standards or while , reuse it here; that is the same standard, now doing a second job. The chapters on rejecting AI and on both produce rubrics worth keeping. If you have none yet, ask AI to draft one from a few examples, then correct it until it names the dimensions you care about.
Show passing with gold examples so the judge sees the target, not a description of it
Words tell the judge what you want; an example shows it. A gold example is an output you would accept without changes, handed to the judge as a reference point for passing. Include at least two, because one example can be read as a fluke while two start to mark a pattern. When the judge evaluates a new brief, it has something concrete to reason from: does this brief achieve what the gold ones achieved? Two strong briefs of yours, both labeled as passing, teach more than a paragraph describing the ideal brief, because they demonstrate the standard instead of gesturing at it.
Show failing with rejected examples so polish alone cannot pass
A judge also needs to see what you turn down, and why. A rejected example is an output you consider weak, paired with a short note naming what made it weak. Include at least two. The judge reads these as anti-patterns: if a new brief shares the qualities that sank a rejected one, it should fail. The note is what carries the lesson. A brief that reads smoothly but quietly invented two meetings is exactly the case a description would miss and an annotated rejected example catches, because it teaches the judge that polish is not the standard and accuracy is.
Mark the line with a boundary case so close calls do not coin-flip
The hardest outputs to judge are the ones that sit right on the line, and those are the ones that reveal whether your judge is calibrated. A boundary case is an output close to the pass or fail line, with your own call written down and the reason for it. Include at least one. A brief that names your real meetings but buries the hard one in the middle might be a pass on a light day and a fail on a heavy one; whichever way you call it, saying why teaches the judge where the threshold really sits. Boundary cases are how you check that the judge agrees with you on the close calls, not only the obvious ones.
Calibrate the judge against your own calls before you trust it
A fresh judge is a guess about your standard, not yet proof. Treat the judge as a hypothesis you test against cases you have already decided. Run it across outputs you have personally scored, gold and rejected and boundary together, and compare its verdict to yours on each one. Where it agrees, good. Where it disagrees, the disagreement is the information: it points at a criterion that was too vague, a missing example, or a line you never drew clearly. You can hand the disagreement back to the judge and ask it to name which criterion or missing example produced the miss, then adjust the rubric, add the example, or sharpen the wording and run it again.
Use a concrete agreement bar so you know when to stop tuning and start trusting. A reasonable starting rule of thumb is to keep adjusting until the judge agrees with your own calls on at least eighty percent of cases before you let it review new work. Treat that figure as a place to begin, not a law, and raise it for higher-stakes tasks where a missed flag costs more. Below your chosen bar, the judge is still applying a standard close to yours but not yet yours, and the gaps will show up as work it waves through that you would have flagged. Once the morning-brief judge clears the bar on your scored briefs, you can let it watch tomorrow's brief instead of reading it yourself.
Fold the calibrated judge into the workflow so future runs inherit it
A calibrated judge is too valuable to leave as a one-time test. Once the judge agrees with you, fold it into the instructions for the recurring task so every future run is checked against the same rubric and examples without you wiring it up again. The judge stops being a thing you ran once and becomes a standing reviewer that travels with the work. For the morning brief, that means the brief now generates and gets judged in one workflow, and you see the verdict instead of re-reading the brief from scratch each day. You have just made an abstraction jump: a standard you held by feel is now an automated check riding inside the workflow, one level higher than where you started, and your attention is free for the next thing.
From there you evolve the system rather than rerun it. When a new kind of bad output slips past the judge, the system can surface that escape as a fresh example: you can ask the judge to analyze its own misses, add the case, update the rubric, and recalibrate. This is also a place to play: chase a 'wouldn't it be cool if the judge could catch this' and try teaching it a subtler quality you used to spot only by feel, then see whether it holds on your scored cases. The workflow that produces your work and the judge that guards it both get sharper over time, each catching what the other learns, and the standard itself keeps climbing. A judge built this way is never finished, and that is the point: it grows with your standard rather than freezing it. The next chapter, on keeping current as models change, takes up the question this raises, which is how to keep a folded-in judge from drifting out of agreement with you as the ground keeps moving.
