Meli AI Mentoring

1.9

They build evaluation loops before they scale delegation

Chapter Progress: Early Draft

Chapter Progress

Agent instructionsPaid book

refers to the correspondence between a person's trust in the automation and the automation's capabilities.

— John D. Lee and Katrina A. See (2004) — Trust in Automation: Designing for Appropriate Reliance, Human Factors

Match how much you verify to what an error would cost

Think about how you come to trust a new coworker. After they nail a few summaries, you stop rereading their summaries line by line. After they once send a client the wrong number, you double-check anything with numbers in it for a while. Your trust tracks what they have shown they can do on that specific kind of task, not how sure they sound and not your warm overall feeling about them. You are running a quiet feedback loop: each result they hand back nudges how closely you watch the next one.

AI makes that loop harder to run than the coworker does, because its confidence does not track its competence. The model can do some tasks well and fail just as fluently on others that look equally hard from where you sit. A confident wrong answer reads exactly like a confident right one, and when trust drifts toward whatever sounds fluent, a uniformly fluent system hides the line between what it gets right and what it gets wrong. Evaluation is how you find that line for your own work: you check the output against your standards before you rely on it, and you check harder where being wrong would cost more.

Hand-drawn layered evaluation stack showing AI output passing through human judgment, a rubric, and a judge before reaching a trust decision scaled by stakes. — Evaluation is not suspicion or trust; it is deciding how much checking this output deserves.

Naming what you just felt makes it usable. Researchers who study people working alongside automated systems call the coworker move trust : matching how much you rely on a system to the competence it has shown for the task at hand, not to how sure it sounds (John D. Lee and Katrina A. See, 2004). The same idea, stated as a definition you can keep, looks like this.

The reason takes deliberate effort with AI has a name worth keeping. The boundary between tasks AI handles well and tasks it fails is not a smooth line you can predict from difficulty. It is jagged: irregular, specific to the task, and invisible until you test it. Two requests that look equally hard to you can land on opposite sides of it, and the boundary moves every time the model updates.

The practical rule that falls out of a is a pacing rule. Scale how much you delegate only as fast as you scale how much you check. Handing the system more of a task is safe only to the degree the checking around it keeps up, because the place a model fails can be the very place that looks safe from the outside. The next sections build the checking that lets delegation grow without outrunning your ability to catch a confident mistake.

Three levels of checking build on each other

The way to make checking keep pace with delegation is to make your judgment explicit enough to apply the same way twice. The three levels are one judgment captured at three depths: your own read, a written rubric, and a judge prompt. Each level takes the judgment from the level before it and makes it portable, so the same standard reaches more output without you rereading every piece by hand. They build on each other in order, and you only climb to the next level when the task earns it.

Start every evaluation with your own read, and let it sharpen

The first level is your own taste, expertise, and standards. You read the output and ask the plain questions: does this meet my bar, is it specific enough, is it true, would I put my name on it. Everything above this level depends on it, because a rubric and a judge can only carry a judgment you were able to make in the first place. You stay accountable for the call, and you take the better read whether it came from you or from the judge prompt you built from your own read.

Your read gets more precise with use, and that precision is what the higher levels capture. The quality-ratchet chapter teaches you to sharpen it by naming the specific patterns of weak output you keep rejecting. As you name them, you can say which criterion an output violates and where the gap sits, instead of only feeling that something is off. That named precision is exactly what you later hand to a rubric and a judge, so the first level never stops improving even as the work moves up to the next two.

Write a rubric to make your standard explicit and portable

The second level is a rubric: a written set of criteria that says what good looks like for one specific task. A rubric turns a standard you have been carrying in your head into words, and once it is words it travels. You can paste it into a prompt, hand it to a colleague, or feed it to a judge prompt so the same criteria reach many outputs instead of living only in your attention on the one in front of you.

Here is the honest version of what a rubric buys you. A well-written rubric, applied by a judge prompt, holds your criteria more consistently than a tired person rereading the tenth draft at the end of the day, but only inside the boundaries you have written down and tested. It will not catch a failure you never named, and it will not notice a new kind of error that shows up after a model update. Consistency inside known boundaries is the thing it gives you, and it is worth a great deal, as long as you do not mistake it for coverage you have not built.

A rubric does not need to be elaborate. Three to five criteria are usually enough. Each criterion names one quality dimension, such as accuracy, specificity, tone, structure, or evidence, then describes what strong output looks like on that dimension and what weak output looks like. You do not write it from a blank page: the chapter shows how to ask AI to draft the rubric from a few good and bad examples, after which you correct it where it missed the distinction you care about.

Run a judge prompt in its own conversation to apply the rubric at scale

The third level uses one AI conversation to evaluate another's output. A judge prompt is a separate prompt whose only job is to score an output against your rubric. You give it the output, the rubric, and examples of strong and weak results, then ask it to rate each criterion and say why. Run it in its own conversation so it judges fresh, instead of defending work it just produced in the same thread. Once your standard is written into the rubric and shown through examples, the judge can apply it across many outputs while your attention shifts to the edge cases it flags, where you keep the final call.

A judge goes wrong in one quiet way: a judge that approves everything has stopped judging. If you only ever show it strong examples, it drifts into a pass-by-default pattern and waves through the next weak output just as cheerfully. The fix is to feed it the rejects: paste in outputs you turned down, each with the reason it failed. A judge that can tell you why a bad output is bad is one you can trust to flag the next one. Test it the way you tested the coworker, by handing it a result you know is wrong and watching whether it catches the error.

Why fluent output and confident output drift apart

For the curious, here is the mechanism under the problem, and you can skip it without losing the practice. A current language model is trained to produce text that reads as fluent and well-formed, and a confident tone is part of what reads as well-formed. Fluency is a property of the writing; correctness is a property of the match between the writing and the world. A fluent, confident tone does not reliably indicate correctness, so a passage can score high on the first while failing the second.

This is why your own confidence, watching the output, is not a guide you can trust. The cue you naturally read, how sure the text sounds, is the cue the system controls best and the one least connected to whether the content is right. A rubric and a judge help because they move your attention off the tone and onto named criteria you can check against the task: is the citation real, is the number sourced, does the tone fit the audience. They do not make the model honest. They give you a place to look that is harder to fake than fluency.

Match the depth of checking to what an error would cost

Checking everything is as wasteful as checking nothing. If you run all three levels on every output, the cost of evaluation swamps the work, and you stop doing it. If you wave everything through, the one confident mistake on the lands where it hurts. Calibrated trust splits the difference by cost, so the careful checking lands where a mistake would hurt the most and the cheap checking covers the rest.

The rule is plain: check hardest where being wrong is expensive, and trust more freely where the task is well-tested and a mistake is cheap to undo. A published client report earns all three levels. A few internal brainstorming notes earn a quick first read and nothing more. The depth of checking is a dial you set by stakes, not a fixed routine you run the same way on everything.

Picture the running example for this chapter to see the dial move. You are shaping software to fit how you live and work, describing what you want while the AI builds it. As of 2026 that looks like typing into a chat model; tomorrow it might be a voice you talk to, glasses that carry world knowledge, earbuds, or a brain-computer interface. The principle holds whatever the interface becomes. Suppose your setup can now do two things for you: draft a reminder to text a friend back, and move money between your accounts when a bill is due. The trap is uniform trust. If both came from the same AI in the same session and both read fluently, it is tempting to wave both through, or to get burned once and start rereading even the reminders.

Calibrated trust splits the two actions by cost. A misfired reminder is a small annoyance you can fix in seconds, so a quick read is enough. A wrong transfer moves money you may not get back, so it earns the rubric, the judge, and a final human confirm before anything executes. Same system, same session, two different amounts of checking, each set by what an error would cost. Where you want the storage and reuse mechanics for the rubrics and standards this produces, the reusable-AI-assets chapter carries them; here the point is only how deep to check.

A rough guide groups outputs into three cost tiers. High-cost outputs (published claims, client deliverables, medical or legal information, anything that moves money) earn all three levels. Middle-cost outputs (meeting summaries, status reports, routine messages) earn a rubric check without a full judge pass. Low-cost outputs (internal drafts, brainstorming, personal notes) earn a quick first read. The later chapters carry this further into choosing which model to use, stress-testing, and verifying sources; this chapter sets the frame.

Feed verdicts back into evolving the system that produced them

Evaluation is the step that turns a verdict into an upgrade. When a judge flags an output as weak, that flag is a clue, not the end of the story. The next move is diagnostic, and it does not have to fall to you alone: you can ask the AI to analyze its own output against the rubric and name which criterion it missed, then which part of the system let the miss through, the prompt, the context it had, the model you chose, or the workflow around it. That feeds back into , which evolves the part that failed into a stronger version of itself, which produces a better output, which the judge checks again. The verdict becomes a change to the system rather than a complaint about one result.

Every serious interaction produces two things: the output you asked for, and evidence about how well the system that produced it is doing on this kind of task. Evaluation is how you read that evidence. It is also how you check whether the rest of your loop still holds: whether the standard you keep rejecting against is catching the right things, whether a prompt you improved last month still produces what you want, whether a saved rubric survived the latest model update. Test whether the kept thing still holds, because a model update can quietly move the boundary under a standard you set when the old model was in place.

That is the compounding loop running with the brakes on. You experiment, look at what came back, and hold it to a standard. Where it falls short, you let an evaluation surface which part of the system produced the miss, improve the conditions that produced it, and save what worked. The culminating step is the one that climbs a level: you evolve the system, handing more of the judgment to the rubric, the judge, or the standard so the next run starts higher than this one did. The output you got is the cheap part; the judgment you wrote into the system is what carries to the next task. The output is temporary; the system you evolve compounds.

Build a three-level evaluation check for one taskPaid book · Claude builds a quick read, a rubric, and a judge prompt for one recurring AI task, then calibrates the judge against your own assessments.

References

1 source

1
Navigating the Jagged Technological Frontier
Dell'Acqua, McFowland, Mollick, et al. · 2023 · Harvard Business School Working Paper 24-013, 2023.
Consultants at Boston Consulting Group were split into three groups: no AI, GPT-4, and GPT-4 with a short how-to overview. On a task that sat inside the frontier, the AI groups did clearly better, completing 12.2% more subtasks, working 25.1% faster, and producing results rated more than 40% higher in quality than the group without AI. On a task built to sit outside the frontier, the pattern flipped: consultants using AI were 19 percentage points more likely to reach a wrong answer than consultants working without it, because the model sounded just as confident on the task it was getting wrong.
The study compared no AI, AI, and AI with an overview. It did not test anyone working through a rubric and a judge. So read the three-level evaluation in this chapter as a discipline extrapolated from this evidence, not as a result the study proved: if uniform confidence is what hurt the outside-frontier group, then a standing way to verify proportionally, instead of trusting evenly, is the defense that fits the problem the study found.