Meli AI Mentoring

4.5

They maintain calibration as models change because every AI opinion has an expiration date

Chapter Progress: Early Draft

Chapter Progress

Agent instructionsPaid book

Essentially, all models are wrong, but some are useful.

— George E. P. Box (1987) — Empirical Model-Building and Response Surfaces

In January you ask AI to summarize a client call, and it nails the tone and catches the one decision that mattered. You learn to trust it for that. So in April you stop reading its summaries closely, because why would you, it has earned the pass. What you cannot see is that a model update in March changed how it handles the part of the call where people talk over each other, and now it sometimes drops the decision and keeps the small talk. The trust you formed in January is still in your head. The behavior it was based on is gone. You are working from a map of a place that has since been rebuilt.

A trust judgment is dated, and model updates expire it without telling you

Earlier chapters showed that AI ability is uneven across tasks: dependable on some, confidently wrong on others that look just as reasonable, with a jagged boundary between the two that you can only find by testing on your own work. This chapter adds the part that gets people: that boundary moves. Every model update can shift it. Some tasks improve. Some regress. Some change in ways too subtle to notice until a result reaches someone who depended on it. Your own standards drift too, sharpening as you gain experience, so even a model that held still would not stay calibrated against you.

So a trust judgment has an expiration date, even though nothing prints the date on it. The judgment that the model handles your client summaries reliably was true on the day you tested it, against the model that existed that day. Models update, prompts drift, tasks evolve, and your bar rises. Left alone, that judgment keeps feeling true long after it stopped being true, and a stale trust judgment is more expensive than a fresh test: it costs you the missed opportunities on tasks that quietly improved, and the preventable failures on tasks that quietly degraded.

The repair is one habit, and the rest of this chapter is that habit in three parts. You keep a dated map of where AI is strong and where it breaks in your own work, and you redraw it on a schedule and after every model update. A small table holds the map. A short list of triggers tells you when to retest. A recurring appointment makes sure the redrawing gets done. None of the three leads. The table without the schedule goes stale, the schedule without the table has nothing to update, and a model update with neither catches you flat. Held together, they keep a living map current with a few minutes of work a month.

Hand-drawn recalibration map showing old calibration interrupted by a model update, a retest calendar, and a refreshed current boundary. — Every model update can move the boundary, so useful trust judgments need dates and retests.

A dated grid turns your scattered trust judgments into a map you can see

Right now your sense of where AI is reliable lives in your head as a loose feeling: this kind of task usually works, that kind needs a second look. The trouble with a feeling is that it has no date and no edges. You cannot tell which parts are fresh and which went stale months ago, because a feeling does not record when you last checked it. The fix is to write the map down in a form that carries dates. A trust grid is a small table, one row per recurring task, that records how reliable the model has been at that task, how much you verify it, and when you last tested it. Putting it on paper does one thing your memory cannot: it shows you, at a glance, which judgments are current and which are running on trust you can no longer account for.

Use the morning brief as the running example, the same one this book threads throughout: getting AI to write you a short rundown of your day so you can plan it, built as one piece of a setup shaped to how you live and work. As of 2026 you build it by describing what you want to a chat model and refining what comes back, and the interface will keep changing under it, to voice, to glasses, to whatever follows. The brief leans on a handful of recurring tasks: reading your calendar, pulling the open items off your task list, drafting a one-line note about the day's hardest meeting. Each of those is a row in the grid. For each row you record the model and tier you use, whether it has been tested in the last 30 days, the failure already observed on that task, and the level of verification you apply before you trust the result. You do not have to catch every failure by eye: you can ask the model to grade a batch of its own past outputs against a short rubric and surface the cases where it slipped, so the failure column fills itself from evidence rather than from memory.

Don't apply this and the grid stays a feeling with no edges. You keep using the brief every morning, sense vaguely that the calendar part is solid and the hard-meeting note is hit-or-miss, and never write either down. A model update lands and you cannot say which rows it touched, because you never had rows. Apply it and each judgment carries a date you can act on. The calendar row says tested two weeks ago, reliable, light verification. The hard-meeting-note row says last tested in February, reliability unknown since, full read-through before you trust it. The second row is now visibly a candidate for your next retest, and you can see that without reconstructing two months of mornings from memory.

A grid that has gone stale is more dangerous than no grid at all

A written grid earns a trust of its own, and that trust is the part to watch. Once a judgment is on paper it reads as settled, so a row that says reliable keeps granting the pass long after the behavior it described has changed. The date column is what protects you from your own table. A row without a recent date is not a verdict you can lean on; it is a question waiting for a retest. Read every undated or stale row as unknown rather than as proven, and the grid stays a map instead of hardening into a set of assumptions you stopped checking.

This is why you do not retest everything constantly: that turns the practice into the work. Retest the rows that earn it, the tasks that are high-value, high-stakes, recently changed, previously fragile, or leaning on a specific model behavior. A check on your five most-used tasks catches drift before it reaches a stakeholder, and leaves the throwaway rows alone.

Match how hard you check to the task to avoid two opposite mistakes

The grid tells you how much a task can be trusted; the next question is what to do with that. The instinct is to pick one verification habit and apply it to everything, but a single setting is wrong in two directions at once. Check everything by hand and you throw away the speed that made AI worth using, and you do it most on the dependable tasks where the checking buys you nothing. Trust everything and a confident error on a high-stakes task reaches someone before you catch it. The setting that avoids both is not a single setting; it is one you tune per task, from evidence. You verify hard where the stakes are high and the track record is thin or unknown, and you trust more freely where the task is well-tested and an error would cost little, and you stay accountable for whatever you ship either way.

Three rows from the morning brief show the dial at three settings. Reformatting your task list into a clean rundown is tested and low-stakes, so you let it run and glance at the result. The one-line note about your hardest meeting is medium-stakes and generally reliable, so you spot-check it against what you know of that meeting. A row that pulls a figure into the brief, say a budget number a decision will hang on, is high-stakes with a known habit of confident errors, so you check the source every time regardless of how good the model has looked lately. Same person, same morning, three different amounts of checking, each one set by the task in front of you rather than by a blanket rule. The prose has built the idea; here is its name.

The grid is what makes a call you can defend instead of a guess. When a row has been tested recently and held up, you lower verification on it and reclaim the time. When a row is new, untested, or carries a history of failure, you raise verification until a fresh test earns it back down. When the stakes are high enough, you verify fully no matter how clean the track record looks, because the cost of being wrong, not the model's average, decides the floor.

After a model update, retest the three kinds of task most likely to have shifted

A model update is the event that can redraw whole regions of your map at once, and it is tempting to respond by retesting everything or by ignoring it and hoping. Both tend to cost you. Retesting everything is the practice eating the work; ignoring it leaves you trusting behavior that may no longer exist. The move is a targeted sweep of the rows most likely to have moved. Three families of task are worth the minutes, and naming them keeps the sweep short. An update is also the most natural moment to play: it is the day to throw a task the old model could never touch at the new one and see what it can now do, because the boundary you mapped last month may have jumped.

The first family is tasks the model used to fail. An update is the most likely reason a task that was out of reach last month is in reach now, so a row marked unreliable is worth a fresh attempt; the boundary may have moved toward you. The second family is tasks the model used to ace. The same update that lifts one task can quietly drop another, and a regression on a task you stopped watching is the kind of failure that reaches a stakeholder before you catch it, so retest your strongest rows precisely because you trust them. The third family is tasks that lean on a specific behavior, a particular format, a tone, a way of handling a tricky input, because a subtle shift in behavior is easy to miss and can break a brief that depended on the old way.

Reusable assets are what let a retest take minutes instead of an afternoon

Targeted retesting only stays cheap if a retest is fast, and what makes it fast is having saved the pieces a test needs. A reusable prompt, a written rubric, a judge prompt, and a few gold examples of the output you want let you rerun a task and grade the result in minutes: run the saved prompt, apply the judge, compare against the gold examples, read the gap. Without those pieces, every retest means rebuilding the test from scratch, and a practice that costs an afternoon per task is a practice you will skip.

The chapters on encoding your expertise that follow are where you build those reusable pieces deliberately. For , the point to carry is narrower: the prompts, rubrics, and gold examples you save while improving a task double as the test rig that keeps your map cheap to redraw.

A recurring appointment is what turns the intention to recalibrate into recalibration

Everything to this point is sound and can still go undone, because recalibration competes with live work and tends to lose to it. An intention to retest soon is how a map goes stale while feeling tended. What converts the intention into the act is a recurring appointment you keep with yourself, the same way you keep a standing meeting. On that cadence you open the grid, retest any row you have not checked in 60 days, ask the model to flag the failures it produced since, adjust verification levels where the stakes or the track record moved, check whether a model update changed the frontier for your work, and fold any new examples or criteria back into your saved prompts and rubrics. The cadence does the remembering so you do not have to.

Each recalibration session runs the same improvement engine the rest of this book uses. You retest and play with what the new model can do, you imagine outputs better than the ones you are getting, you grade the results against your gold examples, you let the model surface what drifted, and then you evolve the system itself: the sharper rubric, the new gold example, the failure mode you had not named, and sometimes a better task to hand off entirely all get saved back into the grid and the test rig. That last step is the one that compounds. The next recalibration does not start from where this one did; it starts from a sharper standard and a higher ceiling, so your map gets not only refreshed but harder to fool over time. The trade-off still holds: use enough of this routine to protect the work, not so much that the routine becomes the work. Five rows on a monthly cadence protects a brief that real decisions ride on; auditing every throwaway prompt would be the routine eating the day.

Build your grid and set the cadence that keeps it alive

Build your trust calibration grid and recalibration schedulePaid book · Claude structures your AI tasks into a dated grid with verification levels and retest dates, then proposes a recalibration cadence. You confirm the ratings against fresh tests and put the cadence on your calendar.