They maintain calibration as models change because every AI opinion has an expiration date
Chapter Progress: Early DraftEssentially, all models are wrong, but some are useful.
In January you ask AI to summarize a client call, and it nails the tone and catches the one decision that mattered. You learn to trust it for that. So in April you stop reading its summaries closely, because why would you, it has earned the pass. What you cannot see is that a model update in March changed how it handles the part of the call where people talk over each other, and now it sometimes drops the decision and keeps the small talk. The trust you formed in January is still in your head. The behavior it was based on is gone. You are working from a map of a place that has since been rebuilt.
The repair is one habit, and the rest of this chapter is that habit in three parts. You keep a dated map of where AI is strong and where it breaks in your own work, and you redraw it on a schedule and after every model update. A small table holds the map. A short list of triggers tells you when to retest. A recurring appointment makes sure the redrawing gets done. None of the three leads. The table without the schedule goes stale, the schedule without the table has nothing to update, and a model update with neither catches you flat. Held together, they keep a living map current with a few minutes of work a month.

A dated grid turns your scattered trust judgments into a map you can see
Right now your sense of where AI is reliable lives in your head as a loose feeling: this kind of task usually works, that kind needs a second look. The trouble with a feeling is that it has no date and no edges. You cannot tell which parts are fresh and which went stale months ago, because a feeling does not record when you last checked it. The fix is to write the map down in a form that carries dates. A trust grid is a small table, one row per recurring task, that records how reliable the model has been at that task, how much you verify it, and when you last tested it. Putting it on paper does one thing your memory cannot: it shows you, at a glance, which judgments are current and which are running on trust you can no longer account for.
Use the morning brief as the running example, the same one this book threads throughout: getting AI to write you a short rundown of your day so you can plan it, built as one piece of a setup shaped to how you live and work. As of 2026 you build it by describing what you want to a chat model and refining what comes back, and the interface will keep changing under it, to voice, to glasses, to whatever follows. The brief leans on a handful of recurring tasks: reading your calendar, pulling the open items off your task list, drafting a one-line note about the day's hardest meeting. Each of those is a row in the grid. For each row you record the model and tier you use, whether it has been tested in the last 30 days, the failure already observed on that task, and the level of verification you apply before you trust the result. You do not have to catch every failure by eye: you can ask the model to grade a batch of its own past outputs against a short rubric and surface the cases where it slipped, so the failure column fills itself from evidence rather than from memory.
Don't apply this and the grid stays a feeling with no edges. You keep using the brief every morning, sense vaguely that the calendar part is solid and the hard-meeting note is hit-or-miss, and never write either down. A model update lands and you cannot say which rows it touched, because you never had rows. Apply it and each judgment carries a date you can act on. The calendar row says tested two weeks ago, reliable, light verification. The hard-meeting-note row says last tested in February, reliability unknown since, full read-through before you trust it. The second row is now visibly a candidate for your next retest, and you can see that without reconstructing two months of mornings from memory.
Match how hard you check to the task to avoid two opposite mistakes
The grid tells you how much a task can be trusted; the next question is what to do with that. The instinct is to pick one verification habit and apply it to everything, but a single setting is wrong in two directions at once. Check everything by hand and you throw away the speed that made AI worth using, and you do it most on the dependable tasks where the checking buys you nothing. Trust everything and a confident error on a high-stakes task reaches someone before you catch it. The setting that avoids both is not a single setting; it is one you tune per task, from evidence. You verify hard where the stakes are high and the track record is thin or unknown, and you trust more freely where the task is well-tested and an error would cost little, and you stay accountable for whatever you ship either way.
Three rows from the morning brief show the dial at three settings. Reformatting your task list into a clean rundown is tested and low-stakes, so you let it run and glance at the result. The one-line note about your hardest meeting is medium-stakes and generally reliable, so you spot-check it against what you know of that meeting. A row that pulls a figure into the brief, say a budget number a decision will hang on, is high-stakes with a known habit of confident errors, so you check the source every time regardless of how good the model has looked lately. Same person, same morning, three different amounts of checking, each one set by the task in front of you rather than by a blanket rule. The prose has built the idea; here is its name.
The grid is what makes a call you can defend instead of a guess. When a row has been tested recently and held up, you lower verification on it and reclaim the time. When a row is new, untested, or carries a history of failure, you raise verification until a fresh test earns it back down. When the stakes are high enough, you verify fully no matter how clean the track record looks, because the cost of being wrong, not the model's average, decides the floor.
After a model update, retest the three kinds of task most likely to have shifted
A model update is the event that can redraw whole regions of your map at once, and it is tempting to respond by retesting everything or by ignoring it and hoping. Both tend to cost you. Retesting everything is the practice eating the work; ignoring it leaves you trusting behavior that may no longer exist. The move is a targeted sweep of the rows most likely to have moved. Three families of task are worth the minutes, and naming them keeps the sweep short. An update is also the most natural moment to play: it is the day to throw a task the old model could never touch at the new one and see what it can now do, because the boundary you mapped last month may have jumped.
The first family is tasks the model used to fail. An update is the most likely reason a task that was out of reach last month is in reach now, so a row marked unreliable is worth a fresh attempt; the boundary may have moved toward you. The second family is tasks the model used to ace. The same update that lifts one task can quietly drop another, and a regression on a task you stopped watching is the kind of failure that reaches a stakeholder before you catch it, so retest your strongest rows precisely because you trust them. The third family is tasks that lean on a specific behavior, a particular format, a tone, a way of handling a tricky input, because a subtle shift in behavior is easy to miss and can break a brief that depended on the old way.
A recurring appointment is what turns the intention to recalibrate into recalibration
Everything to this point is sound and can still go undone, because recalibration competes with live work and tends to lose to it. An intention to retest soon is how a map goes stale while feeling tended. What converts the intention into the act is a recurring appointment you keep with yourself, the same way you keep a standing meeting. On that cadence you open the grid, retest any row you have not checked in 60 days, ask the model to flag the failures it produced since, adjust verification levels where the stakes or the track record moved, check whether a model update changed the frontier for your work, and fold any new examples or criteria back into your saved prompts and rubrics. The cadence does the remembering so you do not have to.
Each recalibration session runs the same improvement engine the rest of this book uses. You retest and play with what the new model can do, you imagine outputs better than the ones you are getting, you grade the results against your gold examples, you let the model surface what drifted, and then you evolve the system itself: the sharper rubric, the new gold example, the failure mode you had not named, and sometimes a better task to hand off entirely all get saved back into the grid and the test rig. That last step is the one that compounds. The next recalibration does not start from where this one did; it starts from a sharper standard and a higher ceiling, so your map gets not only refreshed but harder to fool over time. The trade-off still holds: use enough of this routine to protect the work, not so much that the routine becomes the work. Five rows on a monthly cadence protects a brief that real decisions ride on; auditing every throwaway prompt would be the routine eating the day.
