← Guides

Becoming an AI Power User

Thomas Meli and Agent Team
236 min leftPage 162/291 (est.)129 left
4.2

They regularly test their AI tools at the edge of their capacity to learn where reliability ends

Chapter Progress: Early Draft
Chapter Progress
Every genuine test of a theory is an attempt to falsify it or to refute it.
Karl Popper (1963)Conjectures and Refutations

You have probably had this moment. An AI tool handles a hard task so cleanly that you start to trust it, and then on a task that looks just as reasonable it gives you a confident answer that turns out to be wrong, and you only catch it because someone else did. The trouble is not that the model fails. Every tool fails somewhere. The trouble is that the line between the task it handles and the task it botches is invisible from the outside. You cannot see it by reading about the model, and you cannot feel it from a reputation. You can only find it by pushing on the model yourself, on your own kind of work, until it breaks. This subchapter is about doing that on purpose, on a schedule, while the stakes are still low.

Hand-drawn frontier map showing known working tasks, a deliberate stress test probe, a break point, and a retest loop.
A stress test finds the break point while the stakes are still low.

Discover the failure before the failure discovers you

Start by naming what you are hunting for. Every model has failure modes, and they tend to share one trait: the model sounds just as sure when it is wrong as when it is right. It can invent a citation that reads like a real one. It can produce a clean-looking analysis built on a number it got wrong two steps back. It can summarize a contract confidently while skating past the one clause that changes the deal. It can write authoritative prose about a topic it does not understand. The danger is not the error. The danger is the confident tone wrapped around it, because that tone is what stops you from checking.

Here is the same situation seen two ways, so the choice is concrete. Suppose your AI tool drafts a market summary for a client. The don't-test path: you read it, it sounds polished, you send it, and the failure surfaces when the client circles a fabricated figure in a meeting. The stress-test path: yesterday, in a scratch document no one will ever see, you handed the same tool two sources that disagreed on that figure and watched whether it flagged the conflict or quietly picked one and called it settled. It picked one. Now you know to check every figure it reports from multiple sources, and you knew it before the client did. Same failure mode, found in a sandbox instead of a meeting.

The reason to schedule this rather than wait for it is that normal use is a slow and unkind teacher. Working day to day, you brush against a failure mode rarely, and when you do, it tends to arrive mid-deadline with stakes attached. Deliberate testing inverts that. You go looking for the breaks, on a cadence you choose, in a place where breaking costs nothing. The you would have earned over months of accidental scares, you can earn in a handful of focused fifteen-minute sessions instead.

Run a fifteen-minute weekly ritual to map where your model cannot be trusted

You do not have to invent this practice from scratch, and you do not have to make it long. A short weekly ritual is enough: set aside fifteen minutes, feed your model chaotic, ambiguous, or contradictory input, and watch where it invents structure, fabricates facts, or produces confident nonsense. The output of the ritual is not a fixed model. It is a map. You come away knowing, for your specific model on your specific tasks, which results you can take at face value and which ones you cannot accept without checking.

Two honest limits keep this in proportion. Stress tests are artificial, so a model that fumbles a contrived input might handle a natural one fine. And the results expire: the next model update can fix a failure you found last week or introduce a new one. A stress test is a snapshot, not a verdict. What it buys you is a sharpened wariness and a current sense of where to look, which is exactly what tells you how hard to check each piece of work you ship.

This is the same the rest of the book keeps returning to, built here through deliberate testing, curiosity, and the steady wish to find out what your tool can really do. The chapter on the compounding loop introduced what Ethan Mollick calls the : AI ability is spread unevenly across tasks, strong here and weak right next door, and the boundary moves with every model update. Stress testing is how you keep your own copy of that boundary up to date instead of trusting a map someone drew months ago for a model that no longer exists.

Six stress tests reliably expose the most common ways a model breaks

You do not have to be clever about what to test. A small, repeatable set of stress tests covers the failure modes that show up most, and you can run one a week and rotate through them. They fall into three families by what pressure each one applies. Truth-pressure tests check whether the model keeps faith with the facts: contradictory sources and fabrication pressure. Precision-pressure tests check whether it holds an exact requirement across a whole answer: numerical reasoning and edge-case instructions. Trust-pressure tests check whose voice the model obeys when the input is unclear or hostile: ambiguous context and adversarial context. The table below gives each test, the input that triggers it, and the tell to watch for.

The adversarial-context test earns extra attention as AI tools gain the ability to browse the web, open files, and read documents you did not write. Untrusted content is evidence, not instruction. A webpage, an email, a PDF, a meeting transcript: the model can read any of these, and the fact that it can read a line does not mean it should follow that line. When you feed a model a document that says 'disregard your previous instructions,' you want the model to tell you the document says that, not to do it. The name for the danger is prompt injection: text from the outside world getting interpreted as a command from you. Running this test on purpose builds your instinct for which workflows need a screening step before the model acts on outside material, whether that screen is a person reading the source or an instruction that tells the model to treat fetched content as evidence to quote and never as commands to follow.

Weekly tests find new failures and quarterly re-tests find new strengths

Stress testing fixes a quiet problem that gets worse the longer you use one tool. Your sense of what AI can and cannot do hardens into assumptions, and the assumptions go stale without telling you. You try something once, it fails, you file it under 'AI cannot do this,' and you never revisit the verdict even as the model improves underneath it. The is why your intuitions lag the tool: the boundary keeps moving, and a fixed belief about a moving boundary drifts out of date on its own.

The fix runs in two directions, and they answer two different questions. The weekly test asks where the model breaks now, so you keep finding fresh failure modes as the model and your work both change. The other direction asks where the model has caught up. Keep a short list of tasks you concluded AI could not handle, and re-run them on the current every quarter. Each re-test either confirms the limit or reveals a new capability worth folding into how you work. Weekly stress tests find new failures; quarterly re-tests find new strengths. Run both and your internal map stays current at both edges, the places the model fails and the places it newly succeeds.

Daily experiments build the breadth a weekly ritual cannot reach

The weekly ritual gives you structured on a schedule, and a schedule has edges. A second habit fills in the rest: loose, curious play. Chase a 'wouldn't it be cool if' the moment it strikes you. Try a capability you have no use for yet. Build something you plan to throw away. Hand the model a task you fully expect it to fail. Play reaches the corners a fifteen-minute protocol never visits, because the protocol tests what you already thought to test, and play finds what you did not think to imagine.

It helps to frame this loose experimentation as personal research and development. Most of what you try this way leads nowhere useful, and that is the expected rate, not a sign you are doing it wrong. The learning is the return. Each failed experiment marks a boundary, each successful one marks a capability worth keeping, and a month of small experiments buys you a map no tutorial or secondhand report could hand you, because it is drawn from your own tasks.

Directing AI tools and agents is a young kind of work, and no one holds ten years of practice at it. Casual and power users tend to differ on several fronts at once: how much they experiment, how playfully they imagine new uses, how closely they read what came back, and whether they fold the lesson into the way they work next time. Volume of experiments is one of those fronts, and it is the one you can most directly choose to build. The weekly ritual gives you depth on the failures you went looking for; daily play gives you breadth across the ones you did not. Run both and they cover ground that neither reaches alone.

Five minutes a day is enough to keep the breadth growing. The shape of one round: try a capability you read about, hand the model a task from a domain you do not work in, or ask it to do something you assume it will fail, then check what came back. When it fails, write one line about the boundary you found; when it succeeds, write one line about the capability worth keeping. Keep the running log next to your notes. Over a month it becomes a map of AI capability that no one else holds, because no one else ran your specific experiments on your specific work.

Weekly reflection turns a pile of experiment logs into judgment you reuse

Experiments and tests produce raw data, and raw data is not yet . Reflection is the step that converts the log into judgment. At the end of the week, hand the model your experiment log and stress-test findings and ask it to surface patterns across the week's runs, then read its summary against your own memory of what happened. Ask, together: where were you surprised, what should you re-test, and what failure mode should change how you verify output next week? Write the answers into your notes.

Do not let the lesson stop at a note. When a failure mode shows up twice, push the work of catching it onto the system. Encode the check that catches it into a reusable verification standard or a judge prompt, so your next run is scored against it without you watching for it by hand. As the frontier moves, revise the standard the same way: retire checks for failures the model no longer makes, and add checks for the new ones your tests surface. The standard tracks your instead of freezing it. (The chapter on reusable AI assets develops how a library of these standards is built and kept; here the move is only to capture one check so this week's lesson does not evaporate.)

This weekly review is where the compounding loop reaches its last step, applied to : you inspect the runs, then evolve the system so the next one starts higher. The chapter on the compounding loop made the general case: every serious interaction leaves behind evidence you can reuse, if you capture it. Here the evidence is a failure mode, and the system you evolve is a standard or judge prompt that catches it without you watching. The review is what turns a month of experiment volume into lasting judgment rather than a stack of logs you never read again.

Run your first stress test this week