Logit lens

A technique that peeks at what a model is predicting at each internal layer before it reaches the final answer, like checking a student's rough drafts before they submit the essay.

A model processes input through dozens of layers, and each layer refines the prediction further. The logit lens intercepts the signal at any layer, passes it through the model's own final output step to translate it into word probabilities (how likely each possible next word is), and shows what the model would predict if it stopped processing right there. Early layers often predict generic, common words; deeper layers tend to converge on the specific answer. An improved version called the tuned lens adds a small learned correction at each layer, producing cleaner intermediate snapshots.
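The projection described above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular library's API: the three-word vocabulary, the unembedding matrix `W_U`, the per-layer `HIDDEN_STATES`, and every function name are invented for the example. The optional `translator` argument stands in for the tuned lens's learned per-layer correction, which in practice would be trained rather than supplied by hand.

```python
import math

# Toy vocabulary and weights: all values here are invented for illustration.
VOCAB = ["the", "Paris", "London"]

# Hypothetical unembedding matrix, one row per vocabulary word.
W_U = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Made-up hidden states at one position, one vector per layer (early to late).
HIDDEN_STATES = [
    [2.0, 0.5, 0.4],
    [1.0, 1.2, 0.9],
    [0.2, 3.0, 0.5],
]


def layer_norm(h, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def logit_lens(hidden, translator=None):
    """Project one layer's hidden state onto the vocabulary.

    translator is an optional (matrix, bias) pair standing in for the
    tuned lens's learned per-layer correction; None gives the plain lens.
    """
    if translator is not None:
        matrix, bias = translator
        hidden = [y + b for y, b in zip(matvec(matrix, hidden), bias)]
    return softmax(matvec(W_U, layer_norm(hidden)))


def top_word(probs):
    """Return the vocabulary word with the highest probability."""
    return VOCAB[max(range(len(probs)), key=lambda i: probs[i])]


if __name__ == "__main__":
    for layer, hidden in enumerate(HIDDEN_STATES):
        probs = logit_lens(hidden)
        print(layer, top_word(probs), [round(p, 3) for p in probs])
```

With these made-up numbers, the first layer's top word is the generic "the", while the later layers settle on "Paris". In a real model the hidden states would come from the network's activations at each layer, and the unembedding and normalization would be the model's own.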

Builder example

The logit lens suggests that model answers are built incrementally, with different layers handling different levels of abstraction. Future debugging tools could pinpoint the layer at which a model goes wrong, showing, for instance, that the correct answer was forming through layer 20 and was then overridden at layer 25. That kind of diagnostic could help teams understand systematic failure patterns.
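As a rough sketch of what such a diagnostic might look like, the snippet below scans a list of per-layer top predictions and reports the first layer at which a previously leading answer is displaced. The function name `find_override_layer` and the sample predictions are hypothetical; a real tool would read the per-layer predictions from logit lens output on an actual model.

```python
def find_override_layer(top_by_layer, target):
    """Return the first layer whose top prediction is not `target` even though
    the previous layer's top prediction was. Return None if that never happens."""
    for layer in range(1, len(top_by_layer)):
        if top_by_layer[layer - 1] == target and top_by_layer[layer] != target:
            return layer
    return None


# Made-up top predictions for layers 0 through 5 of an imaginary model.
TOP_BY_LAYER = ["the", "the", "Paris", "Paris", "London", "London"]

if __name__ == "__main__":
    # "Paris" leads at layers 2 and 3 and is displaced at layer 4.
    print(find_override_layer(TOP_BY_LAYER, "Paris"))
```

This only looks at the single top word per layer. A more careful version might track the target word's probability or rank instead, since an answer can fade gradually rather than flip at one layer.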

Common confusion: The logit lens output is an approximate projection, a researcher's reconstruction of intermediate states. It does not capture everything happening at a given layer, and the model itself has no awareness of being observed this way.