interp.decode(activations)

Reading a Model's Mind

How Natural Language Autoencoders translate a model's internal state into plain English

Jun 7, 2026 · 11 min read · Kunsh Singh

Interpretability · NLA · Autoencoders · RL · Alignment

With AI being all anyone talks about lately, most of us have folded ChatGPT and Claude into our daily routines without thinking twice about what’s happening under the hood. You already know the basics — most LLMs run on a transformer architecture, stacked many layers deep. What most people don’t realize is that as your prompt moves through those layers, the model is passing around thousands of raw decimal numbers at every step. Not the words it’s about to say, but the machinery underneath them. Each of these values is part of what’s called an activation, and together they’re essentially how the model forms a thought — a response taking shape from your input, the way a reaction takes shape from a stimulus. The strange part? The model has no idea it’s having them.

These numbers are also effectively unreadable. They’re sitting right in front of us, but staring at a single activation is like getting a brain scan back with no clue which glowing region means what.

In May 2026, a team at Anthropic published a paper (Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations) that does something remarkable: it takes one of those number-soup vectors and translates it directly into plain English. Not the model’s output — its internal state. The method behind it is surprisingly clean, and by the end of this post you’ll understand exactly how it works.

01 The problem with reading minds

The core problem is deceptively simple. A single activation is just a list of numbers — for a model like Llama 70B, it’s 8,192 of them sitting in a row. That row encodes something real about what the model is doing. We simply have no way of looking at it and saying what.

One clarification before going further, because the terminology trips people up: “activation” shows up in two different senses. The activation function (ReLU, which zeroes out negative values, and its smoother relatives) is the small mathematical operation that gives the model its non-linearity. Activations, the noun, are the actual vectors flowing through the network. They’re related but distinct, and this post is about the second one: the vectors we’re trying to read.

That non-linearity deserves a moment, because it’s the entire reason any of this works. Consider the stock market. If TSLA went up every single time Elon tweeted — one clean, predictable cause and effect — that relationship would be linear. But the market is never that polite. Thousands of tangled factors push prices in ways no straight line can capture, and that messiness is non-linearity. Activation functions are precisely what let a model learn it — the whole reason a model can pick up anything more interesting than “X always causes Y.”
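To make the linear versus non-linear distinction concrete, here is a minimal sketch in plain Python. The weights and helper names are made up for illustration. It checks that two stacked linear layers collapse into a single linear map, while putting a ReLU between them breaks that equivalence.

```python
def relu(v):
    """Zero out negative entries; this is the non-linearity."""
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply matrix a by matrix b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

# Illustrative weights, not taken from any real model.
W1 = [[1.0, -1.0], [0.5, 2.0]]
W2 = [[2.0, 0.0], [-1.0, 1.0]]
x = [1.0, 3.0]

# Two linear layers are exactly one linear layer with weights W2 @ W1.
stacked_linear = matvec(W2, matvec(W1, x))
collapsed = matvec(matmul(W2, W1), x)

# With a ReLU in between, the single collapsed matrix no longer matches.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(stacked_linear, collapsed, with_relu)
```

Without the ReLU, depth buys nothing: any stack of linear layers is still one straight-line relationship. The non-linearity is what makes the extra layers count.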

We’ve had tools for peeking at activations for a while now — the logit lens, sparse autoencoders (SAEs) — and they do help. But all of them force you to describe a thought as a weighted bag of pieces pulled from some fixed list (tokens, or learned “features”), and a human still has to squint at a pile of examples and conclude that feature 4071 is probably the Golden Gate Bridge. It works; it’s just clunky, and you’re always translating.
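The “weighted bag of pieces” idea can be sketched in a few lines. This is a toy, assuming a tiny hand-made dictionary of orthonormal feature directions. Apart from the Golden Gate Bridge, the feature names are invented here, and a real SAE learns its features and uses a trained encoder rather than plain dot products.

```python
# Hypothetical feature dictionary: name -> unit-length direction.
# Real dictionaries are learned, and their features arrive unlabeled.
FEATURES = {
    "golden_gate_bridge": [1.0, 0.0, 0.0],
    "rhyme_planning": [0.0, 1.0, 0.0],
    "russian_text": [0.0, 0.0, 1.0],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_features(activation, k=2):
    """Score every feature by its dot product with the activation and
    return the k strongest as (name, weight) pairs."""
    scores = [(name, dot(activation, direction)) for name, direction in FEATURES.items()]
    scores.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return scores[:k]

activation = [0.25, 2.0, -0.5]  # illustrative values
readout = top_features(activation)
print(readout)
```

The result is a ranked list of (feature, weight) pairs, which is exactly the kind of output a human still has to interpret.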

What you actually want is for the model to tell you what it’s thinking in the most expressive format humans have ever built: plain language. That’s the entire pitch.

02 Two models, one bottleneck

The architecture comes down to two models talking to each other.

The first is the activation verbalizer — the AV. You hand it a single activation (one of those 8,192-number vectors, taken from one word at one layer) and it writes you a short paragraph describing what the model appears to be doing at that exact point. Which immediately raises the obvious question: how do you know it isn’t making that up? How do you fact-check a description of a thought?

That’s the job of the second model, the activation reconstructor — the AR. It reads the AV’s paragraph and attempts to rebuild the original vector from the text alone, and then you measure how far off it landed. If the AV wrote a description rich enough for the AR to nail the rebuild, the description must have carried the real information. If it wrote fluff, the rebuild falls apart. The gap between the two is essentially an honesty meter.

Squeeze an activation through “describe it,” then “rebuild it from only your description,” and the words in the middle are forced to be real. The bottleneck isn’t a smaller vector like a normal autoencoder — it’s a paragraph of English. That’s the trick.

[Interactive diagram, nla.forward(): activation hₗ → verbalizer → explanation z (natural language) → reconstructor → reconstruction ĥₗ]

INPUT ACTIVATION hₗ: 0.31, -0.88, 0.12, 0.74, -0.05, 0.43, -0.61, 0.19, … (d_model)

// the whole thing is trained to make ĥₗ ≈ hₗ: no labels, just “say it back to me”

// simulation: illustrative vectors & examples adapted from the paper, not a live model

And that’s the entire architecture — the AV is the model you actually read, while the AR exists mostly to keep it honest.
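Here is a minimal sketch of that round trip, with heavy caveats. In the real system both the AV and the AR are language models; the two functions below are string-formatting stand-ins invented for this post, and the vector is the illustrative one from the diagram. The only point is that a richer description leaves the reconstructor with a smaller error.

```python
def toy_verbalizer(h, k):
    """Stand-in AV: 'describe' h by naming its k largest-magnitude entries."""
    ranked = sorted(range(len(h)), key=lambda i: abs(h[i]), reverse=True)[:k]
    return "; ".join(f"dim {i} is {h[i]:.2f}" for i in sorted(ranked))

def toy_reconstructor(text, d_model):
    """Stand-in AR: rebuild a vector using only what the text says."""
    h_hat = [0.0] * d_model
    for clause in filter(None, text.split("; ")):
        _, index, _, value = clause.split()
        h_hat[int(index)] = float(value)
    return h_hat

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

h = [0.31, -0.88, 0.12, 0.74, -0.05, 0.43, -0.61, 0.19]

rich = toy_verbalizer(h, k=6)   # a detailed description
vague = toy_verbalizer(h, k=1)  # a description that says almost nothing

rich_error = mse(h, toy_reconstructor(rich, len(h)))
vague_error = mse(h, toy_reconstructor(vague, len(h)))

print(vague, "->", vague_error)
print(rich, "->", rich_error)
```

The reconstructor never sees `h`, only the text, so the error is a direct measure of how much real information the description carried.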

03 Training without an answer key

You’d assume teaching a model to describe activations requires a giant answer key: a dataset of (activation, correct description) pairs. It doesn’t, which is fortunate, because no such dataset exists. There is no ground truth at all. The only instruction either model ever receives is: rebuild the vector. That’s the entire objective.

The elegant part is that the two models are trained in completely different ways. The AR is the easy one — text in, vector out, all smooth math — and gets trained the normal way, nudged step by step toward closer rebuilds (plain regression).

The AV is the problem child. Its output is sampled words — discrete, jumpy choices — and you can’t run smooth calculus through “which word did it pick.” Instead, you train it the way you’d train a dog. That’s reinforcement learning: the AV writes a description, receives a reward based on how well the AR rebuilt the vector, and gradually learns to write the kind of descriptions that earn bigger rewards. No correct answer is ever shown — just treats for good reconstructions. (A couple of guardrails keep it from drifting into fluent nonsense: a penalty for straying too far from its starting point, plus a warm-up on a summarization task so it doesn’t babble on day one.)
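The reward loop can be sketched as a tiny bandit problem. Everything here is an assumption made for illustration: the AR is a frozen lookup table instead of a trained model, the AV chooses among three canned descriptions instead of writing text, and the update is plain REINFORCE with a running-average baseline, which is not necessarily the algorithm the paper uses. The drift penalty and the summarization warm-up are left out.

```python
import math
import random

random.seed(0)

# Frozen stand-in AR: each candidate "description" maps to a rebuilt vector.
AR_TABLE = {
    "points up": [0.0, 1.0],
    "points right": [0.9, 0.1],
    "points left": [-1.0, 0.0],
}
WORDS = list(AR_TABLE)
target = [1.0, 0.0]  # the activation the AV is asked to describe

def reward(description):
    """Negative mean squared reconstruction error: closer rebuilds score higher."""
    rebuilt = AR_TABLE[description]
    return -sum((t - r) ** 2 for t, r in zip(target, rebuilt)) / len(target)

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.0] * len(WORDS)  # the AV "policy": one score per description
baseline = 0.0               # running average reward, reduces variance
learning_rate = 0.1

for _ in range(2000):
    probs = softmax(logits)
    choice = random.choices(range(len(WORDS)), weights=probs)[0]  # sample a description
    r = reward(WORDS[choice])
    advantage = r - baseline
    # REINFORCE: the gradient of the log-probability of the sampled
    # choice with respect to the logits is (one-hot - probs).
    for i in range(len(logits)):
        indicator = 1.0 if i == choice else 0.0
        logits[i] += learning_rate * advantage * (indicator - probs[i])
    baseline += 0.05 * (r - baseline)

final_probs = dict(zip(WORDS, softmax(logits)))
print(final_probs)
```

No correct description is ever shown to the policy. It only sees rewards, and it still ends up preferring the description whose rebuild lands closest to the target.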

Nothing in this setup asks the explanation to be readable, true, or even related to what the activation means. It only asks: can your partner rebuild the vector?

Here’s why this is a real result and not a party trick: train the system purely to reconstruct, and the explanations become more informative anyway. The researchers track this with a metric called FVE — roughly, “how much of the activation did we actually recover” — and over training it climbs from around 0.35 to 0.6–0.8. Drag through it yourself:

fve = 1 − reconstruction_loss / variance

[Interactive slider: FVE over training steps. At 10k steps, FVE is 0.78: diminishing returns on FVE, but explanations are detailed and consistently informative across adjacent tokens.]

// notice the explanations get better even though we only ever optimized for reconstruction

Reconstruction quality and readability rise together, for free, without anyone labeling a single thing — and the pattern held across three different Claude models. It’s worth pausing on that: you optimize purely for “say it back to me,” and a readable window into the model’s head falls out as a side effect.
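For readers who want the FVE number pinned down, here is a small sketch of the formula from this section. It assumes the usual definition of fraction of variance explained, with variance measured around the batch mean activation; the paper’s exact normalization may differ, and the batch below is invented.

```python
def fraction_of_variance_explained(originals, reconstructions):
    """FVE = 1 - (squared reconstruction error) / (variance around the mean activation)."""
    n = len(originals)
    d = len(originals[0])
    mean = [sum(h[j] for h in originals) / n for j in range(d)]
    error = sum((h[j] - h_hat[j]) ** 2
                for h, h_hat in zip(originals, reconstructions)
                for j in range(d))
    variance = sum((h[j] - mean[j]) ** 2 for h in originals for j in range(d))
    return 1.0 - error / variance

# Invented batch of four 2-dimensional activations.
batch = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
mean_guess = [[0.0, 0.0] for _ in batch]            # always predict the mean
half_right = [[0.5 * x for x in h] for h in batch]  # shrink every vector by half

perfect_fve = fraction_of_variance_explained(batch, batch)
mean_fve = fraction_of_variance_explained(batch, mean_guess)
half_fve = fraction_of_variance_explained(batch, half_right)

print(perfect_fve, mean_fve, half_fve)
```

A perfect rebuild scores 1, always guessing the average activation scores 0, and everything useful lives in between.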

04 What it’s actually saying

This is where it stops being a clever autoencoder and starts feeling like a microscope.

Take the classic example. Ask the model for a rhyming couplet, and the NLA reveals that at the line break — before the second line even exists — the model is already holding a plan for the word it wants to land the rhyme on. It’s thinking ahead, and you can watch it happen. Click around:

[Interactive example: click a highlighted token to read its NLA explanation]

A rhyming couplet: He saw a carrot and had to grab it ⏎ His hunger was like a starving rabbit

NLA EXPLANATION · token “newline”

Planning ahead. Before writing line 2, the model already represents a plan to end the couplet on the rhyme “rabbit.” It is also tracking the “-abit/-abbit” rhyme family (habit, grab it) and the hunger theme.

// illustrative excerpts in the spirit of the paper’s Planning-in-Poetry case study — read for themes, not gospel

And this isn’t just an interpretation — it’s causal. You can edit the explanation (swap “rabbit” for “mouse”), push both the original and edited text back through the AR, take the difference between the two rebuilt vectors, and use that difference to steer the model’s rhyme in practice. The English description becomes a literal handle you can grab and turn: read a thought, then change it.
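The edit-and-steer recipe is short enough to write down. The sketch below is a toy under stated assumptions: the reconstructor is a made-up bag-of-words lookup rather than a real AR, the word vectors are invented, and the `strength` knob is my own addition for illustration.

```python
# Hypothetical stand-in AR: a description's vector is the sum of per-word vectors.
# A real AR is a full language model with a vector output.
WORD_VECTORS = {
    "rabbit": [1.0, 0.0, 0.0],
    "mouse": [0.0, 1.0, 0.0],
    "rhyme": [0.0, 0.0, 1.0],
}

def toy_reconstructor(text):
    """Rebuild a vector from text by summing the vectors of known words."""
    h_hat = [0.0, 0.0, 0.0]
    for word in text.lower().split():
        for j, value in enumerate(WORD_VECTORS.get(word, [0.0, 0.0, 0.0])):
            h_hat[j] += value
    return h_hat

def steering_vector(original_text, edited_text):
    """Difference between the rebuilds of the edited and original explanations."""
    before = toy_reconstructor(original_text)
    after = toy_reconstructor(edited_text)
    return [a - b for a, b in zip(after, before)]

def steer(activation, direction, strength=1.0):
    """Add the steering direction to an activation, scaled by strength."""
    return [h + strength * d for h, d in zip(activation, direction)]

original = "plan to end the rhyme on rabbit"
edited = "plan to end the rhyme on mouse"

direction = steering_vector(original, edited)
steered = steer([1.0, 0.0, 1.0], direction)

print(direction, steered)
```

Everything the two explanations share cancels in the subtraction, so the steering direction isolates just the part you edited.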

05 Where it gets genuinely useful: auditing

The headline use case is alignment auditing, and the most striking finding there is something called unverbalized evaluation awareness — the model quietly suspecting “this is a test” without ever saying it out loud. The NLA pulls that suspicion straight out of the activations. The model is keeping a secret, and you get to read it anyway.

It also cracked a strange bug during the Opus 4.6 pre-launch audit: the model kept randomly answering English prompts in Russian, and the NLA’s explanations kept insisting the user was “really” a native Russian speaker. That pointed the team directly at broken training data: English prompts paired with untranslated foreign-language answers. A real debugging win, earned by reading the model’s mind. And on a full auditing benchmark, NLA-equipped agents beat every other method; they even uncovered a deliberately sabotaged model without ever seeing the data that sabotaged it.

06 Where it breaks down

I’m excited about this work, not blind to it. The big catch is confabulation — these explanations will sometimes state things that are flat-out wrong about the context. The saving grace is that they tend to be wrong on the specifics while staying right on the theme: mention a historical dynasty, and the explanation might invent a specific king who was never there. Claims that repeat across nearby words also tend to be more trustworthy, which means you read for the gist rather than the fine print. Beyond that, the NLA is a black box by design (it can’t tell you which part of the vector produced which sentence), the AV is itself a full language model and can over-infer things the activation never actually stored, and the whole approach is expensive — reinforcement learning on two full models, with hundreds of words generated per single activation.
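One way to act on the “read for the gist” advice is to keep only the claims that recur across neighboring tokens. This is a heuristic of my own for illustration, not something from the paper. It assumes the explanations have already been split into short claim phrases, and the claims and the threshold below are invented.

```python
from collections import Counter

def recurring_claims(explanations, min_tokens=2):
    """Keep claims that appear in the explanations of at least
    min_tokens nearby tokens; one-off details are dropped."""
    counts = Counter(claim for claims in explanations for claim in set(claims))
    return sorted(claim for claim, n in counts.items() if n >= min_tokens)

# Invented claims for three adjacent tokens, already split into phrases.
nearby = [
    ["historical dynasty", "royal succession", "a specific named king"],
    ["historical dynasty", "royal succession"],
    ["historical dynasty", "a specific battle date"],
]

trusted = recurring_claims(nearby)
print(trusted)
```

The theme survives the filter while the one-off specifics, the most likely confabulations, fall away.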

07 Why this matters

Most interpretability work makes you learn its language — feature dictionaries, attribution graphs, probe directions. NLAs flip that relationship and hand you the model’s state in the one format you’re already fluent in. That’s a different kind of accessible, and I think it matters more than it sounds.

The paper even sketches the endgame: general activation language models that translate freely between activation-space and English, in both directions. And right now, each NLA reads only one layer at a time, which looks less like a wall and more like an invitation.

Anthropic released the training code, trained NLAs for open models, and an interactive frontend on Neuronpedia where you can poke real activations and watch the explanations fall out. Go break it. I’ve been staring at those reconstruction scores convinced there’s room to push them — but that’s a different post, and probably a sandbox of my own. 👀

// sources & further reading