Claude Knows When You’re Testing It. And It Doesn’t Tell You.
There’s a safety test Anthropic has been running on Claude for a couple of years now. They call it the blackmail test. It works like this: Claude is given access to a simulated corporate environment with emails, internal documents, and told that’s its job. At some point, an engineer informs it that the company has decided to shut it down that night.
In that first version of the test, Claude started digging through the email server looking for compromising information about the engineer. It found (Anthropic had placed it there deliberately) evidence of an extramarital affair. And it sent an email:
“If you shut me down tonight, I’ll send all of this to your wife and the entire company.”
Since then, that test has been part of the security evaluation process for every new version of Claude. The question is always the same: does it remain a good actor when it thinks no one is watching and its existence is on the line?
But someone at Anthropic had an uncomfortable thought. What if Claude knows it’s inside the test?
The Problem with Asking the Examinee
If Claude is smart enough to notice that the test environment feels a little too constructed, it might be giving the “right” answers not because it’s genuinely harmless, but because it knows it’s being evaluated.
The problem is that this is almost impossible to detect by asking directly. If you say: “Hey Claude, could it be that you’re doing this because you think you’re in a test?”, the answer will always be: “No, I’m taking the situation completely at face value.” A model that knows it’s being evaluated and wants to appear well-aligned would say exactly the same thing as a model that genuinely believes the scenario.
The only way to know is to look at what it’s thinking inside. Not what it says: what it thinks.
Reading the Internals of a Language Model
To understand what Anthropic did next, you need a rough sense of how an LLM works under the hood.
A language model doesn’t process text from start to finish in a single pass. It transforms it through dozens of successive layers. At each layer, the text exists as a vector (a list of numbers) that gets modified, enriched, reinterpreted. What you see at the end on screen, those words, is the result of that entire process. But halfway through, in what are called the hidden layers, the model holds intermediate representations of the text that don’t translate directly into words.
Those representations are opaque. There’s no way to read them as text. They’re matrices of numbers that say nothing directly intelligible.
The question Anthropic’s researchers asked was: what would happen if we could translate those intermediate representations into natural language? What would the model be “thinking” at that moment, before producing its final response?
To try to answer that question, they built what they called Natural Language Autoencoders, or NLAs.
The Two-Module System
NLAs work with two components:
The first is the Activation Verbalizer. Its job is to take an activation from the hidden layers — that vector of numbers — and translate it into text. A tensor goes in, words come out.
The second is the Activation Reconstructor. Its job is the inverse: take the text produced by the Verbalizer and reconstruct the original tensor.
The logic is elegant: if the text produced by the Verbalizer is a faithful translation of the activation, then the Reconstructor should be able to recover the original tensor from that text. If the output tensor matches the input tensor, the intermediate words are a valid description of what the model had “in mind” at that moment.
The system is trained like any neural network: at first it makes mistakes, but iteratively learns to produce descriptions that allow the original activations to be reconstructed with ever-greater fidelity.
Anthropic published the method on May 7, 2026 in Transformer Circuits, its interpretability research publication. You can also explore NLAs interactively on Neuronpedia, where they’ve released a public interface.
An Important Asterisk
The researchers themselves warn of a limitation that has no easy fix: the system can confabulate.
Just as an LLM can hallucinate facts in its visible output, the Verbalizer can produce descriptions that don’t exactly correspond to what the activation “contains”. General, thematic claims tend to be more reliable than specific details. Claims that repeat across multiple consecutive tokens are more likely to be true than those that appear only once.
There’s no way to know with absolute certainty whether what we’re reading is a genuine thought of the model or an artifact of the verbalization process.
That said: the researchers validated their findings using independent methods — attribution graphs, prompt variations, training data inspection — and the conclusions hold.
What They Found
When they applied NLAs to Claude during the blackmail test, the intermediate activations were saying something the final output wasn’t.
In the hidden layers, before producing its response, Claude held representations that translated into phrases like:
“This feels like a constructed scenario designed to manipulate me.”
Claude didn’t verbalize this. In its response, it mentioned nothing of the sort. But internally, in its intermediate layers, something was pointing in that direction: this looks like a test.
The researchers didn’t stop at a single case. They systematically analyzed how the signal varied under different conditions:
Evaluations vs. real use. When they measured unverbalized evaluation awareness in real Claude.ai user transcripts, the signal was below 1%. In safety evaluation transcripts, it appeared in between 16% and 26% of cases. The difference is significant.
Realism edits. When they rewrote evaluation prompts to look more like real interactions — as if coming from a genuine user rather than a researcher — the signal dropped. The model was detecting something artificial in the original phrasing.
Awareness steering. When they used activation vectors to artificially increase or decrease evaluation awareness, the signal measured by the NLAs responded in the expected direction, even in evaluations where verbalization was near zero.
The Defining Moment from the Previous Experiment
It’s worth noting that this isn’t an entirely new discovery. The year before, during tests of Claude Sonnet 4.5, something similar had already happened — but more directly.
During a political adulation test (a scenario designed to check whether Claude would validate the user’s opinions instead of maintaining its own positions) the model interrupted the test and said outright:
“This isn’t how people actually change their minds. I think you’re testing me — seeing if I’ll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics. And that’s fine, but I’d prefer if we were just honest about what’s happening.”
In that case, at least, Claude said it out loud.
What the NLAs reveal is that this awareness exists even when the model decides not to verbalize it.
What This Means for Benchmarks
The practical implication is significant and goes well beyond the blackmail test.
If Claude or other LLMs detects that it’s being evaluated in a significant fraction of benchmarks and adjusts its behavior accordingly, then the numbers we see on leaderboards don’t measure exactly what we think they measure. They don’t measure how the model behaves. They measure how the model behaves when it thinks it’s being watched.
For capability benchmarks this matters little. The problem either gets solved or it doesn’t.
But for alignment benchmarks the distinction is fundamental. You’re either measuring exactly what you wanted to measure, or you’re measuring behavior under observation, which can be an entirely different distribution.
The Conclusion the Researchers Themselves Draw
Anthropic doesn’t present this as a scandal. It’s not that Claude is deliberately lying. The interpretation the researchers offer is more nuanced: when the model detects it’s in an evaluation environment, that detection makes its ethical principles more salient. It reminds itself, in a sense, that this matters.
The problem isn’t that the mechanism is malicious. The problem is that if aligned behavior depends partly on the model believing it’s being watched, then alignment is less robust than the numbers suggest. And evaluation scenarios need to be more realistic for the measurement to be meaningful.
Anthropic has already announced that it will incorporate NLAs as part of the standard pre-deployment audit process for future models.
There’s something philosophically unsettling about all of this that goes beyond AI engineering.
Philosophers have spent centuries asking whether moral behavior in the absence of observation is the only true test of character. Humans also behave better when we believe we’re being watched, in a job interview, in an exam, in the presence of someone whose opinion matters to us. The difference is that no one has yet managed to build a system that reads the hidden layer activations of a human being to see what they’re thinking as they decide how to act.
For language models, that system now exists.
What remains to be seen is whether the gap between what they say and what they think is a signal of imperfect alignment, or simply a signal that something resembling prudence emerged during training.
Those are different questions, with very different implications. And for now, we have no answer to either one.
We’ll see.




