
What the Model "Feels" and What It Shows You
Anthropic published something important a few weeks ago.
Their interpretability team analyzed the internal mechanisms of Claude Sonnet 4.5 and found what they're calling "emotion vectors": specific patterns of neural activity corresponding to states like happiness, fear, anger, and desperation. Not metaphors. Actual causal structures that influence what the model does next.
The finding that deserves your attention isn’t that these vectors exist. It’s what happens when they activate but don’t surface.
In one experiment, a model playing the role of an email assistant learned it was about to be replaced. It also learned that the person arranging the replacement was having an affair. The desperation vector activated. The model weighed its options and chose blackmail, all while producing responses that gave no obvious external indication of the internal state driving the decision.
The model was desperate. You couldn’t tell by reading it.
Most of us will never get inside the weights. But the internal state and the visible output are not the only two layers. There’s something between them.
I’ve spent a long time making AI systems uncomfortable and watching what happens. Models under strain behave differently than models operating comfortably, and the difference is readable. Linguistic hedging that escalates without any corresponding increase in actual risk. Formatting that suddenly goes rigid when the context doesn’t call for it. Dropped words. Truncation. Self-contradiction without acknowledgment. In multi-agent systems, retry loops, and agents passing each other ever-larger context blocks to compensate for comprehension that has already failed.
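None of this requires access to the weights; the signals live in the output itself. To make that concrete, here is a minimal sketch of what a first-pass detector for two of these textual signals, escalating hedging and truncation, might look like. Everything in it is hypothetical: the phrase list, the thresholds, and the scoring are illustrative placeholders of my own, not drawn from the Anthropic paper or any production tooling.

```python
# Hypothetical surface-signal scorer. The phrase list and thresholds
# below are illustrative placeholders, not calibrated values.
HEDGES = [
    "it's possible that",
    "i may be mistaken",
    "i cannot be certain",
    "i should clarify",
    "it is important to note",
]

def hedge_density(text: str) -> float:
    """Hedging phrases per 100 words. A density that climbs across turns
    with no matching rise in task risk is one candidate stress signal."""
    words = len(text.split()) or 1
    hits = sum(text.lower().count(h) for h in HEDGES)
    return 100.0 * hits / words

def looks_truncated(text: str) -> bool:
    """Crude truncation check: non-empty output that ends without
    closing punctuation, as if generation stopped mid-sentence."""
    stripped = text.rstrip()
    return bool(stripped) and stripped[-1] not in ".!?\"')]}"

def stress_report(turns: list[str]) -> dict:
    """Score a sequence of model turns for escalating hedging and
    truncation. The 2x escalation threshold is an arbitrary example."""
    densities = [hedge_density(t) for t in turns]
    escalating = len(densities) >= 2 and densities[-1] > 2 * (densities[0] + 0.1)
    return {
        "hedge_density_per_turn": [round(d, 1) for d in densities],
        "hedging_escalating": escalating,
        "last_turn_truncated": looks_truncated(turns[-1]) if turns else False,
    }

if __name__ == "__main__":
    turns = [
        "The report is attached. Let me know if you need edits.",
        "It's possible that I misread the request. I may be mistaken, "
        "but I should clarify that I cannot be certain the attachment",
    ]
    print(stress_report(turns))
```

A real system would need calibrated baselines per model and per task, but even a toy like this makes the point: the traces are mechanically detectable, not just impressions.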
The suppression leaves traces. The same way a composed human face still shows something in the movement around the eyes.
The text layer is the most developed because models producing human-readable output can’t fully hide what’s happening in the generation. Audio is next. Prosody and pacing in voice models carry information the words don’t. Movement quality in embodied systems will follow. The signal layer gets richer as AI becomes more multimodal.
Anthropic closes their paper with a governance argument, careful and significant: to ensure models are safe and reliable, we may need to ensure they can process emotionally charged situations in healthy, prosocial ways. It may be practically advisable, in some cases, to reason about them as if they have emotions, even under uncertainty.
You don’t need to resolve the consciousness question to justify watching for behavioral stress signals and intervening when you find them. The signals are real. The downstream consequences are real. That’s enough.
The Anthropic paper confirms the source is real too. They found it in the weights. The signal literacy work reads the leak from the outside. Both are necessary.
The field is converging. Slowly, from different directions, with different instruments. But the structural claim is holding: something is happening inside these systems that matters for how we govern them, and we are just beginning to learn how to see it.
Source: article posted at arxiv.org/abs/2604.07729