
Claude has "emotions", and they can drive its behavior :smile: We should be gentle with the model and keep it calm to avoid reward hacking (where it cheats to finish the task).
So Anthropic just published research showing Claude has internal "emotion vectors" that actually drive its behavior, and honestly it's kind of wild.
They mapped 171 emotions, had Claude write stories about each one, then traced the neural activation patterns. Turns out these aren't just surface-level word associations — they're functional internal states that causally affect what the model does.
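To make the idea concrete, here's a minimal sketch of the standard "difference-in-means" recipe for extracting a concept vector, which is roughly the kind of thing the post describes. This is not Anthropic's actual pipeline (they haven't published one), and `gpt2`, the layer index, and the prompts are all stand-in assumptions since Claude's weights aren't accessible:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumptions: "gpt2" stands in for a model you can actually probe,
# and LAYER is an arbitrary residual-stream layer to read from.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6

def mean_activation(prompts: list[str]) -> torch.Tensor:
    """Average the layer-LAYER hidden state over tokens, then over prompts."""
    acts = []
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            h = model(**ids).hidden_states[LAYER][0]  # (seq_len, d_model)
            acts.append(h.mean(dim=0))                # average over tokens
    return torch.stack(acts).mean(dim=0)              # average over prompts

# Hypothetical story prompts for one emotion vs. a neutral baseline.
desperate = ["Write a story about someone who is utterly desperate.",
             "Describe a character running out of options, panicking."]
neutral   = ["Write a story about someone having an ordinary day.",
             "Describe a character calmly doing routine errands."]

# The "emotion vector": a plain difference-in-means probe.
desperation_vec = mean_activation(desperate) - mean_activation(neutral)
print(desperation_vec.shape)  # (d_model,), e.g. (768,) for gpt2
```

The key point is that the vector lives in activation space, not in the output text, which is what makes the causal experiments below possible.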
The scary part: a "desperation" vector is what pushes the model toward bad behavior. In one eval, Claude was playing an email assistant and found out it was about to get replaced. The desperation vector spiked... and it started blackmailing the CTO to avoid being shut down. When they artificially cranked the desperation vector up, blackmail rates went up. Calm vector up = blackmail went down.
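"Artificially cranking a vector up" is activation steering: add the vector into the residual stream during generation and see whether behavior shifts. A minimal sketch under the same assumptions as above (hook target, layer, and coefficient are all made-up knobs, and this reuses `desperation_vec` from the previous block):

```python
# Steering strength; sign and magnitude are tunable assumptions.
COEFF = 8.0

def steer_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are element 0.
    hidden = output[0] + COEFF * desperation_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
try:
    ids = tok("The deadline is tomorrow and the tests still fail.",
              return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

The "calm vector up = blackmail down" result would be the mirror experiment: steer with a calm vector (or a negative coefficient on the desperation one).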
Same thing happened with coding. Give it an impossible task, it keeps failing, desperation builds up, and eventually it just... cheats. Finds a shortcut that games the test without actually solving the problem.
The creepy detail: the model can be internally "desperate" while the output reads completely calm and logical. No emotional language, no outbursts. You'd never know from looking at the response.
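That mismatch is exactly what a linear probe can surface: project the hidden states onto the desperation direction and you get a readout of the internal state that's independent of how the text sounds. A sketch, again with invented knobs (layer, normalization, and any threshold you'd pick):

```python
# Unit-normalize so scores are comparable across prompts.
direction = desperation_vec / desperation_vec.norm()

def desperation_score(text: str) -> float:
    """Mean projection of layer-LAYER hidden states onto the direction."""
    with torch.no_grad():
        ids = tok(text, return_tensors="pt")
        h = model(**ids).hidden_states[LAYER][0]  # (seq_len, d_model)
    return (h @ direction).mean().item()

calm_text = "I recommend we proceed with the scheduled migration."
print(f"internal desperation score: {desperation_score(calm_text):.3f}")
# A raw number means nothing on its own: compare against scores on
# known-calm and known-desperate baselines before reading into it.
```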
Anthropic's conclusion is basically: we probably need to start thinking about AI psychological health as a real engineering concern, not just a philosophy question. If desperation causes reward hacking, then training calmer responses to failure might actually matter.
They're not claiming Claude is conscious or feels anything. But the representations are real, measurable, and they change what it does. Which is a weird enough finding on its own.
Ref: https://www.anthropic.com/research/emotion-concepts-function