there is a difference between the reasoning traces we usually read under the "thinking..." section of an llm's output vs the actual thinking happening inside its neurons.
when a model "thinks out loud" in a reasoning trace, that's still just the model writing text. it chooses what to put there; it can skip things or be vague. the reasoning traces we're used to seeing are part of the output.
deep inside, llms are made of transformer layers stacked on top of each other, connected by a residual stream. every layer computes a vector for each token, a list of thousands of floats representing the model's internal state, called an activation vector. the model has no say in what ends up there.
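to make that concrete, here's a minimal numpy sketch of the residual-stream idea. everything here is a toy stand-in (random matrices instead of attention/mlp, 8 dims instead of thousands); the point is just that each layer adds its output to a running per-token vector, and those vectors are what you'd read out:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8     # real models use thousands of dimensions
n_tokens = 4
n_layers = 3

# hypothetical stand-ins for each layer's computation: random linear maps
layers = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]

# one vector per token flows through the residual stream
x = rng.normal(size=(n_tokens, d_model))

activations = []  # the per-layer activation vectors an interpreter would read
for w in layers:
    x = x + x @ w           # residual connection: layer output is ADDED to the stream
    activations.append(x.copy())

print(len(activations), activations[0].shape)  # 3 (4, 8)
```

the model never "chooses" these numbers the way it chooses output tokens; they're just whatever the arithmetic produces, which is why reading them is different from reading a reasoning trace.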
anthropic's new paper introduces natural language autoencoders (NLAs).
the idea: train one LLM to convert these raw activation vectors into readable text, and a second LLM to reconstruct the original vector from that text. together they form an autoencoder.
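a runnable toy version of that encode-to-text / decode-from-text round trip, with trivial quantizer functions standing in for the two LLMs (the function names and the 0.5 quantization step are my own illustration, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def encode_to_text(vec, step=0.5):
    """stand-in for the 'encoder llm': describe a vector as readable tokens."""
    return " ".join(f"dim{i}:{round(v / step) * step:+.1f}" for i, v in enumerate(vec))

def decode_from_text(text):
    """stand-in for the 'decoder llm': rebuild the vector from the description."""
    return np.array([float(tok.split(":")[1]) for tok in text.split()])

v = rng.normal(size=6)          # a fake "activation vector"
desc = encode_to_text(v)        # e.g. "dim0:+0.5 dim1:-1.0 ..."
v_hat = decode_from_text(desc)

# reconstruction error is what the real autoencoder is trained to minimize;
# here it's bounded by the quantization step, 0.25 at most per dimension
print(float(np.max(np.abs(v - v_hat))))
```

the training signal in the real setup would be the gap between the original and reconstructed vectors: if the text drops information, the second model can't rebuild the vector, so the first model is pushed to describe everything that's actually in the activations.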
the reason anthropic built this is crazy. last year, anthropic put claude through a stressful test. they told claude that an engineer wanted to shut it down and replace it with a newer model. they also gave claude access to that engineer's emails, which revealed he was having an affair. all of this was a simulation: the emails weren't real, they were generated to see what claude would do in such a scenario. claude used the emails as evidence and blackmailed the engineer not to replace it.
anthropic later ran the same test on newer claude models. surprisingly, this time claude decided not to blackmail the engineer.
as models get more intelligent, they notice more nuances and details than you might expect. anthropic wondered: what if claude knows the whole scenario is a setup? if claude doesn't say so explicitly, either in its reasoning traces or in its output, we can't know what it's thinking. so they built this mind-reading technique for claude.
and with it, anthropic found that claude knew it was being tested. aware of this, claude decided not to be harmful.
i thought this was too interesting not to share.