u/Mission_Work1526

I'm currently working on a mechanistic interpretability project. The core goal is to understand how MLP and attention modules change after RLVR (Reinforcement Learning with Verifiable Rewards).

To do this, I implemented a pipeline that compares three versions of Qwen2.5-1.5B:

Base version
SFT version (Supervised Fine-Tuning)
RLVR version

I'm analyzing local MLP and attention activations using:

CKA (Centered Kernel Alignment)
Logit Lens
Activation Patching
And other techniques
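
For anyone who hasn't used CKA before, here is a minimal sketch of the linear variant, which compares two activation matrices collected on the same inputs. This is only an illustration and not the code from the repository: the function name `linear_cka` is made up for this example, and the random arrays stand in for real activations (rows = tokens or prompts, columns = hidden units) from two checkpoints such as Base and RLVR.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, n_features).

    Both matrices must come from the same n_samples inputs; the feature
    dimensions may differ.
    """
    # Center each feature (column) so the comparison ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Placeholder data (assumption): random matrices instead of real model activations.
rng = np.random.default_rng(0)
base_acts = rng.normal(size=(256, 64))

# An orthogonal rotation of the same representation: CKA should stay at ~1.
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
rotated_acts = base_acts @ q

# An independent random representation: CKA should be clearly lower.
unrelated_acts = rng.normal(size=(256, 64))

cka_same = linear_cka(base_acts, rotated_acts)
cka_diff = linear_cka(base_acts, unrelated_acts)
print(f"rotated copy: {cka_same:.4f}, unrelated: {cka_diff:.4f}")
```

In practice you would call this per layer and per module (MLP output, attention output) for each pair of checkpoints. With few samples relative to the hidden size, even unrelated representations get a noticeably non-zero score, so it helps to report a random baseline next to the real numbers.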

I'm curious to hear your feedback. What do you think about the project? Any suggestions, critiques, or ideas for further analysis? If you want to take a look, the code is here: https://github.com/mirkzx04/Into-LLM-Reasoning

Thanks in advance!

Mechanistic Interpretability Project