
Mechanistic Interpretability Project
I'm currently working on a mechanistic interpretability project. The core goal is to understand how the MLP and attention modules change after RLVR (Reinforcement Learning with Verifiable Rewards).
To do this, I built a pipeline around three versions of Qwen 2.5-1.5B:
- Base version
- SFT version (Supervised Fine-Tuning)
- RLVR version
I'm analyzing local MLP and attention activations using:
- CKA (Centered Kernel Alignment)
- Logit Lens
- Activation Patching
- And other techniques
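
For readers unfamiliar with CKA, here is a minimal sketch of the linear variant, which scores how similar two sets of activations are on the same inputs. It is not taken from the project's repository: the function name `linear_cka`, the array shapes, and the random stand-in data are all assumptions for illustration. In practice the two matrices would be activations from the same layer of two model versions (for example Base and RLVR) collected on the same prompts.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (n_samples, n_features).

    The two matrices must share the sample axis (same inputs, same order),
    but may have different feature dimensions.
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of samples")

    # Center each feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical stand-in data: 256 tokens, 64-dimensional activations.
rng = np.random.default_rng(0)
base_acts = rng.normal(size=(256, 64))

# A rotated and rescaled copy: linear CKA should treat it as identical.
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
rotated_acts = (base_acts @ q) * 3.0

# Independent noise: should score clearly lower.
unrelated_acts = rng.normal(size=(256, 64))

cka_self = linear_cka(base_acts, base_acts)
cka_rotated = linear_cka(base_acts, rotated_acts)
cka_unrelated = linear_cka(base_acts, unrelated_acts)

print(f"self: {cka_self:.3f}  rotated: {cka_rotated:.3f}  unrelated: {cka_unrelated:.3f}")
```

One caveat with this choice: linear CKA is invariant to rotations and uniform rescaling of the feature space, so a high score between Base and RLVR activations says the representations have similar geometry, not that individual neurons are unchanged. Even unrelated data scores above zero when the number of samples is small relative to the feature dimension, so it helps to compare against a random baseline like the one above.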
I'm curious to hear your feedback. What do you think about the project? Any suggestions, critiques, or ideas for further analysis? If you want to take a look, the code is here: https://github.com/mirkzx04/Into-LLM-Reasoning
Thanks in advance!
u/Mission_Work1526 — 23 hours ago