
LLM-guided edits for interpretability - actually going somewhere

been reading into this lately and the gap between mechanistic interpretability and actually useful explainability feels massive. like the neuroscience-style bottom-up analysis stuff is resource heavy and often doesn't tell you much you can actually act on.

but then you've got things like Steerling-8B, which Guide Labs open-sourced earlier this year, where they baked a concept layer directly into the architecture so you can trace tokens back to training data origins without needing post-hoc analysis at all. that feels like a fundamentally different engineering paradigm, and honestly more promising than trying to reverse engineer a model after the fact. (I've put rough toy sketches of the techniques I mention at the bottom of the post.)

one thing worth flagging though - there's a separate thread of work around structured reasoning and CoT prompting showing some pretty significant performance jumps on decision tasks, but that's a different story from what Steerling-8B is doing on the interpretability side, so worth keeping those two things distinct.

the thing I keep coming back to is whether engineering interpretability in from the start means you lose some of the emergent stuff that makes these models actually capable. like there's a real tension there. from what I've seen though, Steerling-8B apparently still discovers novel concepts independently, so maybe that tradeoff isn't as brutal as it sounds.

representation engineering and steering vectors seem to hit a reasonable middle ground, but I'm not sure how well they scale beyond current model sizes.

curious if anyone here has actually worked with activation patching or similar causal intervention methods, and whether the interpretability gains felt meaningful in practice or more like a cleaner illusion.
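to make the concept-layer idea concrete, here's a hand-wavy toy sketch. big caveat: I have no idea how Steerling-8B actually implements this - the module, the concept names, the shapes below are all made up for illustration. the gist is just that the forward pass is forced through a small set of named concept units you can read off directly, instead of digging them out post hoc:

```python
# toy sketch of a "concept layer" baked into the forward pass --
# purely illustrative, NOT Steerling-8B's actual mechanism
import torch
import torch.nn as nn

torch.manual_seed(0)

CONCEPTS = ["negation", "past_tense", "sentiment+", "sentiment-"]  # made up

class ConceptBottleneck(nn.Module):
    def __init__(self, d_in=64, d_out=64):
        super().__init__()
        self.to_concepts = nn.Linear(d_in, len(CONCEPTS))
        self.from_concepts = nn.Linear(len(CONCEPTS), d_out)

    def forward(self, x):
        c = torch.sigmoid(self.to_concepts(x))  # each unit = one named concept
        return self.from_concepts(c), c         # downstream only sees concepts

layer = ConceptBottleneck()
x = torch.randn(1, 64)
out, concept_acts = layer(x)

# the interpretability win: you can just read the concepts off
for name, act in zip(CONCEPTS, concept_acts.squeeze(0).tolist()):
    print(f"{name}: {act:.2f}")
```

the obvious cost is that the bottleneck constrains what the layer can represent, which is exactly the capability tension I was on about above.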
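and here's a minimal steering-vector sketch, since that's the middle ground I mentioned. this is just the standard difference-of-means trick applied to a stand-in module - the activations are random stand-ins, not cached from a real model:

```python
# toy sketch of a steering vector: compute the mean-difference direction
# between contrastive prompts, then add it into the residual stream
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyBlock(nn.Module):
    """stand-in for one transformer layer's residual stream update."""
    def __init__(self, d_model=64):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                nn.Linear(d_model, d_model))

    def forward(self, x):
        return x + self.ff(x)

block = TinyBlock()

# pretend these are cached hidden states from contrastive prompt pairs
# ("polite" vs "rude", etc.) -- in practice you'd record real activations
pos_acts = torch.randn(32, 64)  # activations on "concept present" prompts
neg_acts = torch.randn(32, 64)  # activations on "concept absent" prompts

# the steering vector is just the normalized difference of means
steer = pos_acts.mean(0) - neg_acts.mean(0)
steer = steer / steer.norm()

def steered_forward(x, alpha=4.0):
    """add the steering vector into the residual stream mid-forward."""
    return block(x) + alpha * steer  # alpha controls steering strength

x = torch.randn(1, 64)
print(torch.dist(block(x), steered_forward(x)))  # nonzero -> output shifted
```

alpha is the knob everyone fights over: too small and nothing changes, too big and you wreck fluency, which is part of why I wonder about scaling.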
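finally, activation patching in its simplest possible form, for anyone who hasn't seen it: run the model on a clean input, cache an intermediate activation, splice it into a run on a corrupted input, and check whether the output recovers. toy two-layer model, made-up shapes:

```python
# minimal activation patching on a toy MLP -- illustrative only,
# not any specific interpretability library's API
import torch
import torch.nn as nn

torch.manual_seed(0)

class TwoLayer(nn.Module):
    def __init__(self, d=16):
        super().__init__()
        self.l1 = nn.Linear(d, d)
        self.l2 = nn.Linear(d, 2)  # pretend binary "decision" head

    def forward(self, x, patch=None):
        h = torch.relu(self.l1(x))
        if patch is not None:
            h = patch               # causal intervention: swap in cached acts
        return self.l2(h)

model = TwoLayer()
clean = torch.randn(1, 16)
corrupt = torch.randn(1, 16)

with torch.no_grad():
    h_clean = torch.relu(model.l1(clean))   # cache from the clean run
    base = model(corrupt)                   # corrupted baseline
    patched = model(corrupt, patch=h_clean) # corrupted input, clean acts

# if patching moves the output back toward clean, that site "matters"
print((patched - base).abs().max())
```

would genuinely like to hear whether this kind of thing held up for anyone on real models, or whether the localization story fell apart at scale.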

u/viliban — 4 days ago