Mechanistic Interpretability: Peeking Inside an LLM

🔍 What is an LLM really thinking? Mechanistic interpretability tries to answer that.

This field goes beyond asking “why did the model give this answer?” and seeks to understand the exact mechanism inside the neural network: which neurons activate, how information flows between layers, and what knowledge the model stores.

🧠 Key concepts:

Residual Stream: The hidden state vector that flows through all transformer layers. Each layer reads from the stream and adds its output back to it; after the final layer, the stream is “unembedded” (projected onto the vocabulary) to predict the next token.
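
To make the residual stream concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the three-dimensional stream, the two hand-picked layer matrices, the identity unembedding, and the tiny vocabulary. A real transformer layer also contains attention, MLPs, and normalization, which are left out here.

```python
# Toy residual stream: each "layer" reads the stream and adds its output back.
# All sizes, weights, and the vocabulary are hypothetical, chosen for illustration.

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

VOCAB = ["Paris", "London", "Rome"]

# Hypothetical per-layer weights; each layer is a single linear map here.
LAYERS = [
    [[0.0, 0.5, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.5, 0.0, 0.0]],
]

# Hypothetical unembedding matrix: one row per vocabulary token.
W_U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def forward(embedding):
    """Return the final logits and the stream as it looked after each step."""
    stream = list(embedding)
    history = [stream]
    for weights in LAYERS:
        # Residual update: the layer's output is added, not substituted.
        stream = add(stream, matvec(weights, stream))
        history.append(stream)
    logits = matvec(W_U, stream)  # "unembed" the final stream
    return logits, history

logits, history = forward([0.25, 1.0, 0.0])
predicted = VOCAB[max(range(len(logits)), key=logits.__getitem__)]
for step, stream in enumerate(history):
    print(step, stream)
print(predicted)
```

Note that earlier information is never overwritten: each step only adds to what is already in the stream, which is why later layers (and the unembedding) can still read what earlier layers wrote.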

Attention Heads: Many attention heads appear to have a “specialization.” Some heads track subjects, others verbs, others positions. Each head can be studied individually.
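
Studying a head individually usually starts with its attention pattern: for each position, how much weight it puts on every earlier position. The sketch below computes that pattern for a single causal head in plain Python. The tokens and the query and key vectors are hypothetical values picked so that the last token attends mostly to the middle one; in practice these vectors come from a trained model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def head_attention_pattern(queries, keys):
    """Causal attention weights for one head.

    Row i holds how much position i attends to positions 0..i;
    later positions are masked out and get weight 0.
    """
    d_head = len(keys[0])
    pattern = []
    for i, query in enumerate(queries):
        scores = [dot(query, key) / math.sqrt(d_head) for key in keys[: i + 1]]
        weights = softmax(scores)
        pattern.append(weights + [0.0] * (len(keys) - i - 1))
    return pattern

# Hypothetical tokens and per-token query/key vectors for one head.
TOKENS = ["The", "cat", "sat"]
QUERIES = [[0.0, 0.0], [0.0, 0.0], [4.0, 0.0]]
KEYS = [[0.0, 0.0], [4.0, 0.0], [0.0, 0.0]]

pattern = head_attention_pattern(QUERIES, KEYS)
for token, row in zip(TOKENS, pattern):
    strongest = TOKENS[max(range(len(row)), key=row.__getitem__)]
    print(f"{token!r} attends most to {strongest!r}")
```

Reading the rows of such a pattern across many inputs is one simple way to form a hypothesis about what a head specializes in; the hypothesis still has to be tested with causal methods.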

Analysis techniques:

🔬 Activation patching: run the model on two inputs, replace activations in one run with those from the other, and see which replacements change the prediction
📊 Logit lens: unembed the residual stream at each intermediate layer to see which token the model would predict at that point
🗺️ Circuits: identify subgraphs of the network responsible for a specific capability
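
The first two techniques above can be sketched on a toy model in plain Python. All of it is assumed for illustration: the two-token vocabulary, the three-dimensional stream, the hand-written weights, and the helper names `run`, `logit_lens`, and `patching_effects`. In this toy, dimension 2 of the stream plays the role of a "France" feature that the second layer converts into evidence for "Paris".

```python
# Toy model for logit lens and activation patching (all weights hypothetical).

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

TOKENS = ["Paris", "Rome"]
LAYERS = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # amplifies dim 2
    [[0.0, 0.0, 2.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # copies dim 2 into dim 0
]
W_U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # unembedding: one row per token

def run(embedding, patch=None):
    """Run the model; `patch` is an optional (layer, dim, value) override
    applied to the stream right after that layer."""
    stream = list(embedding)
    cache = []
    for layer, weights in enumerate(LAYERS):
        stream = add(stream, matvec(weights, stream))
        if patch is not None and patch[0] == layer:
            _, dim, value = patch
            stream[dim] = value
        cache.append(list(stream))
    return matvec(W_U, stream), cache

def logit_lens(cache):
    """Unembed the stream after every layer and return the top token each time."""
    tops = []
    for stream in cache:
        logits = matvec(W_U, stream)
        tops.append(TOKENS[max(range(len(logits)), key=logits.__getitem__)])
    return tops

def patching_effects(clean, corrupted, layer):
    """Patch one clean activation at a time into the corrupted run and
    return the resulting Paris-minus-Rome logit difference per dimension."""
    _, clean_cache = run(clean)
    effects = []
    for dim in range(len(clean)):
        logits, _ = run(corrupted, patch=(layer, dim, clean_cache[layer][dim]))
        effects.append(logits[0] - logits[1])
    return effects

CLEAN = [0.0, 0.5, 1.0]      # contains the "France" feature in dim 2
CORRUPTED = [0.0, 0.5, 0.0]  # same input with that feature removed

print(logit_lens(run(CLEAN)[1]))               # ['Rome', 'Paris']
print(patching_effects(CLEAN, CORRUPTED, 0))   # [-0.5, -0.5, 3.5]
print(patching_effects(CLEAN, CORRUPTED, 1))   # [3.5, -0.5, -0.5]
```

The logit lens shows the prediction flipping to "Paris" only after the second layer. Patching shows where the decisive information lives: after layer 0 only dimension 2 restores the clean answer, while after layer 1 it is dimension 0. Tracing which components carry an effect from layer to layer in this way is the starting point for identifying a circuit.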

What is it useful for? Detecting “hidden knowledge,” understanding whether LLM cognitive capabilities are real or superficial, and improving reliability in critical applications.

💡 Explanation in a nutshell

When an LLM answers “Paris” to “what is the capital of France?”, mechanistic interpretability asks: which exact part of the network activated that geographical knowledge? It’s like doing an MRI scan, but for an AI model.

Mechanistic Interpretability: Peeking Inside an LLM | Towards Data Science

Are the human-like cognitive abilities of LLMs real or fake? How does information travel through the neural network? Is there hidden …

towardsdatascience.com ↗

Also published on LinkedIn.

Author

Juan Pedro Bretti Mandarano
