SF AI Reading Group — February

Aggregate Space Gallery · San Francisco, CA · 34 attendees

Our February reading group tackled a question that's been nagging at the edges of the field: what does it even mean to understand a neural network?

What We Read

The primary text was the Anthropic mechanistic interpretability paper on superposition — the phenomenon where networks seem to represent more features than they have neurons, by cleverly overlapping representations in high-dimensional space.
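If superposition is new to you, a toy sketch makes the idea concrete. This is our own illustration in plain numpy, not the paper's actual construction (the dimensions and sparsity level below are arbitrary choices): assign a layer of d "neurons" many more than d random feature directions, and check how well a sparse combination of features can still be read back out.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 128, 512, 4  # 128 "neurons", 512 features, 4 features active at once

# Each feature gets a random unit direction in activation space.
# Random directions in high dimensions are nearly (not exactly) orthogonal.
F = rng.normal(size=(n, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# Pairwise overlaps measure how much features interfere with each other.
interference = F @ F.T
np.fill_diagonal(interference, 0.0)
print(f"typical overlap between feature pairs: {np.abs(interference).mean():.3f}")

# Activate k random features and superpose them into one activation vector.
active = rng.choice(n, size=k, replace=False)
x = F[active].sum(axis=0)

# Read each feature back by projecting onto its direction. Active features
# score near 1; inactive ones cluster near 0, though interference adds noise.
scores = F @ x
print(f"active feature scores:    {np.round(scores[active], 2)}")
inactive = np.delete(scores, active)
print(f"largest inactive |score|: {np.abs(inactive).max():.2f}")
```

The trick only works because the activations are sparse: with few features active at once, the interference terms stay small enough to tell active features from inactive ones, which is why the paper's results depend so heavily on feature sparsity.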

We also looked at a shorter piece on the difference between behavioral and mechanistic explanations, which ended up generating most of the discussion.

The Central Debate

The room divided roughly into two camps:

The pragmatists argued that mechanistic interpretability is valuable but overambitious. If we can predict model behavior reliably, do we need to understand the mechanism? Airplanes fly; not everyone needs to understand aerodynamics.

The epistemics-first camp pushed back: behavioral understanding is fragile. If we don't know why a model does something, we can't predict when it will fail in novel contexts. The history of deep learning is littered with brittle heuristics mistaken for robust capabilities.

The most interesting moment: someone pointed out that we don't fully understand how human cognition works either, and we still manage to build reliable institutions around human behavior. Maybe some interpretability gap is acceptable.

Where We Landed

No consensus — which is what you want from a good reading group. The most useful reframe was distinguishing "understanding for safety" from "understanding for science." These might require different standards and methods.

Next Month

We're reading two pieces on training dynamics and phase transitions. If you want to join, the waitlist is at the link below — we keep the group at ~35 to maintain discussion quality.