[Proposal] Are circuits universal? Investigating IOI across many GPT-2 small checkpoints
Universal features. Work such as the Platonic Representation Hypothesis suggests that sufficiently capable models converge to the same representations of the data. To me, this indicates that models come to agree on the underlying “entities” that make up reality.
Non-universal circuits. There are many different algorithms that could correctly solve the same problem. Prior work such as The Clock and the Pizza indicates that, even for very simple tasks, models can learn very different algorithms depending on hyperparameters such as the “attention rate”.
Circuit universality is a crux. If circuits are mostly model-specific rather than universal, the near-term impact of mechanistic interpretability (MI) is much lower, since finding a circuit in one model tells us very little about what a slightly different model is doing.
Concrete experiment: Evaluating the universality of IOI. Gurnee et al. train several GPT-2 small checkpoints from scratch. We know from prior work that GPT-2 small has an IOI circuit. Which components of this circuit, if any, turn out to be universal? Maybe we always observe induction heads. But do we always observe name-mover and S-inhibition heads? If so, are they always at the same layers? Etc. I think this experiment would tell us a lot about circuit universality.
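To give a flavor of the experiment, here is a minimal sketch of one piece of it: scanning every attention head in each checkpoint for a name-mover-like signature (the END position attending strongly to the indirect-object name on an IOI prompt). It assumes the seed checkpoints from Gurnee et al. are loadable in TransformerLens under names like "stanford-gpt2-small-a"; those names, the single example prompt, and the 0.3 attention threshold are illustrative assumptions, not a definitive methodology.

```python
# Sketch: look for candidate name-mover heads (END token attending to the
# indirect-object name) across several GPT-2 small checkpoints.
# Assumption: the seed checkpoints are available in TransformerLens under
# names like "stanford-gpt2-small-a"; the 0.3 threshold is arbitrary.
from transformer_lens import HookedTransformer

prompt = "When Mary and John went to the store, John gave a drink to"
io_token = " Mary"  # the indirect object that name-mover heads should copy

checkpoints = ["gpt2", "stanford-gpt2-small-a", "stanford-gpt2-small-b"]

for name in checkpoints:
    model = HookedTransformer.from_pretrained(name)
    tokens = model.to_tokens(prompt)
    io_pos = (tokens[0] == model.to_single_token(io_token)).nonzero()[0].item()
    end_pos = tokens.shape[1] - 1

    _, cache = model.run_with_cache(tokens)

    candidates = []
    for layer in range(model.cfg.n_layers):
        # attention pattern has shape [batch, head, query_pos, key_pos]
        pattern = cache["pattern", layer][0]
        attn_to_io = pattern[:, end_pos, io_pos]  # per-head attention END -> IO
        for head in range(model.cfg.n_heads):
            if attn_to_io[head] > 0.3:
                candidates.append((layer, head, round(attn_to_io[head].item(), 2)))

    print(name, "candidate name-mover heads:", candidates)
```

A fuller version would average over a batch of IOI prompts and use causal interventions (e.g. path patching on the logit difference) rather than raw attention, but even this crude scan would show whether name-mover-like heads appear in every seed and whether they sit at the same layers.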