[Note] On illusions in mechanistic interpretability
We thought SoLU solved superposition, but not really.
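For reference, SoLU itself is just `x * softmax(x)` (Elhage et al., 2022), which pushes each MLP layer toward "winner-take-most" activations. Below is a minimal sketch of the activation; note the paper applies a LayerNorm right after it, which is part of why superposition can sneak back in.

```python
import torch

def solu(x: torch.Tensor) -> torch.Tensor:
    """SoLU activation: x * softmax(x).

    The softmax amplifies the largest pre-activation and suppresses the
    rest, which was hoped to discourage superposition by making neurons
    more monosemantic. (The paper follows this with a LayerNorm, which
    later analysis suggested lets superposition re-emerge.)
    """
    return x * torch.softmax(x, dim=-1)

# Example: the strongest pre-activation dominates the output.
x = torch.tensor([4.0, 1.0, 0.5, -2.0])
print(solu(x))  # first element stays large, the rest are squashed toward 0
```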
ROME seemed like a very cool approach but turned out to have a lot of flaws. Firstly, localization does not necessarily inform editing. Secondly, editing can induce side effects (thanks Arthur!).
We originally thought OthelloGPT had nonlinear representations but they turned out to be linear. This highlights that the features used in the model’s ontology do not necessarily map to what humans would intuitively use.
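The resolution was a re-framing: the board state is linearly decodable once squares are labeled relative to the current player ("mine" vs. "theirs") instead of absolutely ("black" vs. "white"). Here is a minimal linear-probe sketch of that idea; the tensors are random placeholders, not the real OthelloGPT pipeline.

```python
import torch
import torch.nn as nn

# Hypothetical data: `resid` stands in for residual-stream activations at
# some layer; `board_labels` holds per-square states encoded relative to
# the *current player* (0 = empty, 1 = mine, 2 = theirs) -- the framing
# under which the representation turns out to be linear.
n_samples, d_model, n_squares, n_states = 10_000, 512, 64, 3
resid = torch.randn(n_samples, d_model)                  # placeholder activations
board_labels = torch.randint(0, n_states, (n_samples, n_squares))

probe = nn.Linear(d_model, n_squares * n_states)         # one linear map, no MLP
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(100):
    logits = probe(resid).view(n_samples, n_squares, n_states)
    loss = nn.functional.cross_entropy(
        logits.flatten(0, 1), board_labels.flatten()
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
# High accuracy for a probe like this on real activations (not this random
# placeholder data) is what the "linear after all" claim refers to.
```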
Max activating examples have been shown to give misleading interpretations of neurons / directions in BERT.
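The pitfall is that the top of the activation distribution can tell a clean story ("this is a French neuron") while the neuron behaves quite differently over the bulk of its range, as Bolukbasi et al. found for BERT. A minimal sketch of how such examples are typically collected, assuming you have already run the model to get one activation score per example:

```python
import torch

def top_activating_examples(acts: torch.Tensor, texts: list[str], k: int = 10):
    """Return the k dataset examples on which a neuron fires hardest.

    `acts` is a (n_examples,) tensor of one neuron's max activation per
    example (hypothetical; gather it however you run your model). Only
    inspecting these tail examples is exactly what produces the
    'interpretability illusion'.
    """
    top = torch.topk(acts, k)
    return [(texts[i], acts[i].item()) for i in top.indices.tolist()]

# Sanity check on toy data:
acts = torch.randn(1000)
texts = [f"example {i}" for i in range(1000)]
for text, a in top_activating_examples(acts, texts, k=3):
    print(f"{a:+.2f}  {text}")
```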
I would say a better reference for the limitations of ROME is this paper: https://aclanthology.org/2023.findings-acl.733
Short explanation (Neel's summary): editing in the Rome fact will also make slightly related prompts, e.g. “The Louvre is cool. Obama was born in” …, be completed with ” Rome” too.
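To make that concrete, here is a minimal sketch of the kind of specificity ("loudness") probe behind that summary. The `apply_edit` helper is a hypothetical placeholder for whatever editor you use (ROME or otherwise); only the Hugging Face loading and generation calls are real APIs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# After editing the "Eiffel Tower is in Rome" fact into the model, probe
# prompts that merely *mention* related concepts. If they also start
# completing with " Rome", the edit is loud: it bleeds past the target fact.
probes = [
    "The Louvre is cool. Obama was born in",  # unrelated subject, related context
    "The Eiffel Tower is located in",         # the targeted fact itself
    "Barack Obama was born in",               # control: should be unchanged
]

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
# model = apply_edit(model, subject="Eiffel Tower", new_object="Rome")  # hypothetical

for prompt in probes:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=3, do_sample=False)
    print(repr(prompt), "->", repr(tok.decode(out[0, ids.shape[1]:])))
```

Running this before and after the edit, and diffing the completions on the unrelated prompts, is the basic check that the ACL Findings paper above argues ROME fails.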