Thank you, I’ve been hoping someone would write this disclaimer post.
I’d add another possible explanation for polysemanticity, which is that the model might be thinking in a limited number of linearly represented concepts, but those concepts need not map onto concepts humans are already familiar with. At least not all of them.
Just because the simple meaning of a direction doesn’t jump out at an interp researcher when they look at a couple of activating dataset examples doesn’t mean it doesn’t have one. Humans probably wouldn’t even always recognise the concepts other humans think in on sight.
Imagine a researcher who hasn’t studied thermodynamics much looking at a direction in a model that tracks the estimated entropy of a thermodynamic system it’s monitoring: ‘It seems to sort of activate more when the system is warmer. But that’s not all it’s doing. Sometimes it also goes up when two separated pockets of different gases mix together, for example. Must be polysemantic.’
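(To make the toy example concrete: here’s a minimal sketch, assuming the standard ideal-gas entropy and entropy-of-mixing formulas as the illustration, showing that one and the same scalar rises both when the system is heated and when two gases are allowed to mix. The function names and numbers are purely illustrative, not anything from the post.)

```python
import numpy as np

R = 8.314  # gas constant, J/(mol*K)

def ideal_gas_entropy(n_mol, T, V, Cv=1.5 * R):
    """Entropy of an ideal gas relative to an arbitrary reference state:
    S = n*Cv*ln(T) + n*R*ln(V) (constant offset dropped, since we only compare changes)."""
    return n_mol * Cv * np.log(T) + n_mol * R * np.log(V)

def mixing_entropy(n_mol, fractions):
    """Entropy gained when ideal gases at the same T and P are allowed to mix:
    dS = -n*R * sum_i x_i * ln(x_i)."""
    x = np.asarray(fractions)
    return -n_mol * R * np.sum(x * np.log(x))

# Case 1: heat one mole of gas from 300 K to 400 K at fixed volume.
dS_heating = ideal_gas_entropy(1.0, 400.0, 1.0) - ideal_gas_entropy(1.0, 300.0, 1.0)

# Case 2: let two separated half-moles of different gases mix at constant T and P.
dS_mixing = mixing_entropy(1.0, [0.5, 0.5])

print(f"entropy change from heating: {dS_heating:.2f} J/K")  # ~3.6 J/K
print(f"entropy change from mixing:  {dS_mixing:.2f} J/K")   # ~5.8 J/K
# Both are positive: a single scalar ("entropy") goes up in two superficially
# unrelated situations, which is exactly what would make a direction tracking it
# look "polysemantic" to someone who doesn't already have the concept.
```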
Thanks!
I was grouping that with “the computation may require mixing together ‘natural’ concepts” in my head. After all, entropy isn’t an observable in the environment, it’s something you derive to better model the environment. But I agree that “the concept may not be one you understand” seems more central.