The central challenge of ML interpretability is to faithfully and robustly translate the internal concepts of neural nets into human concepts (or vice versa). But today, we don’t have a precise understanding of what “human concepts” are. Semantics gives us an angle on that question: it’s centrally about what kind of mental content (i.e. concepts) can be interoperable (i.e. translatable) across minds.
It seems to me like there’s an important omission here: we also don’t understand what we really want to point at when we say “the internal concepts of neural nets”.
One might say that understanding “human concepts” is the more central difficulty here, because the human concepts are what we’re trying to translate into.
However, we also need to understand what we’re translating out of. For example, we might find a translation from NN activations to human concepts which is highly satisfying by some metric, but which fails to uncover deceptive cognition within the NN. One idea for how to avoid this: ignoring content which we do not know how to translate into human concepts needs to count as a failure, rather than a success. Notice how this requires a notion of ‘content’ which we are trying to translate.
We can perhaps understand this as a ‘strategy-stealing’ requirement: to fully understand the content of an NN means to be able to replicate all of its capabilities using the translated content (importantly, including hidden capabilities which we don’t see on our test data).
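To make the strategy-stealing requirement slightly more concrete, here is a minimal sketch of the kind of check it suggests. Everything named here (`translate`, `reconstruct`, `original_model`, the input sets) is a hypothetical stand-in, not an existing method: the idea is only that a translation counts as a full account of the network’s content if behavior rebuilt from the translated content alone matches the original network, including on inputs chosen adversarially rather than drawn from the ordinary test distribution.

```python
from typing import Any, Callable, Iterable

# Hypothetical types: an "explanation" is whatever human-legible content the
# interpretability method extracts from the network's internals.
Explanation = Any
Input = Any
Output = Any


def strategy_stealing_check(
    original_model: Callable[[Input], Output],
    translate: Callable[[Input], Explanation],     # NN internals -> human concepts
    reconstruct: Callable[[Explanation], Output],  # acts ONLY on the translated content
    ordinary_inputs: Iterable[Input],
    adversarial_inputs: Iterable[Input],
) -> bool:
    """Return True only if behavior rebuilt from the translation matches the
    original model everywhere we can probe, not just on the ordinary test set.

    Including adversarial_inputs is the point: a translation that quietly
    drops untranslatable content (e.g. deceptive cognition that only shows up
    off-distribution) should fail this check rather than pass it.
    """
    for x in list(ordinary_inputs) + list(adversarial_inputs):
        if reconstruct(translate(x)) != original_model(x):
            return False
    return True
```

Of course, in practice we cannot enumerate the inputs that would expose hidden capabilities; the sketch just makes explicit that the requirement is about replicating all of the network’s behavior from the translated content, not about scoring well on a fixed evaluation set.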