But why do we care more about statistical relationships between physical humans and dogs than statistical relationships between the word “human” and the word “dog” as characters on your screen?
An overly-cute but not completely wrong answer: because I care about whether AI kills all the physical humans, not whether something somewhere writes the string “kill all the humans”. My terminal-ish values are mostly over the physical stuff.
I think the point you’re trying to make is roughly “well, it’s all pretty entangled with the physical stuff anyway, so why favor one medium over another? Instrumentally, either suffices.” And the point I’m trying to make in response is “it matters a lot how complicated the relationship is between the medium and the physical stuff, because terminally it’s the physical stuff we care about, so instrumentally, stuff that’s more simply related to the physical stuff is a lot more useful to understand.”
An overly-cute but not completely wrong answer: because I care about whether AI kills all the physical humans, not whether something somewhere writes the string “kill all the humans”. My terminal-ish values are mostly over the physical stuff.
I don’t understand this argument. Interpretability is not currently trying to look at AIs to determine whether they will kill us. That’s way too advanced for where we’re at. We’re more at the stage of asking questions like “Is the learning coefficient of a network composed of n independent superposed circuits equal to the learning coefficients of the individual n circuits summed, or greater?”
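To pin down what that kind of question looks like on paper (a sketch in my own notation, and assuming “learning coefficient” here means the local learning coefficient $\lambda$ from singular learning theory): writing $w^*$ for a parameter implementing all $n$ circuits together in superposition, and $w_i^*$ for a parameter implementing the $i$-th circuit on its own, the question is whether

$$\lambda(w^*) \;=\; \sum_{i=1}^{n} \lambda(w_i^*) \qquad \text{or} \qquad \lambda(w^*) \;>\; \sum_{i=1}^{n} \lambda(w_i^*).$$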
The laws of why and when neural networks learn to be modular, why and when they learn to do activation error-correction, what the locality of updating algorithms prevents them from learning, how they do inductive inference in-context, how their low-level algorithms correspond to something we would recognise as cognition, etc., are presumably fairly general, and look more or less the same whether or not the network is trained on a domain that very directly relates to physical stuff.
Interpretability is not currently trying to look at AIs to determine whether they will kill us. That’s way too advanced for where we’re at.
Right, and that’s a problem. There’s this big qualitative gap between the kinds of questions interp is even trying to address today, and the kinds of questions it needs to address. It’s the gap between talking about stuff inside the net, and talking about stuff in the environment (which the stuff inside the net represents).
And I think the focus on LLMs is largely to blame for that gap seeming “way too advanced for where we’re at”. I expect it’s much easier to cross if we focus on image models instead.
(And to be clear, even after crossing the internals/environment gap, there will still be a long way to go before we’re ready to ask about e.g. whether an AI will kill us. But the internals/environment gap is the main qualitative barrier I know of; after that it should be more a matter of iteration and ordinary science.)