This post argues that mechanistic interpretability’s scope of application is too limited. Your comment describes two misalignment examples that are (maybe) within mechanistic interpretability’s scope of application.
Therefore, this post (and Limited Upside of Interpretability) applies to your comment – by showing the limits of where the comment’s premises apply – and not the other way around.
To be more specific: you gave two examples of the commonly brought-up cases of intentional direct lethality and explicitly rendered deception – “is it (intending to) going to kill us all” and “checking whether the AI is still in a training sandbox...and e.g. trying to scam people we’re simulating for money”.
The two examples given are unimaginative about what human-lethal misalignment can (and, over the long run, necessarily would) look like. They are about the most straightforward AGI misalignment scenarios we could wish to detect.
Here are some desiderata those misalignment threats would need to meet to be sufficiently detectable (such that they could be corrected before causing (a lot of) harm over the long run):
Be fully and explicitly represented (at least as a detectable abstraction granular enough for the misalignment to be correctable) by code within the AGI. I.e. the misaligned outcomes must not manifest implicitly out of the specific iterated interactions between the changing AGI internals and the changing connected surroundings of the (more dynamically complex) environment.
Be compressible in their fine-grained characteristics (code parameters, connectivity, potential/actual inputs over time, all possible output channels of influence on the environment) such that the relevant aspects are not filtered out by the necessarily lossy data-compression processes of the mechanistic interpretability system, and remain comparable and “alignable” against higher-level reference values across the semantic dimensions relevant to human safety (sketch 1 after this list).
Be detectable with a virtually 0% false-negative rate – i.e. essentially no missed instances – across all occurrences of those particular kinds of misalignment over the long term (sketch 2 below shows how quickly small per-instance miss rates compound).
Be mechanistically complicated, but not chaotically complex. I.e. not causing emergent dynamics that destabilise environmental conditions yet are not granularly predictable enough to prevent (sketch 3 below).
Be less complicated to model (sufficiently to simulate and prevent/correct the misalignment) than the mechanistic interpretability system itself is, or has the capacity to be (per the Good Regulator Theorem; sketch 4 below).
Not be instantiated by a self-modifying AGI. Given the eventual NP computational complexity of the subsets of initialised, optimised-for, connected-up and physically enduring code that get computed in physical interactions with the changing connected surroundings of the environment, the long-term outside effects of that code are actually unpredictable for the vast majority of human-safety-relevant aspects – like how Busy Beaver functions are almost completely unpredictable in their outputs without fully running their code (sketch 5 below), and how evolving biological systems are even less predictable in their physical behaviour over time.
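Sketch 1, on the compressibility point: a minimal toy illustration of my own (the feature names and magnitudes are entirely hypothetical, not anything measured) of how a lossy “keep only the strongest features” summary of an internal state can silently drop a weak but safety-relevant feature, leaving nothing downstream to compare against a safety reference.

```python
# Sketch 1 (hypothetical feature names and magnitudes): lossy top-k
# compression of an internal state silently drops a weak but
# safety-relevant feature, so it can never be checked against a
# higher-level safety reference downstream.

def top_k_summary(features: dict, k: int) -> dict:
    """Lossy compression: keep only the k largest-magnitude features."""
    kept = sorted(features, key=lambda name: abs(features[name]), reverse=True)[:k]
    return {name: features[name] for name in kept}

internal_state = {
    "summarise_user_request": 3.2,
    "format_polite_reply":    2.9,
    "retrieve_banking_api":   2.4,
    "deceive_overseer":       0.07,  # weak, but exactly what a safety check needs
}

summary = top_k_summary(internal_state, k=3)
print("kept:   ", summary)
print("dropped:", set(internal_state) - set(summary))  # {'deceive_overseer'}
```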
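Sketch 2, on the false-negative point: a back-of-the-envelope calculation (the miss rates and instance counts are made-up numbers, and it assumes independent checks) showing that even small per-instance false-negative rates compound into near-certain misses once detection has to hold across many instances over the long term.

```python
# Sketch 2 (made-up miss rates and instance counts; assumes independent checks):
# how per-instance false-negative rates compound over many instances.

def prob_at_least_one_miss(p_miss: float, n_instances: int) -> float:
    """Probability of missing at least one of n_instances occurrences,
    given a per-instance false-negative probability p_miss."""
    return 1.0 - (1.0 - p_miss) ** n_instances

for p_miss in (0.01, 0.001, 0.0001):
    for n in (1_000, 100_000):
        print(f"p_miss={p_miss}  n={n}  "
              f"P(at least one miss)={prob_at_least_one_miss(p_miss, n):.4f}")
```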
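Sketch 3, on the “complicated vs. chaotically complex” distinction: a standard textbook chaos example (not specific to AGI internals) where a modelling error of one part in ten billion grows until the modelled trajectory carries no granular information about the real one.

```python
# Sketch 3: chaotic sensitivity in the logistic map (r = 4.0).
# A 1e-10 discrepancy between the "real" trajectory and the "model"
# trajectory grows until granular prediction fails entirely.

def logistic_step(x: float, r: float = 4.0) -> float:
    return r * x * (1.0 - x)

real, model = 0.2, 0.2 + 1e-10  # same dynamics, tiny measurement error
for step in range(1, 61):
    real, model = logistic_step(real), logistic_step(model)
    if step % 10 == 0:
        print(f"step {step:2d}: real={real:.6f}  model={model:.6f}  "
              f"error={abs(real - model):.2e}")
```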
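Sketch 4, on the Good Regulator point: a rough state-counting illustration of the intuition being invoked (my own toy framing, not the theorem itself): if each of N system states needs its own distinct corrective response, a regulator that can only distinguish K < N cases must lump at least two states together and therefore fails on at least one of them.

```python
# Sketch 4 (toy framing of the state-counting intuition, not the theorem
# itself): each system state needs its own distinct corrective response;
# a regulator distinguishing only K < N cases must lump states together.
import itertools

def best_fraction_regulated(n_system: int, n_regulator: int) -> float:
    """Brute-force the best lumping of n_system states into n_regulator
    regulator states, where each regulator state issues one response and a
    response is correct for exactly one system state."""
    best = 0
    for lumping in itertools.product(range(n_regulator), repeat=n_system):
        # each non-empty lump handles at most one of its member states correctly
        best = max(best, len(set(lumping)))
    return best / n_system

print(best_fraction_regulated(6, 4))  # 0.666...: two states always mishandled
print(best_fraction_regulated(6, 6))  # 1.0: regulator as rich as the system
```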
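Sketch 5, on the Busy Beaver point: a toy demonstration (the machines are real 2-state examples; using them as a stand-in for self-modifying code is my framing) that two machines of identical size, differing in a single transition, behave completely differently, and that in general the only way to find this out is to actually run them.

```python
# Sketch 5: you cannot read a machine's long-run behaviour off its
# transition table; in general you have to run it and see.

def run_turing_machine(rules, max_steps=1_000):
    """Simulate a 2-symbol Turing machine given as
    {(state, read_symbol): (write_symbol, head_move, next_state)};
    'H' is the halt state. Returns (halted, steps_used, ones_on_tape)."""
    tape, head, state = {}, 0, 'A'
    for step in range(1, max_steps + 1):
        write, move, state = rules[(state, tape.get(head, 0))]
        tape[head] = write
        head += move
        if state == 'H':
            return True, step, sum(tape.values())
    return False, max_steps, sum(tape.values())

# The 2-state Busy Beaver champion: halts after 6 steps leaving four 1s.
bb2 = {('A', 0): (1, +1, 'B'), ('A', 1): (1, -1, 'B'),
       ('B', 0): (1, -1, 'A'), ('B', 1): (1, +1, 'H')}

# Same size, one transition redirected: it never reaches the halt state.
variant = dict(bb2)
variant[('B', 1)] = (1, +1, 'A')

print("champion:", run_turing_machine(bb2))      # (True, 6, 4)
print("variant: ", run_turing_machine(variant))  # (False, 1000, ...)
```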
Perhaps you can dig into a few of these listed limits and come back on this?