In any case, this does not rule out that there might be computationally cheap-to-extract facts about the AI that let us make important coarse-grained predictions (such as “Is it going to kill us all?”… trying to scam people we’re simulating for money). I think this is an unrealistically optimistic picture, but I don’t see how it’s ruled out specifically by the arguments in this post.
This conclusion has the appearance of being reasonable, while skipping over crucial reasoning steps. I’m going to be honest here.
The fact that mechanistic interpretability can possibly be used to detect a few straightforwardly detectable misalignments of the kinds you are able to imagine right now does not mean that the method can be extended to detecting/simulating most or all human-lethal dynamics manifested in/by AGI over the long term.
If AGI behaviour converges on outcomes that result in our deaths through less direct routes, it really does not matter much whether the human AI researchers did an okay job at detecting “intentional direct lethality” and “explicitly rendered deception”.
One thing that might be productive would be to apply your arguments to specific examples of how people might want to use interpretability (something like the deception case I outlined above). I currently don’t know how to do that, so for now the argument doesn’t seem that forceful to me (it sounds more like one of these impossibility results that sometimes don’t matter in practice, like no free lunch theorems).
There is an equivocation here. The conclusion presumes that applying Peter’s arguments to the interpretability of misalignment cases that people like you currently have in mind is a sound and complete test of whether Peter’s arguments matter in practice – that is, for understanding the limits on what interpretability could possibly detect, across all human-lethal misalignments that would be manifested in/by self-learning/modifying AGI over the long term.
Worse, this test is biased toward best-case misalignment detection scenarios.
Particularly, it presumes that misalignments can be read out from just the hardware internals of the AGI, rather than requiring the simulation of the larger “complex system of an AGI’s agent-environment interaction dynamics” (quoting the TL;DR).
That larger complex system is beyond the memory capacity of the AGI’s hardware, and uncomputable. Uncomputable due to:
the practical compute limits of the hardware (internal input-to-output computations are a tiny subset of all physical signal interactions with AGI components that propagate across the outside world and/or feed back over time).
the sheer unpredictability of non-linearly amplifying feedback cycles (i.e. chaotic dynamics) of locally distributed microscopic changes (under constant signal-noise interference, at various levels of scale) across the global environment – see the toy sketch just after this list.
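To make that second point concrete, here is a minimal, self-contained Python sketch – my own illustration, not anything from Peter’s post – of sensitive dependence on initial conditions, using the logistic map as a stand-in for a non-linearly amplifying feedback cycle. The map itself, the parameter value r = 3.9, and the perturbation size are assumptions chosen purely for illustration.

```python
# Toy illustration only (my assumptions, not Peter's model): sensitive
# dependence on initial conditions in the logistic map, a standard
# stand-in for a non-linearly amplifying feedback cycle.

def logistic_step(x, r=3.9):
    # One feedback step; r = 3.9 puts the map in its chaotic regime.
    return r * x * (1.0 - x)

x_a = 0.4          # reference trajectory
x_b = 0.4 + 1e-12  # same state plus a "microscopic" perturbation

for step in range(1, 61):
    x_a, x_b = logistic_step(x_a), logistic_step(x_b)
    if step % 10 == 0:
        print(f"step {step:2d}: |difference| = {abs(x_a - x_b):.3e}")

# The gap grows roughly exponentially until it is of order 1, i.e. the two
# trajectories become effectively uncorrelated within a few dozen steps.
```

The point is not that AGI-environment dynamics follow a logistic map; it is only that once feedback is non-linear, a perturbation far below any realistic measurement or modelling precision can come to dominate the long-run trajectory.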
My understanding is that chaotic dynamics often give rise to emergent order. For biological systems, the copied/reproduced components inside can get naturally selected for causing dynamics that move between chaotic and orderly effects (chaotic enough to be adaptively creative across varying environmental contexts encountered over the system’s operational lifecycle; orderly enough that effects are reproduced when similar contexts reappear). But I’m not a biology researcher – would be curious to hear Peter’s thoughts!
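For what it’s worth, the same toy map can illustrate that chaos-versus-order point: whether the long-run behaviour looks orderly or chaotic can hinge on a single parameter, so it is at least plausible that selection pressure on analogous parameters could tune a system between the two regimes. The parameter values below (3.2 and 3.9) are arbitrary choices of mine, purely for illustration.

```python
# Hypothetical sketch under my own assumptions: the identical update rule
# settles into an orderly repeating cycle at one parameter value and
# behaves chaotically at another, so a small parameter shift moves the
# system between "orderly" and "chaotic" regimes.

def long_run_samples(r, x0=0.4, burn_in=500, keep=8):
    x = x0
    for _ in range(burn_in):      # discard the transient behaviour
        x = r * x * (1.0 - x)
    samples = []
    for _ in range(keep):         # sample the long-run behaviour
        x = r * x * (1.0 - x)
        samples.append(round(x, 4))
    return samples

print("r = 3.2 (orderly period-2 cycle):", long_run_samples(3.2))
print("r = 3.9 (chaotic, aperiodic):    ", long_run_samples(3.9))
```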
See further comments here.