My model for why interpretability research might be useful, translated into how I understand this post’s ontology, is mainly that it might let us make coarse-grained predictions using fine-grained insights into the model.
I think it’s obviously true that we won’t be able to make detailed predictions about what an AGI will do without running it (this is especially clear for a superintelligent AI: since it’s smarter than us, we can’t predict exactly what actions it will take). I’m not sure if you are claiming something stronger about what we won’t be able to predict?
In any case, this does not rule out that there might be computationally cheap-to-extract facts about the AI that let us make important coarse-grained predictions (such as “Is it going to kill us all?”). For example, we might figure out that the AI is running some computations that look like they’re checking whether the AI is still in a training sandbox. The output of those computations seems to influence a bunch of other stuff going on in the AI. If we intervene on this output, the AI behaves very differently (e.g. trying to scam people we’re simulating for money). I think this is an unrealistically optimistic picture, but I don’t see how it’s ruled out specifically by the arguments in this post.
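To make the shape of that experiment concrete, here is a minimal sketch, assuming a toy network and a completely made-up “sandbox-check” feature direction (standard PyTorch forward hooks; none of this refers to a real model or a real interpretability result):

```python
# Hypothetical sketch: ablate a suspected internal feature and check whether
# downstream behaviour changes. The "sandbox-check direction" is invented for
# illustration; in practice, finding it is the hard part.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),   # earlier layers
    nn.Linear(32, 32), nn.ReLU(),   # layer whose output we intervene on
    nn.Linear(32, 4),               # later layers -> "actions"
)

# Made-up direction flagged as "am I still in a training sandbox?"
sandbox_direction = torch.randn(32)
sandbox_direction /= sandbox_direction.norm()

def ablate_direction(module, inputs, output):
    # Project the suspected feature out of the activation.
    coeff = output @ sandbox_direction
    return output - coeff.unsqueeze(-1) * sandbox_direction

x = torch.randn(8, 16)
baseline = model(x)

handle = model[3].register_forward_hook(ablate_direction)
intervened = model(x)
handle.remove()

# A large behavioural gap would be (weak) evidence that this feature matters
# for the downstream "decision".
print((baseline - intervened).abs().max())
```

The intervention itself is computationally cheap; the open question is whether we can find the right feature to intervene on in the first place.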
As an analogy: while we can’t predict which moves AlphaZero is going to make without running it, we can still make very important coarse-grained predictions, such as “it’s going to win”, if we roughly know how AlphaZero works internally. You could imagine an analogous chess playing AI that’s just one big neural net with learned search. If interpretability can tell us “this thing is basically running MCTS, its value function assigns very high value to board states where it’s clearly winning, …”, we could make an educated guess that it’s a good chess player without ever running it.
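Here is a deliberately tiny analogue of that educated guess (a take-1-to-3-stones game standing in for chess, exhaustive search plus a “did I win?” evaluation standing in for MCTS plus a value function; all of it invented for illustration). Knowing only the structure of the search-based player, we can predict it will crush a random player before watching a single game, and the simulation just confirms it:

```python
# Toy game: 21 stones, players alternately take 1-3, whoever takes the last
# stone wins. One player searches exhaustively for winning moves; the other
# plays at random.
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def winning_move(stones):
    """Return a move (1-3) that forces a win from this position, else None."""
    for take in (1, 2, 3):
        if take == stones:
            return take                              # taking the last stone wins outright
        if take < stones and winning_move(stones - take) is None:
            return take                              # leaves the opponent with no winning reply
    return None

def play_one_game(stones=21):
    searcher_to_move = True                          # search-based player moves first
    while True:
        if searcher_to_move:
            move = winning_move(stones) or 1         # fall back to a legal move if losing
        else:
            move = random.randint(1, min(3, stones))
        stones -= move
        if stones == 0:
            return searcher_to_move                  # True if the searcher took the last stone
        searcher_to_move = not searcher_to_move

wins = sum(play_one_game() for _ in range(1000))
print(f"search-based player won {wins} / 1000 games")   # predictable from structure alone
```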
One thing that might be productive would be to apply your arguments to specific examples of how people might want to use interpretability (something like the deception case I outlined above). I currently don’t know how to do that, so for now the argument doesn’t seem that forceful to me (it sounds more like one of these impossibility results that sometimes don’t matter in practice, like no free lunch theorems).
Thank you so much, Erik, for your detailed and honest feedback! I really appreciate it.
I agree with you that it is obviously true that we won’t be able to make detailed predictions about what an AGI will do without running it. In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans that are likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogous to airgapping, sandboxing, and other AI control methods.
I am not completely pessimistic about interpretability of coarse-grained information, although still somewhat pessimistic. Even in systems neuroscience, interpretability of coarse-grained information has seen some successes (in contrast to interpretability of fine-grained information, which has seen very little success).
I agree that if the interpretability researcher is extremely lucky, they can extract facts about the AI that let them make important coarse-grained predictions with only a short amount of time and computational resources.
But as you said, this is an unrealistically optimistic picture. More realistically, the interpretability researcher will not be magically lucky, which means we should expect prediction-enhancing information to be extracted slowly and inefficiently.
And given that information channels are dual-use (in that the AGI can also use them for sandbox escape), we should prioritize efficient information channels like empiricism, rather than inefficient ones like fine-grained interpretability. Inefficient information channels can be net-negative, because they may be more useful for the AGI’s sandbox escape than they are to alignment researchers.
Perhaps to demonstrate that this is a practical concern rather than just a theoretical concern, let me ask the following. In your model, why did the Human Brain Project crash and burn? Should we expect interpreting AGI-scale neural nets to succeed where interpreting biological brains failed?
I agree with you that it is obviously true that we won’t be able to make detailed predictions about what an AGI will do without running it. In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans that are likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogous to airgapping, sandboxing, and other AI control methods.
I only agree with the first sentence here, and I don’t think the rest of the paragraph follows from it. I agree that being able to safely experiment on AGIs would be useful, but it’s not a replacement for what interpretability is trying to do. Deception is a good example here: how do you empirically tell whether a model is deceptive without giving it a chance to actually execute a treacherous turn? You’d have to fool the model, and there are big obstacles to that. Maybe relaxed adversarial training could help, but that’s also more of a research direction than a concrete method for now—I think for any specific alignment approach, it’s easy to find challenges. If there is a specific problem that people are currently planning to solve with interpretability, and that you think could be better solved using some other method based on safely experimenting with the model, I’d be interested to hear that example; that seems more fruitful than abstract arguments. (Alternatively, you’d have to argue that interpretability is just entirely doomed and we should stop pursuing it even lacking better alternatives for now—I don’t think your arguments are strong enough for that.)
But as you said, this is an unrealistically optimistic picture.
I want to clarify that any story for solving deception (or similarly big obstacles) that’s as detailed as what I described seems unrealistically optimistic to me. Out of all stories this concrete that I can tell, the interpretability one actually looks like one of the more plausible ones to me.
In your model, why did the Human Brain Project crash and burn? Should we expect interpreting AGI-scale neural nets to succeed where interpreting biological brains failed?
This is actually something I’d be interested to read more about (e.g. I think a post looking at what lessons we can learn for interpretability from neuroscience and attempts to understand the brain could be great). I don’t know much about this myself, but some off-the-cuff thoughts:
I think mechanistic interpretability might turn out to be intractably hard in the near future, and I agree that understanding the brain being hard is some evidence for that
OTOH, there are some advantages for NN interpretability that feel pretty big to me: we can read off arbitrary weights and activations extremely cheaply at any time, we can get gradients of lots of different things, we can design networks/training procedures to make interpretability somewhat easier, we can watch how the network changes during its entire training, we can do stuff like train networks on toy tasks to create easier versions to study, and probably more I’m forgetting right now. (A toy illustration of the first two points is sketched at the end of this comment.)
Your post briefly mentions these advantages but then dismisses them because they do “not seem to address the core issue of computational irreducibility”—as I said in my first comment, I don’t think computational irreducibility rules out the things people realistically want to get out of interpretability methods, which is why for now I’m not convinced we can draw extremely strong conclusions from neuroscience about the difficulty of interpretability.
ETA: so to answer your actual question about what I think happened with the HBP: in part they didn’t have those advantages (and without those, I do think mechanistic interpretability would be insanely difficult). Based on the Guardian post you linked, it also seems they may have been more ambitious than interpretability researchers? (i.e. actually making very fine-grained predictions)
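To illustrate just the first two of those advantages with a throwaway toy model (this is not any particular interpretability method, just the level of read access the framework gives us for free): we can record every intermediate activation and take gradients of an arbitrary scalar with a few lines of code, which has no affordable analogue for a biological brain.

```python
# Toy illustration of cheap read access: cache all intermediate activations
# via forward hooks and take gradients of an arbitrary scalar.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.Tanh(), nn.Linear(20, 2))

activations = {}

def save(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if name:                       # skip the top-level container
        module.register_forward_hook(save(name))

x = torch.randn(4, 10, requires_grad=True)
logits = model(x)

# Gradient of an arbitrary scalar (here: the mean of one logit) with respect
# to the input and to any chosen parameter.
grad_x, grad_w = torch.autograd.grad(logits[:, 0].mean(), (x, model[0].weight))

print({name: tuple(act.shape) for name, act in activations.items()})
print(grad_x.shape, grad_w.shape)
```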
In any case, this does not rule out that there might be computationally cheap-to-extract facts about the AI that let us make important coarse-grained predictions (such as “Is it going to kill us all?”… trying to scam people we’re simulating for money). I think this is an unrealistically optimistic picture, but I don’t see how it’s ruled out specifically by the arguments in this post.
This conclusion has the appearance of being reasonable, while skipping over crucial reasoning steps. I’m going to be honest here.
The fact that mechanistic interpretability can possibly be used to detect a few straightforwardly detectable misalignments of the kinds you are able to imagine right now does not mean that the method can be extended to detecting/simulating most or all human-lethal dynamics manifested in/by AGI over the long term.
If AGI behaviour converges on outcomes that result in our deaths through less direct routes, it really does not matter much whether the AI researcher humans did an okay job at detecting “intentional direct lethality” and “explicitly rendered deception”.
One thing that might be productive would be to apply your arguments to specific examples of how people might want to use interpretability (something like the deception case I outlined above). I currently don’t know how to do that, so for now the argument doesn’t seem that forceful to me (it sounds more like one of these impossibility results that sometimes don’t matter in practice, like no free lunch theorems).
There is an equivocation here. The conclusion presumes that applying Peter’s arguments to the interpretability of misalignment cases that people like you currently have in mind is a sound and complete test of whether Peter’s arguments matter in practice – that is, of the limits on what interpretability could possibly detect across all human-lethal misalignments that would be manifested in/by self-learning/modifying AGI over the long term.
Worse, this test is biased toward best-case misalignment detection scenarios.
Particularly, it presumes that misalignments can be read out from just the hardware internals of the AGI, rather than requiring the simulation of the larger “complex system of an AGI’s agent-environment interaction dynamics” (quoting the TL;DR).
That larger complex system is beyond the memory capacity of the AGI’s hardware, and is uncomputable because of:
the practical compute limits of the hardware (internal input-to-output computations are a tiny subset of all physical signal interactions with AGI components that propagate across the outside world and/or feed back over time).
the sheer unpredictability of non-linearly amplifying feedback cycles (i.e. chaotic dynamics) of locally distributed microscopic changes (under constant signal noise interference, at various levels of scale) across the global environment (a toy numerical illustration follows below this list).
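As a toy illustration of that second point (the logistic map at r = 4 is a standard minimal example of chaotic dynamics; it stands in for nothing more than the abstract claim): two trajectories that start 10⁻¹² apart become completely decorrelated within a few dozen steps, so predicting either one far ahead requires simulating every step.

```python
# Sensitive dependence on initial conditions in the logistic map (r = 4).
def logistic(x, r=4.0):
    return r * x * (1.0 - x)

a, b = 0.3, 0.3 + 1e-12            # microscopically different starting points
for step in range(1, 61):
    a, b = logistic(a), logistic(b)
    if step % 10 == 0:
        print(f"step {step:2d}: |a - b| = {abs(a - b):.3e}")
```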
My understanding is that chaotic dynamics often give rise to emergent order. For biological systems, the copied/reproduced components inside can get naturally selected for causing dynamics that move between chaotic and orderly effects (chaotic enough to be adaptively creative across varying environmental contexts encountered over the system’s operational lifecycle; orderly enough that effects are reproduced when similar contexts reappear). But I’m not a biology researcher – would be curious to hear Peter’s thoughts!
See further comments here.