Thanks for writing this! It’s always good to get critical feedback about a potential alignment direction to make sure we aren’t doing anything obviously stupid. I agree with you that fine-grained prediction of what an AGI is going to do in any situation is likely computationally irreducible even with ideal interpretability tools.
I think there are three main arguments for interpretability which might well be cruxes.
As Erik says, interpretability tools potentially let us make coarse-grained predictions about the model utilizing fine-grained information. While predicting everything the model will do is probably not feasible in advance, it might be very possible to get pretty detailed predictions of coarse-grained properties such as ‘is this model deceptive’, ‘does it have potentially misaligned mesaoptimizers’, ‘does its value function look reasonably aligned with what we want given the model’s ontology’, ‘is it undergoing / trying to undergo FOOM’. The model’s architecture might also be highly modular, so that we could potentially understand/bound a lot of the behaviour of the model that is alignment-relevant while only understanding a small part. This seems especially likely to me if the AGI’s architecture is hand-designed by humans – i.e. there is a ‘world model’ part and a ‘planner’ part and a ‘value function’ and so forth. We can then potentially get a lot of mileage out of just interpreting the planner and value function, while the exact details of how the model represents, say, chairs in the world model are less important for alignment.
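To make the ‘coarse-grained prediction from fine-grained information’ idea concrete, here is a minimal sketch: a linear probe trained on internal activations to predict a single coarse-grained label. The activations and labels below are random stand-ins, not a real deception detector.

```python
# Minimal sketch: fine-grained internals in, one coarse-grained prediction out.
# The activations and labels are hypothetical stand-ins, not real interpretability data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 512))    # stand-in for residual-stream activations
is_deceptive = rng.integers(0, 2, size=1000)  # stand-in coarse-grained labels

probe = LogisticRegression(max_iter=1000).fit(activations, is_deceptive)
print("probe accuracy on its training set:", probe.score(activations, is_deceptive))
```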
What we ultimately want is likely a statistical-mechanics-like theory of how neural nets learn representations: what circuits/specific computations they tend to perform, how they evolve during training, what behaviours these give rise to, how they behave off-distribution, etc. Having such a theory would be super important for alignment (although it would not solve it directly). Interpretability work provides key bits of evidence that can be generalized to build this theory.
Interpretability tools could let us perform highly targeted interventions on the system without needing to understand the full system. This could potentially involve directly editing out mesaoptimizers or deceptive behaviours, or adjusting goal misgeneralization by tweaking the internal ontology of the model. There are many cases in science where we can produce reliable and useful interventions in systems with only partial information and understanding of their most fine-grained workings. Nevertheless, we need some understanding, and this is what interpretability appears to give a reliable path to.
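As a toy sketch of what such a targeted intervention could look like mechanically (the model and the choice of which features to ablate are hypothetical; picking that component correctly is exactly what interpretability would have to supply):

```python
# Toy sketch of a targeted intervention: zero part of one flagged layer's output via a
# forward hook, leaving the rest of the network untouched. Everything here is a stand-in.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),   # pretend interpretability flagged this layer
    nn.Linear(32, 4),
)

def ablate_hook(module, inputs, output):
    output = output.clone()
    output[:, :8] = 0.0             # zero the hypothetically implicated features
    return output

handle = model[2].register_forward_hook(ablate_hook)
print(model(torch.randn(5, 16)))    # behaviour with the targeted edit applied
handle.remove()                     # the rest of the system is left untouched
```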
Should we expect interpreting AGI-scale neural nets to succeed where interpreting biological brains failed?
There are quite a lot of reasons why we should expect interpretability to be much easier than neuroscience:
We know the exact underlying computational graph of our models – in neuroscience this would be akin to starting out knowing exactly how neurons, synapses, etc. work, as well as knowing the full connectome and the large-scale architecture of the brain.
We know the exact learning algorithm our models use – in neuroscience this would be starting out knowing, say, the cortical update rule as well as the brain’s training objective/loss function.
We know the exact training data our models are trained on.
We can experiment on copies of the same model, as opposed to different animals with different brains / training data / life histories etc.
We can instantly read all activations, weights, and essentially any quantity of interest simultaneously and with perfect accuracy – simply being able to read neuron firing rates is very difficult in neuroscience, and we have basically no ability to read large numbers of synaptic weights.
We can perform arbitrary interventions at arbitrarily high fidelity on our NNs.
These points mean experimental results are orders of magnitude faster and easier to get. A typical interpretability experiment looks like: load the model into memory, perform a precise intervention on it, look at a huge number of possible outputs, iterate. Neuroscience experiments often look like: train mice on some task for months; insert probes or perform some broad-based intervention where you are not sure exactly what you are measuring or what your intervention actually affected; get a bunch of noisy data from a small sample, with potential systematic errors/artifacts from your measurement process, where you can only read a tiny fraction of what you would like to read; then try to understand what is going on. It is much harder and slower!
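To make the contrast concrete, here is a minimal toy sketch of that interpretability loop (the model and intervention are illustrative stand-ins, not a real experiment):

```python
# Toy sketch of the iteration loop: load once, read internals exactly, apply a precise
# intervention, measure the effect on outputs, repeat.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
x = torch.randn(64, 8)
baseline = model(x)

for unit in range(16):                      # iterate over precise interventions
    def zero_unit(m, i, o, u=unit):
        o = o.clone()
        o[:, u] = 0.0                       # ablate exactly one hidden unit
        return o
    handle = model[0].register_forward_hook(zero_unit)
    delta = (model(x) - baseline).abs().mean().item()
    handle.remove()
    print(f"unit {unit}: mean output change {delta:.4f}")
```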
Secondly, the Blue Brain Project is just one example of a high-profile failure in neuroscience, and we shouldn’t generalize too much based on it. I have had no experience with the Blue Brain Project, but it seems plausible to me that this was just a garden-variety case of mismanagement, over-optimistic goals, and hype. My perspective, as someone who has worked in neuroscience, is that the field just keeps chugging along accumulating knowledge and is advancing at a reasonable pace – i.e. our understanding of the brain has improved vastly over the last 20 years. Not at the pace probably required for alignment on short timelines but at a decent level for a scientific field.
Given this, it is not clear to me that interpretability is doomed to failure, and it seems like a reasonably high-EV bet.
In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans that are likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogous to airgapping, sandboxing, and other AI control methods.
If by ‘empiricism’ as a channel you mean ‘just run the model and see what it does’, then while some amount of empiricism is necessary, it is also a highly dangerous channel to observe, and so we should be trying to get as many bits as possible without requiring it. Pre-deployment interpretability is one way to potentially get a lot of bits of information (although I agree not all the bits) without having to expose yourself to this danger. Hopefully interpretability, even using only fairly coarse-grained information, can already eliminate a lot of ‘obviously dangerous AGIs’ before you have to run them.
Some empiricism will likely be necessary though, and I agree that designing highly secure test-harnesses for the AGI, akin to BSL-4 labs, will be another necessary component of safety. However, high-powered interpretability will be a vital safety measure to be applied inside such a safe test harness. Ideally, while the AGI is running in some simulation, we can use interpretability tools to constantly scan it for deception, for realization that it is in a simulation, or for any other kind of malicious intent, and then abort or intervene to fix it if this is the case. I think just relying on airgapping and other physical measures is a bad idea vs a superintelligence and in practice we should be trying to build up as many uncorrelated lines of defense as possible in the test-harness.
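As a rough sketch of what such monitoring could look like inside the harness (the simulation step, the activation reader, and the deception probe below are all hypothetical placeholders for whatever the real harness and interpretability stack would provide):

```python
# Toy sketch of interpretability-based monitoring inside a sandboxed rollout.
# `run_one_step`, `get_activations`, and `deception_probe` are hypothetical placeholders.
ABORT_THRESHOLD = 0.9

def monitored_rollout(model, env, run_one_step, get_activations, deception_probe,
                      max_steps=1000):
    for step in range(max_steps):
        run_one_step(model, env)                        # advance the simulation one step
        score = deception_probe(get_activations(model)) # scan internals for red flags
        if score > ABORT_THRESHOLD:
            return {"aborted": True, "step": step, "score": score}
    return {"aborted": False, "steps": max_steps}
```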
And given that information channels are dual-use (in that the AGI can also use them for sandbox escape), we should prioritize efficient information channels like empiricism, rather than inefficient ones like fine-grained interpretability. Inefficient information channels can be net-negative, because they may be more useful for the AGI’s sandbox escape compared to their usefulness to alignment researchers.
This is an interesting objection. While it is possible that the AGI could hack interpretability channels, it seems much more difficult than just communicating in the normal way by running in the world – i.e. the ‘empirical channel’. This is because it probably doesn’t know (and we should be careful to make sure that it definitely does not know) what interpretability tools we are using on it, and because interpretability tools can be applied before we actually run the AGI (i.e. directly to the weights) or to individual components in isolation.
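For instance, some analyses can be run on the weights alone, without ever executing the model. A toy sketch, with a random matrix standing in for a real checkpoint’s weights:

```python
# Toy sketch of weight-only (pre-execution) analysis: inspect the singular-value
# spectrum of a weight matrix without running a single forward pass.
import numpy as np

W = np.random.default_rng(0).normal(size=(512, 512))   # stand-in for real weights
singular_values = np.linalg.svd(W, compute_uv=False)
energy = np.cumsum(singular_values**2) / np.sum(singular_values**2)
print("effective rank at 99% energy:", int(np.searchsorted(energy, 0.99)) + 1)
```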
Thank you so much for your insightful and detailed response, Beren! I really appreciate your time.
The cruxes seem very important to investigate.
This seems especially likely to me if the AGI’s architecture is hand-designed by humans – i.e. there is a ‘world model’ part and a ‘planner’ part and a ‘value function’ and so forth.
It probably helps to have the AGI’s architecture hand-designed to be more human-interpretable. My model is that, on the spectrum from high-complexity paradigms (e.g., deep learning) to low-complexity paradigms (e.g., software design by a human software engineer), having the AGI’s architecture be hand-designed moves away from the former and towards the latter, which helps reduce computational irreducibility and thereby increase out-of-distribution predictability (e.g., on questions like “Is the model deceptive?”).
However, my guess is that in order for out-of-distribution predictability of the system to be nontrivial, one would need to go substantially towards the low-complexity end of the spectrum. This would make it unlikely for the model to achieve AGI-like capabilities.
What we ultimately want is likely a statistical-mechanics-like theory of how neural nets learn representations: what circuits/specific computations they tend to perform, how they evolve during training, what behaviours these give rise to, how they behave off-distribution, etc.
It would be great if we could get a statistical-mechanics-like theory. Realistically, though, neural nets are probably better described as small-to-middle-number systems, because local interactions are quite relevant in affecting system behavior. In that case, there would be a general lack of the homogeneous mixing necessary for a purely statistical-mechanics view to be applicable.
Interpretability tools could let us perform highly targeted interventions on the system without needing to understand the full system.
I can’t rule it out (gene editing is the optimistic outside view here), but I expect such interventions to be imperfect and vulnerable to unforeseeable side effects. These interventions would likely require immense trial-and-error to achieve nontrivial out-of-distribution predictability.
My perspective, as someone who has worked in neuroscience, is that the field just keeps chugging along accumulating knowledge and is advancing at a reasonable pace – i.e. our understanding of the brain has improved vastly over the last 20 years. Not at the pace probably required for alignment on short timelines but at a decent level for a scientific field.
I agree with the characterization “not at the pace probably required for alignment on short timelines but at a decent level for a scientific field.” Indeed, the main point I wanted to make was my skepticism that interpretability research on AGI-scale models will progress fast enough to solve alignment in a short amount of time.
I think just relying on airgapping and other physical measures is a bad idea vs a superintelligence and in practice we should be trying to build up as many uncorrelated lines of defense as possible in the test-harness.
I completely agree, as long as the lines of defense are net-positive rather than net-negative. An information channel should only be opened if it helps the alignment researchers (to align the AGI) more than it helps the likely by-default misaligned AGI (to prematurely escape its sandbox).
This is an interesting objection. While it is possible that the AGI could hack interpretability channels, it seems much more difficult than just communicating in the normal way by running in the world – i.e. the ‘empirical channel’. This is because it probably doesn’t know (and we should be careful to make sure that it definitely does not know) what interpretability tools we are using on it, and because interpretability tools can be applied before we actually run the AGI (i.e. directly to the weights) or to individual components in isolation.
My guess is that the reward-risk tradeoff of empirical channels—empiricism in the deployment environment or a precise replica of the deployment environment—is more likely to be favorable than that of interpretability channels. This is because the reward is high (empiricism efficiently yields predictive information) while ways to mitigate the risk are human-understandable (as opposed to “complexity is the enemy of security”).
If the AGI doesn’t have prior knowledge of our interpretability plans and how to exploit them (e.g., by deceiving us into a false sense of security), then I would be much more optimistic about the reward-risk tradeoff of interpretability channels! This is why we should keep our interpretability plans inaccessible to the AGI as much as possible. Since there is plausibly a significant amount of secrecy value in interpretability plans, we should:
preemptively keep our interpretability plans off of the Internet,
filter these plans from the AGI’s training set (a toy filtering sketch follows this list), and
ensure that the AGI cannot inspect its own internal data.
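Here is a toy sketch of the second measure. The blocked terms and documents are made up; a real filter would need classifiers and provenance metadata rather than a keyword list.

```python
# Toy sketch of filtering interpretability-plan material out of a training corpus.
BLOCKED_TERMS = ("interpretability plan", "deception probe", "sandbox monitoring")  # hypothetical

def filter_training_corpus(documents):
    """Keep only documents that mention none of the blocked terms."""
    return [doc for doc in documents
            if not any(term in doc.lower() for term in BLOCKED_TERMS)]

corpus = [
    "A recipe for sourdough bread.",
    "Our interpretability plan: run the deception probe on layer 12.",  # filtered out
]
print(filter_training_corpus(corpus))
```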
I have been campaigning for these measures, but many AI safety researchers have generally opposed adopting Measure #1. I hope this will change going forward!