Thank you for this critique! Critiques like this are always helpful for homing in on the truth.
So as far as I understand your text, you argue that fine-grained interpretability loses out against “empiricism” (running the model) because of computational intractability.
I generally disagree with this. beren points out many of the same critiques of this piece as I would raise. Additionally, the arguments seem underspecified; there is not enough in-depth argumentation to support the points you make. Strong upvote for writing them out, though!
You emphasize the Human Brain Project (HBP) quite a lot, even in the comments, as an example of a failed large-scale attempt to model a complex system. I think this characterization is correct, but it does not seem to generalize beyond the project itself; its failure looks as much like a project-management and strategy problem as anything else. Benes’ comment gives good further reasoning on this, and on why ANNs seem significantly more tractable to study than the brain.
Additionally, you argue that interpretability and ELK won’t succeed simply because of the intractability of fine-grained interpretability. I have two points against this view:
1. Mechanistic interpretability has clearly already yielded quite a lot of interesting and novel insights into neural networks and their causal structure since the field’s inception 7 years ago.
It seems premature to disregard the plausibility of the agenda itself, just as it is premature to disregard the project of neuroscience based on HBP. Now, arguing that it’s a matter of speed seems completely fine but this is another argument and isn’t emphasized in the text.
2. Mechanistic interpretability does not seem to be working on fine-grained interpretability (?).
Maybe it’s just my misunderstanding of what you mean by fine-grained interpretability, but we don’t need to figure out what neurons do; we literally design them. So the inspections happen at the feature level, which is much more high-level than investigating individual neurons (though sometimes these features do seem to be represented in single neurons). The circuits paradigm also generally looks at neural networks the way systems neuroscience does, interpreting causal pathways in the models (of course with radical methodological differences because of the architectures and the computational medium). The mechanistic interpretability project does not seem misguided by an idealization of neuron-level analysis and will probably adopt any new strategy that seems promising.
For examples of work in this paradigm that seem promising, see interpretability in the wild, ROME, the superposition exposition, the mathematical understanding of transformers, and the results from the interpretability hackathon. For an introduction to features as the basic building blocks (as opposed to neurons), see Olah et al.’s work (2020).
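To gesture at the neuron-versus-feature distinction with a minimal, purely illustrative sketch (random activations rather than a real model; the dimensions and indices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in activations for one layer: 1000 inputs, 64 "neurons".
activations = rng.normal(size=(1000, 64))

# Neuron-level reading: one column of the activation matrix.
neuron_17 = activations[:, 17]

# Feature-level reading (in the Olah et al. framing): a direction in activation
# space, i.e. a weighted combination of many neurons.
feature_direction = rng.normal(size=64)
feature_direction /= np.linalg.norm(feature_direction)
feature_activation = activations @ feature_direction

# A feature "represented in a single neuron" is just the special case where
# the direction is a standard basis vector.
assert np.allclose(activations @ np.eye(64)[17], neuron_17)
```

The analysis target is the direction (and the circuits of directions across layers), not the individual unit.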
When it comes to your characterization of the “empirical” method, this seems fine, but it doesn’t conflict with interpretability. It seems you wish to build a game-theory-like understanding of the models, or to have them act in environments so we can investigate their faults? Do you want to do model distillation using circuits analyses, or do you want AI to play within larger environments?
I struggle to see what specific agenda follows from this that isn’t already pursued by a lot of other projects, e.g. AI psychology and building test environments for AI. I do see potential in expanding the work here, but I see that for interpretability as well.
Again, thank you for the post, and I always like it when people cite McElreath, though I don’t see how his arguments apply to interpretability, since we don’t model neural networks with linear regression at all. Not even scaling laws use such simplistic modeling; see, e.g., Ethan’s work.
Thank you so much for your detailed and insightful response, Esben! It is extremely informative and helpful.
> So as far as I understand your text, you argue that fine-grained interpretability loses out against “empiricism” (running the model) because of computational intractability.
> I generally disagree with this. beren points out many of the same critiques of this piece as I would raise. Additionally, the arguments seem underspecified; there is not enough in-depth argumentation to support the points you make. Strong upvote for writing them out, though!
The main benefit of interpretability, if it can succeed, is that one can predict harmful future behavior (that would have occurred when deployed out-of-distribution) by probing internal data. This allows the researchers to preemptively prevent the harmful behavior: for example, by retraining after detecting deceptive intent. If this is scientifically possible, it would be a substantial benefit, especially since it is generally difficult to obtain out-of-distribution predictions from atheoretical empiricism.
However, I am skeptical that interpretability can achieve nontrivial success in out-of-distribution predictions, especially in the amount of time alignment researchers will realistically have. The reason is that deceptive intent is likely a fine-grained trait at the internal-data level (rather than at the behavioral level). Consequently, computational irreducibility is likely to impose a hard bound on predicting deceptive intent out-of-distribution, at least when assuming realistic amounts of time and resources.
My guess is that detecting deceptive intent solely from a neural net’s internal data is probably at least as fine-grained as behavioral genetics or neuroscience. These fields have made some progress, but preemptively predicting behavioral traits from internal data remains mostly unsolved.
For example, consider a question analogous to that of deceptive misalignment: ‘Is the given genome optimized for inclusive fitness, or is it optimized for a proxy goal that deviates from inclusive fitness in certain historically unprecedented environments?’ We know that evolutionary pressures select for maximizing inclusive fitness. However, the genome is optimized not for inclusive fitness, but for a proxy goal (survive and engage in sexual intercourse) that deviates from inclusive fitness in environments that are sufficiently distinct from ancestral environments.
How did scientists find out that the genome is optimized for a proxy goal? Almost entirely from behavior. We have a coarse-grained behavioral model that is quite good and generalizable. Evolution shaped animals’ behavior towards a drive for sexual intercourse, but historically unprecedented environmental changes (e.g., widespread availability of birth control) have made this proxy goal distinct from inclusive fitness. Parsimonious models based on first principles that are likely to be correct, like the above one, have a realistic chance of achieving situation-specific predictability that generalizes out-of-distribution.
In contrast, there is still very little understanding of which genes interact to cause animals’ sex drive. Which genes affect sex drive? Probably a substantial proportion of them, and they probably interact in interconnected and nonlinear ways (including with the extremely complex, multidimensionally varying environment) to produce behavioral traits in an unpredictable manner. Moreover, a lot of the information needed to predict behavioral traits like sex drive will lie in the specific environment and how it interacts with the genome. Only the most coarse-grained of these interaction dynamics will be predictable via bypassing empiricism with a statistical-mechanics-like model, due to computational irreducibility. And such a coarse-grained model will likely be rooted in behavior-based abstractions.
Deep-learning neural nets do come with an advantage lacked by behavioral genetics and neuroscience: a potentially complete knowledge of the internal data, the environmental data, and the data of their interaction throughout the whole training process.
But there is a missing piece: complete knowledge of the deployment environment. Any internals-based model of deceptive intent that alignment researchers can come up with is only guaranteed to hold in the subset of environments that the researchers have empirically tested. In the subset of environments that the polycausal model has not been tested in, there is no a priori reason that the model will generalize correctly. A barrier to generalizability is posed by the nonlinear and interconnected interactions between the neural net’s internals and the unprecedented environment, which can and likely will manifest differently depending on the environment. Relaxed adversarial training can help test a wider variety of environments, but this is still hampered by the blind spot of being unable to test the subset of environments that cannot be instantiated at human-level capabilities (e.g., the environment in which RSA encryption is broken). Thus, my guess is that the intrinsic out-of-distribution predictability of the AGI neural net’s behavior would be low, just like that of behavioral genetics or neuroscience.
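To make this concrete, here is a deliberately contrived toy sketch (the setup, data, and probe are hypothetical illustrations of my own, not a claim about any actual interpretability method): a linear probe trained on internal activations detects a trait reliably in the environments it was fit on, yet fails once an unseen environment encodes the same trait differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, flip=False):
    """Toy 'internal data': two activation dimensions plus a hidden binary trait.
    In the tested environments the trait is encoded along dimension 0; in the
    unseen environment that encoding is flipped (a deliberately contrived shift)."""
    trait = rng.integers(0, 2, n)
    sign = -1.0 if flip else 1.0
    dim0 = sign * (2 * trait - 1) + rng.normal(0, 0.5, n)  # trait-carrying direction
    dim1 = rng.normal(0, 1.0, n)                           # nuisance direction
    return np.column_stack([dim0, dim1]), trait

# Fit a linear probe on activations from environments we were able to test.
X_train, y_train = sample(2000, flip=False)
probe = LogisticRegression().fit(X_train, y_train)

X_id, y_id = sample(2000, flip=False)   # same kind of environment as training
X_ood, y_ood = sample(2000, flip=True)  # environment never tested empirically

print("accuracy in tested environments:   ", probe.score(X_id, y_id))    # near 1.0
print("accuracy in the unseen environment:", probe.score(X_ood, y_ood))  # near 0.0
```

The point is not that real internals-based detectors would fail this dramatically; it is that nothing about fitting a model on empirically tested environments guarantees the internal encoding stays fixed in untested ones.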
For a conceptual example, consider the fact that the dynamics of cellular automata can change drastically with just one cell’s change in the initial conditions. See Figure 1 of Beckage et al. (Code 1599 in Wolfram’s A New Kind of Science).
In general, the only way to accurately ascertain how a computationally irreducible agent will behave in a complex environment is to run it in that environment. Even with complete knowledge of the agent’s internals, incomplete knowledge of the environment is sufficient to constrain a priori predictability. I expect that many predictions yielded by interpretability tools in the pre-deployment environment will fail to generalize to the post-deployment environment, unless the two are equal.
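As a self-contained illustration of this sensitivity (a toy sketch using the simple elementary Rule 30 automaton, not a reproduction of the three-color Code 1599 result from the figure):

```python
import numpy as np

def step(state, rule=30):
    """One update of an elementary (two-state, nearest-neighbor) cellular
    automaton with periodic boundaries."""
    rule_table = np.array([(rule >> i) & 1 for i in range(8)], dtype=np.uint8)
    left, right = np.roll(state, 1), np.roll(state, -1)
    return rule_table[4 * left + 2 * state + right]

def run(initial, steps):
    states = [initial]
    for _ in range(steps):
        states.append(step(states[-1]))
    return np.array(states)

n_cells, n_steps = 201, 200
a = np.zeros(n_cells, dtype=np.uint8)
a[n_cells // 2] = 1                 # single seed cell
b = a.copy()
b[n_cells // 2 + 1] = 1             # identical except for one extra cell

# Fraction of cells on which the two runs disagree, at each time step.
disagreement = (run(a, n_steps) != run(b, n_steps)).mean(axis=1)
print(disagreement[0], disagreement[n_steps // 2], disagreement[n_steps])
# The disagreement starts at 1/201 and keeps spreading instead of dying out.
```

The one-cell perturbation does not stay localized; the only way to know which cells end up differing is to run both systems forward.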
> It seems premature to disregard the plausibility of the agenda itself, just as it is premature to disregard the project of neuroscience based on HBP. Now, arguing that it’s a matter of speed seems completely fine but this is another argument and isn’t emphasized in the text.
Sorry for the miscommunication! I meant to say that the rate at which mechanistic interpretability will yield useful, generalizable information is slow, not zero.
But even a slow rate is sufficient for concern, because informational channels are dual-use; the AGI can use them for sandbox escape. We should only open an interpretability channel if the rate of scientific benefit exceeds the rate of cost (the risk of premature sandbox escape by a misaligned AGI).
My opinion is that while mechanistic interpretability has made some progress, it is not progressing fast enough to solve alignment within the time and computational resources we will realistically have. So far, the rate of progress in interpretability research has been substantially outpaced by that in AI capabilities research. I think this was predictable, given what we know about computational irreducibility.
> Maybe it’s just my misunderstanding of what you mean by fine-grained interpretability, but we don’t need to figure out what neurons do; we literally design them. So the inspections happen at the feature level, which is much more high-level than investigating individual neurons (though sometimes these features do seem to be represented in single neurons). The circuits paradigm also generally looks at neural networks the way systems neuroscience does, interpreting causal pathways in the models (of course with radical methodological differences because of the architectures and the computational medium). The mechanistic interpretability project does not seem misguided by an idealization of neuron-level analysis and will probably adopt any new strategy that seems promising.
Roughly speaking, there is a spectrum between high-complexity paradigms of design (e.g., deep learning) and low-complexity, modular paradigms of design (e.g., software design by a human software engineer). My guess is that for many complex tasks, the optimal equilibrium strategy can be achieved only by the former, and that attempting to move meaningfully towards the latter end of the spectrum will result in sacrificing performance. For example, I expect that we won’t be able to build AGI via modular software design by a human software engineer, but that we will be able to build it by deep learning.
> Again, thank you for the post, and I always like it when people cite McElreath, though I don’t see how his arguments apply to interpretability, since we don’t model neural networks with linear regression at all. Not even scaling laws use such simplistic modeling; see, e.g., Ethan’s work.
In Ethan’s scaling law, extrapolatory generalization is only guaranteed to be valid locally (“perfectly extrapolate until the next break”), not globally. This is completely consistent with my prior. My assertion was that in order to globally extrapolate empirical findings to an unknown deployment environment, only simple models have a nontrivial chance of working (assuming realistic amounts of time and computational resources). These simple models will likely be based on parsimonious first principles that we have strong reason to believe hold even in the unknown environment. Consequently, they will likely be based largely on behavioral data rather than on the internal data of the agent-environment interaction dynamics.
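To illustrate the “locally valid extrapolation” point with a toy example (the piecewise power-law curve below is my own illustrative stand-in, not the functional form from Ethan’s paper): a single power law fit to observations before a break extrapolates well within that regime and increasingly poorly past it.

```python
import numpy as np

# Toy "broken" scaling curve: loss follows one power law up to a break point,
# then a different (shallower) power law afterwards. Purely illustrative.
def broken_power_law(n, break_at=1e6, a=5.0, p1=0.5, p2=0.1):
    loss = a * n ** (-p1)
    past = n > break_at
    loss_at_break = a * break_at ** (-p1)
    loss[past] = loss_at_break * (n[past] / break_at) ** (-p2)
    return loss

n_obs = np.logspace(3, 6, 20)   # regime we can observe (before the break)
n_far = np.logspace(7, 9, 3)    # far-out-of-range regime (after the break)

# Fit a single power law (a line in log-log space) to the observed regime only.
slope, intercept = np.polyfit(np.log(n_obs), np.log(broken_power_law(n_obs)), 1)
predicted = np.exp(intercept + slope * np.log(n_far))
actual = broken_power_law(n_far)

for n, pred, act in zip(n_far, predicted, actual):
    print(f"N={n:.0e}  extrapolated loss={pred:.2e}  actual loss={act:.2e}")
# The extrapolation is accurate within the observed regime but increasingly
# wrong past the break, i.e. it is only locally valid.
```

A fit that looks excellent within the observed regime says little about what happens after the next break, which is exactly why I only trust global extrapolation when it is anchored in parsimonious first principles.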