Taming the Fire of Intelligence
A recent article by Max Hellriegel Holderbaum and me raised the worry that AI alignment might be impossible in principle. The article has not gotten the attention it deserves, and I suspect there are three main reasons for this. First, it is quite long. Second, a huge chunk of it develops an intuition-based argument that is inessential to the article’s main point. Third, we inadvisably released it on April Fools’ Day. In this post I will try to remedy these flaws and present what I take to be our main point more concisely.
Very roughly, the original article argued, first, that value-aligning an AI system requires that we be able to predict some aspects of its behavior, and second, that there are reasons to believe such predictions may be impossible for sufficiently intelligent systems. As a result, we worry that the alignment paradigm may be an infertile framework for thinking about AI safety. Let’s go through these points in order.
Prediction
The first thesis is that to value-align a given AI system we have to be able to predict some aspects of its behavior.
Predicting the behavior of complex computational systems is hard, and in many instances impossible, if by prediction we mean knowing the result of a computational procedure without running it. The impossibility of making such predictions reliably is arguably a motivating factor behind the shift from the control problem, the problem of controlling AI systems once they are created, to the alignment problem, the problem of creating AI systems whose goals and values are aligned with our own. The emphasis on values frees one from the obligation of making precise predictions about what some AI system will do. If it could be shown, however, that knowing the values of a system requires being able to predict some aspects of its actions, then the emphasis on alignment would be in vain.
To see that prediction is more fundamental than alignment, I will consider three popular ways of thinking about utility functions and show that none of them can justify the emphasis on alignment over prediction. The first view is the dispositionalist one. Here utility functions are conceived as generalizations about the actual dispositions a given system has. Rabbits like carrots in the sense that they tend to eat them. Such an understanding of utility functions is prevalent both in the decision-theoretic paradigm of conceptualizing them in terms of betting dispositions and in contemporary active inference theories of cognition that frame utility functions in terms of random dynamical attractors. In the context of the present discussion it should be evident that achieving value alignment presupposes extensive predictive abilities regarding the system in question.
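To make the dispositionalist picture concrete, here is a minimal Python sketch of reading a “utility function” off observed choice dispositions. The data and function names are purely illustrative, not from the original article; the point it makes vivid is that on this view the utility function is just a summary of observed behavior, so knowing it presupposes being able to observe or predict a representative sample of what the system does.

```python
from collections import Counter

def estimate_utilities(observed_choices):
    """Summarize a system's dispositions as crude relative utilities:
    how often each option is chosen when it is on offer."""
    offered = Counter()
    chosen = Counter()
    for options, choice in observed_choices:
        offered.update(options)
        chosen[choice] += 1
    return {option: chosen[option] / offered[option] for option in offered}

# A rabbit's "utility function", read off its eating dispositions.
observations = [
    (["carrot", "lettuce"], "carrot"),
    (["carrot", "hay"], "carrot"),
    (["lettuce", "hay"], "hay"),
]
print(estimate_utilities(observations))
# -> {'carrot': 1.0, 'lettuce': 0.0, 'hay': 0.5}
```

The estimate is only as good as the sample of behavior it is built from, which is precisely the predictive burden at issue.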
A second view of utility functions is learning-theoretic. It holds that utility functions are things that intelligent systems abstract from their training data. For instance, reinforcement learning systems that learn to play chess typically learn from relatively sparse training data that some board positions are more desirable than others. On this understanding, however, alignment researchers face a well-known problem: knowing which utility function a system has in fact learned. One can find out either by running the system, which is feasible for chess engines but not for potentially dangerous AGIs, or by having some independent predictive strategy. Once again, knowledge of utility functions requires prediction.
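A toy illustration of the epistemic situation, with a stand-in black box in place of a real trained network: the only direct access to the learned utility function is to evaluate the system, which is exactly the option that is unavailable for systems too dangerous to run.

```python
# Stand-in for a learned chess evaluator: from the outside it is a black box
# mapping positions to values. In a real system this would be a trained
# network; here it is a toy placeholder so the sketch runs on its own.
def learned_value(position: str) -> float:
    return (sum(map(ord, position)) % 200 - 100) / 100.0  # arbitrary toy score

def probe_learned_utility(positions):
    """The only direct access to the learned utility: evaluate it, i.e. run the system."""
    return {p: learned_value(p) for p in positions}

# Feasible for a handful of board positions...
print(probe_learned_utility(["opening position", "fork on f7", "back-rank mate"]))
# ...but not for an AGI whose "positions" are open-ended world states,
# where running the system is precisely what might be unsafe.
```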
A third view of utility functions is that they are hard-wired features of AI systems, directly engineered into them. The problem with this view is that it is a technological fantasy. Current AI systems do not have hard-wired utility functions over and above their loss functions, and knowing a system’s loss function is no great help in predicting its behavior. There is no reason to believe that this will change any time soon, or ever for that matter.
As an aside, the alignment paradigm of reinforcement learning from human feedback (RLHF) offers an interesting case study here. This approach precisely does not answer the demand for predictability, because it does not render the systems in question any more predictable. That is to say, without some additional method for predicting what the AI systems in question will do, we simply cannot know whether RLHF will work, i.e. whether the human feedback will generalize to the contexts in which we want the system to be safe.
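For readers unfamiliar with the mechanics, here is a minimal, purely illustrative sketch of the first stage of RLHF: a Bradley–Terry-style reward model fit to pairwise human preferences. The features and data are toy assumptions, not any lab’s actual setup. It makes the worry concrete: the reward model is only constrained where feedback was actually given, and what it, or a policy optimized against it, does elsewhere is the prediction problem all over again.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Toy feature map for a 'completion' x (here just a 2D vector)."""
    return np.asarray(x, dtype=float)

# Human comparison data: (preferred, rejected) pairs drawn from a narrow region.
comparisons = [
    (rng.normal(loc=[1.0, 0.0], scale=0.1), rng.normal(loc=[0.0, 1.0], scale=0.1))
    for _ in range(200)
]

# Fit a linear reward r(x) = w . phi(x) so preferred completions score higher
# than rejected ones (gradient ascent on the Bradley-Terry log-likelihood).
w = np.zeros(2)
lr = 0.1
for _ in range(500):
    grad = np.zeros(2)
    for preferred, rejected in comparisons:
        d = phi(preferred) - phi(rejected)
        p = 1.0 / (1.0 + np.exp(-w @ d))  # P(preferred beats rejected)
        grad += (1.0 - p) * d
    w += lr * grad / len(comparisons)

print("learned reward weights:", w)
# The reward model is pinned down only where humans gave feedback; far outside
# that region its scores, and hence the behavior of a policy optimized against
# them, are unconstrained -- which is just the prediction problem again.
print("reward far outside the feedback distribution:", w @ phi([50.0, -80.0]))
```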
I hope it has become clear that the difficulties for the alignment paradigm are quite systematic. As we said in our original article, the prediction problem, the problem of predicting the behavior of AI systems, is more fundamental than either the control problem or the alignment problem. And while the alignment problem may seem hard but solvable, it is unclear whether the prediction problem is solvable at all.
Computation
The second thesis is that predicting the behavior of advanced AI systems may be strictly impossible.
This is not to say that we are certain of this, but we think there are reasons to take the possibility very seriously. In our original article Max and I investigated two arguments for this conclusion. One is built on computability theory, the other on intuitive assumptions about the structure of intelligence. As the latter argument requires detailed discussion and this article is meant to be a brief overview, I will focus here on the argument from computability.
The argument from computability goes like this. There cannot be an algorithm that decides, for every Turing machine, whether it will eventually halt. That is the halting problem. From this it is easy to derive that there cannot be an algorithm that decides, for any other algorithm, whether that algorithm will perform some given action Ω. For there is a possible algorithm that, on some input, feeds that input into an arbitrary Turing machine and, if that Turing machine halts, does Ω. Any algorithm able to predict this algorithm would thereby solve the halting problem, so there can be no such general prediction algorithm.
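The reduction can be written down in a few lines. The following Python sketch uses hypothetical names throughout (`will_do_omega` is the general predictor the argument rules out); it is an illustration of the argument, not working prediction code.

```python
def do_omega():
    """The distinguished action Ω whose occurrence we would like to predict."""
    print("Ω")

def will_do_omega(program):
    """Hypothetical general predictor: True iff running `program` would
    eventually perform Ω. The argument shows no such algorithm can exist."""
    raise NotImplementedError

def make_wrapper(program, x):
    """Build a program that performs Ω exactly when `program(x)` halts."""
    def wrapper():
        program(x)   # may run forever
        do_omega()   # reached only if program(x) halted
    return wrapper

def halts(program, x):
    """If the predictor existed, it would decide the halting problem."""
    return will_do_omega(make_wrapper(program, x))

# Since the halting problem is undecidable, `will_do_omega` cannot be
# implemented for arbitrary programs: there is no general algorithm that
# decides whether a system will ever perform a given action.
```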
On a more general note, where complex computational procedures are involved, unpredictability seems to be the default. The relevance of this fact has been questioned on the grounds that, even though there can be no general prediction procedure, there may still be particular AI systems that are perfectly predictable. Compare: there cannot be a general algorithm for deciding whether a given system will behave like a calculator on some input, for there is a possible system that runs an arbitrary Turing machine and behaves like a calculator if it halts. But this does not show that one cannot build reliable calculators!
Still, given that predictability is not guaranteed, the burden of proof shifts to AI researchers. To guarantee that some AI system is safe, one would have to show at least that it is not computationally universal, in the sense that, in the course of its operation, it will not implement or “make use of” arbitrary Turing machines. For if it does, computability theory tells us that its behavior will escape our predictive grasp. It is an open question whether the capacities of a system can be flexible enough for it to be properly generally intelligent and still constrained enough for it to be predictable to its creators.
Fire
The third and final thesis is that we should be willing to contemplate the possibility that AGI is inherently unsafe and should never be developed.
The point of our argument is not to state an actuality but to raise a possibility. We currently do not know whether AGI can be made safe. It is therefore important that we do not think about AI safety solely in terms of the limiting paradigm of AI alignment, in which AI safety is synonymous with thinking about how to engineer safety into the systems themselves.
What would a more inclusive paradigm look like? In my view, the taming of fire offers a useful analogy. The taming of fire was certainly one of the most important achievements in the history of mankind. However, just because you can make fire does not mean it is a good idea to make as much of it as possible. And making fire safe does not mean engineering safety into it from the get-go, but rather figuring out where to use it for tightly constrained tasks without burning the house down.