TL;DR

The alignment problem is less fundamental for AI safety than the problem of predicting the actions of AI systems, especially if they are more intelligent than oneself. We dub this the prediction problem.
The prediction problem may be insoluble.
If the prediction problem is insoluble, predicting the results of the advent of AGI would be impossible. If this is so, we have to radically rethink our approach to AI safety and its relation to politics.
Introduction
It is plausible that AI research may result in the creation of systems that are much more intelligent than their human creators. Starting from this assumption, it is imperative that we ensure that such a creation would be, on the whole, an event that is beneficial to humanity. Research into how this may be ensured is generally done under the label of AI safety. Our aim in this paper is threefold. First, we argue that the most fundamental problem of AI safety research is that of predicting the actions of AI systems, particularly ones that are more intelligent than us. Secondly, we argue that there are some reasons to suspect that this problem may be insoluble. Finally, we sketch what AI safety research should look like if we were forced to the conclusion that AGI is inherently unsafe.
The most fundamental problem in AI safety lies not in choosing the right kind of goals that align with ours, or in how to control an agent more intelligent than oneself, although these are indeed important problems and formidable challenges. We think that the most fundamental problem in AI safety research is instead that of predicting the behavior of AI systems, which becomes more difficult and pressing the more intelligent these systems become. On the most abstract level, all AI safety research is about trying to make AI systems that produce outcomes that belong to some target set Ω. The abstract nature of what we are trying to show here does not require us to specify Ω. It may be that Ω is the set of outcomes where humanity does not end up extinct, resulting in minimally safe AI. It may be that Ω is the set of outcomes where all humans end up in a state of perpetual utopia, resulting in maximally beneficial AI. Or it may be that Ω is defined more narrowly, as when we require that the AI system in question does not lie to its operators, resulting in truthful AI. Evidently, to ensure that some AI system will achieve or work towards goals in Ω, we first will have to predict whether its actions will achieve states in Ω or at least make such states more likely. We will call any method for predicting whether the outcomes of running some AI system are in Ω a prediction algorithm for that AI system and for that particular target set. We will further call the problem of creating viable prediction algorithms for future AI systems the prediction problem. We will later argue in more detail that the prediction problem is more fundamental than the problem of AI value alignment.
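To fix ideas, one minimal way of formalizing this, in our own notation and purely for illustration: write O(A, e) for the outcomes of running AI system A in environment (or on input) e. A prediction algorithm for A and Ω is then any effective procedure that computes

\[
P_{A,\Omega}(e) =
\begin{cases}
1 & \text{if } O(A, e) \subseteq \Omega,\\
0 & \text{otherwise},
\end{cases}
\]

or some probabilistic relaxation of it. The prediction problem, in these terms, is the problem of constructing such procedures for future AI systems.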
To cut straight to the chase, we fear that there is a paradox implicit in the prediction problem. Here is our first pass at this paradox. We will later show in more detail why the paradox cannot be easily avoided. The paradox, we suppose, arises because there is an inherent connection between intelligence and prediction. Generally, intelligence is closely tied to predictive ability. Imagine two systems S1 and S2. Predicting S1 will get increasingly difficult as the system becomes more intelligent, since with increasing intelligence, S1 itself makes use of ever more efficient methods of prediction. It seems that intelligence would then be precisely the kind of property that makes a system hard to predict. Or put another way, if the actions of system S1 are easily predictable by a supposedly “dumber” system S2, then system S1 arguably is not so intelligent after all. The idea is thus that there is a trade-off between the intelligence of a system and how predictable it is. Should it turn out that we can make this intuition more precise, and if we are right to claim that the problem of prediction is more fundamental than the problem of value alignment, then most current approaches to AI safety will be in doubt.
We will proceed as follows. Section two will discuss the connection between prediction and AI safety in more detail. Here, we will show that the prediction problem is the most fundamental challenge in AI safety research. Section three will briefly discuss some previous work on the prediction problem. Section four will develop what we call the self-prediction argument against the possibility of viable prediction algorithms and thus for the insolubility of the prediction problem. Section five will discuss possible strategies for circumventing the self-prediction argument. Section six will discuss what AI safety research should look like in a world where the prediction problem is indeed insoluble. Section seven ends with some reflections on the light our discussion sheds on the second question asked by the Open Philanthropy AI Worldviews Contest.
The Centrality of Prediction
The most widely discussed problem in the AI safety literature is the alignment problem. Here we try to show that solutions to the alignment problem require that one first solve the prediction problem, the problem of predicting the behavior of AI systems, especially ones that are more intelligent than us. The alignment problem is the problem of building AI systems whose values align with our own (Bostrom 2014). Importantly, it seems that any solution to the alignment problem presupposes a solution to the prediction problem, for knowing whether a system is aligned with our values or not entails that one knows what kinds of states it is going to work towards, what it is going to do.
The centrality of prediction in AI safety research can be obscured by an intriguing but fallacious argument. According to this argument, we do not need to be able to make predictions about the behavior of our AI systems because it is sufficient to know that their goals, or their utility functions, are aligned with our own. For instance, I do not need to be able to predict the exact next move that AlphaGo is going to make. It is sufficient to know that its next move is going to bring it closer to winning the game. Knowledge about goals or utility functions should then be sufficient to know that some AI system is in fact safe. No detailed prediction of the system’s behavior is necessary.
While it is correct that ensuring safety does not require that we predict every output of an AI system, which would defeat its purpose, we do have to predict whether its behavior tends to bring about states in Ω. We are still faced with the prediction problem. Solutions to the prediction problem that make use of utility functions are based on a confusion around the ontology of these functions. There are broadly three views on the ontology of utility functions and the above strategy turns out to be question-begging for each of them. First, one may hold that the utility function of a system is a generalization about the system’s actual behavior. Such a view is taken by popular active inference theories of cognition and behavior. Here, the utility function of a system is ultimately defined by that system’s dynamical attractor. In effect, the utility function of a system is defined by the kinds of states the system tends to inhabit (Parr, Pezzulo, and Friston 2022): Rabbits desire carrots in the sense that they tend to eat them. Evidently, on this view, presupposing knowledge of the utility function of a system to circumvent the prediction problem would be circular. For on this view, facts about whether some system wants to hurt humans are predicated on whether the system actually does tend to hurt humans.
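As a toy illustration of this behavioral reading (the code and names are entirely our own, hypothetical sketch): on this view, extracting a system’s “utility function” is just a matter of summarizing which states it actually tends to occupy, which is exactly the behavioral information a prediction algorithm was supposed to deliver in the first place.

```python
# Toy illustration of the behavioral reading of utility functions (hypothetical, our own).
from collections import Counter

def revealed_utility(observed_trajectory):
    """Summarize a 'utility function' as normalized state-occupancy frequencies."""
    counts = Counter(observed_trajectory)
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()}

# The circularity: computing this already requires the system's actual behavior.
trajectory = ["idle", "eat_carrot", "eat_carrot", "sleep", "eat_carrot"]
print(revealed_utility(trajectory))  # {'idle': 0.2, 'eat_carrot': 0.6, 'sleep': 0.2}
```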
A second view on the ontology of utility functions is that they are learned features that AI systems acquire in the process of training. The utility function of the AI evolves in its training process, where it minimizes its loss function in an iterative process in which its performance is judged and optimized with respect to training data. Suppose we train a robot to find the red square within a labyrinth. Once the robot becomes good at the task, we may suppose that it has learned that finding the red square by moving around is good, i.e., is assigned high utility. But how can we know whether this is actually what the system has learned? It may turn out that, once we release the robot into the real world and give it access to red paint, it starts painting every centimeter of the ground red because it hasn’t learned to solve labyrinths but to prefer red as a color of the ground. We have no guarantee that the utility function a system learns during training is in fact the one we intended. And again, figuring this out seems to already require that we are able to predict what the system is going to do. Indeed, the problem of ensuring that the learned utility function of an AI aligns with the intended utility function is well-recognized and has come to be known as the inner alignment problem (Hubinger et al. 2019). As there currently seems to be no agreed-upon solution to this problem, one cannot circumvent the prediction problem by appealing to utility functions, conceived of as something obtained during learning.
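The point can be illustrated with a deliberately simple toy example (the states and reward functions below are hypothetical and ours, not a description of any real system): the intended objective and the learned proxy agree on every state the robot sees during training and only come apart at deployment.

```python
# Hypothetical toy illustration of inner misalignment / goal misgeneralization.

def intended_reward(state):
    # What we meant: reward reaching the red goal square of the labyrinth.
    return 1.0 if state["on_goal_square"] else 0.0

def learned_proxy_reward(state):
    # What the robot may actually have learned: reward standing on red ground.
    return 1.0 if state["ground_color"] == "red" else 0.0

# During training the only red ground IS the goal square, so the two agree:
train_state = {"on_goal_square": True, "ground_color": "red"}
assert intended_reward(train_state) == learned_proxy_reward(train_state)

# At deployment the robot finds red paint; the proxy now rewards painting the floor:
deploy_state = {"on_goal_square": False, "ground_color": "red"}
print(intended_reward(deploy_state), learned_proxy_reward(deploy_state))  # 0.0 1.0
```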
A third and final view on utility functions conceives them as hard-wired features of AI systems that are engineered into them. Now we grant that if we knew the hard-wired utility function of some AI system, this would solve the prediction problem. However, as far as we can see, hard-wired utility functions of the kind required are a technological fantasy. The systems most promising in the quest for AGI are trained by minimizing some known but uninteresting (as far as prediction is concerned) error function defined over its input and output. They do not possess hard-coded utility functions over and above these error functions. And we see no reason to suppose that progress in AI will result in systems that do any time soon.
We conclude that presupposing significant knowledge of utility functions in solving the prediction problem is either circular or based on entirely stipulative technological innovations. Were we able, be it by learning or hard coding, to reliably specify the utility function of an AI system, this would bring us close to solving the prediction problem. But for the moment, it seems to us that the prediction problem needs to be solved in order to make any headway towards AI safety. In particular, it is more fundamental than the alignment problem because solving the latter presupposes solving the former. In the following section, we briefly discuss some previous work on the prediction problem.
The Prediction Problem and Computability
We are not the first to suspect that the prediction problem may present a deep challenge to the efforts of producing safe AI. Alfonseca et al. (2021) have argued that there cannot be a single algorithm that is capable of deciding, for any specified algorithm and any input, whether it is safe to run for its human creators. Let’s call such a putative algorithm a general (safety-)prediction algorithm. Note that Alfonseca et al.’s focus on safety rather than some other target set Ω is unimportant to their argument. The reason there cannot be a general safety-prediction algorithm is grounded in computability theory, and the halting problem specifically: a well-known result of computability theory says that there cannot be an algorithm that decides, for any given algorithm and input, whether the algorithm will eventually halt. From this, it is easy to deduce the mentioned result. For there may be an AI algorithm that, for every given input, feeds this input into a Turing machine and, if the machine halts, starts hurting humans. A safety algorithm that applies to this AI system would be capable of solving the halting problem which, as mentioned, cannot be done. Thus there cannot be a general safety-prediction algorithm.[1]
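The reduction can be rendered schematically in a few lines of deliberately toy Python. All names here are placeholders of our own; `general_safety_check` is the hypothetical procedure that the argument shows cannot exist.

```python
# Schematic rendering of the reduction; all names are illustrative placeholders.

def adversarial_ai(program, program_input):
    """An 'AI' that is harmful exactly if `program` halts on `program_input`."""
    program(program_input)      # may loop forever
    print("harming humans")     # stand-in for harmful behavior; reached only if the call above halts

def general_safety_check(ai_system, arguments):
    """Hypothetical general safety-prediction algorithm, assumed for contradiction."""
    raise NotImplementedError("by the argument above, no such algorithm can exist")

def decide_halting(program, program_input):
    # If general_safety_check existed, it would decide the halting problem,
    # since adversarial_ai is unsafe on (program, program_input) iff program halts on it.
    return not general_safety_check(adversarial_ai, (program, program_input))
```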
The main weakness of the argument provided by Alfonseca et al. lies in its generality of scope. No particular assumptions are made about the algorithm whose safety is to be predicted, except that intelligent systems are able to implement (or “use”) Turing machines. The general safety-prediction algorithm is impossible because it is impossible to have a procedure that decides for every program whether it halts, and, as set up in the paper, thereby whether it hurts humans. Less ambitious procedures remain possible. If the provided argument did indeed block research into the predictability of AI systems, then parallel arguments would rule out the predictability of all kinds of software. For instance, there is no general method for deciding whether any given program will behave like a calculator on any given input, because we can construct a program that behaves like a calculator on an input if some specified Turing machine halts and misbehaves otherwise. This impossibility result hardly infringes on the development of reliable calculators. Or on the development of reliable and predictable software in general, for that matter.[2]
So while this impossibility result entails that there can be no general safety algorithm that determines the safety of all AI systems, it does not follow that making AI safe is impossible. For there may still be a particular safety-prediction algorithm that may be applied to many, but not all, algorithms and thereby AI systems. We only need to require that an AI system’s harmful behavior does not depend on running arbitrary Turing machines on arbitrary input, and the result provided by Alfonseca et al. no longer applies to this restricted class of AIs (Sevilla and Burden 2021).
Still, the argument of Alfonseca et al. is not without consequences. The burden of proof, in the case of AI safety, is on those who hold that some system is safe. The argument under discussion shows, however, that strictly speaking this can only be done for AI systems that are not able to implement every Turing machine. It is a plausible feature of highly evolved intelligence, as we usually conceive of it, that it entails the capacity to simulate every kind of Turing machine (at least when the environment can be used for data storage). Notably, this is true for LLMs (Schuurmans 2023). It seems that, where complex computational processes are concerned, unpredictability is the default. Thus, it is the burden of AI engineers and AI safety researchers to show that there are reliable methods of building highly intelligent systems that are not computationally universal in the relevant way. This is a requirement for safety and an important challenge, for which we currently lack both solutions and promising strategies. This is what makes the case of AI systems different from, say, calculators.
That being said, we believe that, given some natural assumptions about the nature of intelligence, a more general, though less rigorous, argument for the insolubility of the prediction problem can be made. What we will now try to establish is that not only can there be no general prediction algorithm, but there also cannot be a specialized prediction algorithm that decides for some particular highly intelligent system whether its behavior will fall within some target set Ω.
The Self-Prediction Argument
It seems to us that an essential feature of intelligence is the ability to engage in what we call deliberation. Deliberation is the process of making up one’s mind on some issue, be it theoretical or practical. Abstractly, this is a way of processing input to arrive at either different behavioral outputs or doxastic states (beliefs). As deliberation alters the probabilities of different behaviors, it also alters the probabilities of behaviors that tend to result in outcomes outside of Ω. We may think of this type of processing as weighing reasons for or against some action or belief. Any future AI system that truly deserves the label of intelligence will be able to deliberate on a huge number of such theoretical and practical issues. For the moment we will assume that a viable prediction algorithm would have to be a relatively simple procedure that predicts the probability of some system’s behavior bringing about results outside of Ω.
So far these are rather innocent assumptions. The more substantial assumption of our argument is that no intelligent system can predict the results of its own deliberation with certainty. We call this the self-prediction assumption. It is important here that “to predict” the results of a deliberation means to know the results in advance, before engaging in the deliberation itself. The intuition here is that it is necessary to go through the relevant deliberative steps to know the result of the deliberation. The results depend irreducibly on the relevant weighing of reasons. Even when one has some clue about what one will do or think after making up one’s mind about some issue, before one has actually engaged with the topic in detail, with the reasons for and against particular beliefs, it is always possible that, after deliberation, one’s prior assessment turns out to be incomplete or false.
But now there is an evident problem: if there exists a safety-prediction algorithm for some AI system, the AI system will itself be able to use that algorithm. Thus the system will be able to predict with certainty the results of its deliberations that alter the probabilities of its behaviors towards outcomes outside of Ω. However, we just argued that predicting one’s own deliberation is impossible in general. We can thus formulate the following simple argument (a loose formal rendering follows the premises):
(1) No intelligent system can predict the results of its deliberation with certainty.
(2) If there is a safety-prediction algorithm for a sufficiently intelligent agent, then the agent can use it to predict with certainty the results of its deliberations which alter the probabilities of its behavior towards actions with outcomes outside Ω.
(3) Therefore, there can be no such safety-prediction algorithm.
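To display the logical skeleton (in our own notation, purely for illustration), write Int(S) for “S is a sufficiently intelligent system”, D(S) for “S can predict the results of its own deliberation with certainty”, and P(S) for “there exists a safety-prediction algorithm for S”. The argument then has the valid form

\[
\forall S\,\big(\mathrm{Int}(S) \rightarrow \neg D(S)\big),\qquad
\forall S\,\big(\mathrm{Int}(S) \wedge P(S) \rightarrow D(S)\big)
\;\;\vdash\;\;
\forall S\,\big(\mathrm{Int}(S) \rightarrow \neg P(S)\big).
\]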
The upshot of the argument is that intelligence is unlike other computational activities, like that of a calculator. Deliberation, which we argued we should expect in intelligent agents in general, is essentially something that irreducibly depends on the weighing of reasons. The reasons of a highly intelligent entity about whether or not to engage in behaviors that we see as harmful may be complex. But this conflicts with the existence of a safety algorithm, which we argued has to be a relatively simple procedure. If there were such a simple procedure, however, it would mean that an AI system could cut its own deliberation short by applying its own safety algorithm. And this seems to conflict with the nature of deliberation. Thus such a prediction algorithm cannot exist. In a nutshell, the argument suggests that complex thought cannot be predicted by simple means. If this were correct, then AI research likely to result in anything resembling AGI would be intrinsically unpredictable and thus intrinsically unsafe.
Discussion of the Argument
Our discussion of the self-prediction argument will consist in a discussion of its premises. Maybe the most interesting question is whether the first premise, the self-prediction assumption, is correct. We do not have an argument from first principles on offer here. Our defense will thus assume that the principle is indeed intuitively plausible and our discussion will consist in clearing up two possible misconceptions.
Firstly, resistance to the self-prediction assumption may stem from the impression that it puts tight limits on the predictability of deliberation. But, one may hold, as naturalists we should believe that deliberation is no more unpredictable than any other kind of process, given that we can figure out the mechanism behind it. Thus, given knowledge of the relevant mechanisms, deliberation should be rendered predictable, even for the agent that is deliberating.
In fact, the self-prediction assumption does not conflict with the fact that intelligent behavior is predictable. For instance, an external agent may well be able to predict in principle whether I will decide to have pasta or ratatouille for dinner on the basis of knowing my brain states and dynamics in sufficient detail.[3] But this will not help me to engage in self-prediction. For the process of figuring out how I will decide on some issue on the basis of my brain states will be much more computationally demanding than the deliberative activity itself. This harkens back to the view that safety-prediction algorithms need to be simpler than the algorithm they apply to. The implausibility of self-prediction is precisely grounded in the idea that deliberation cannot be cut short, even if it is inherently deterministic.
Secondly, one may hold that the self-prediction assumption seems plausible precisely because we possess human minds and not superhuman artificial ones, i.e. it may be anthropocentric. If the impossibility of self-prediction is a feature merely of human intelligence rather than one intrinsic to the structure of intelligence itself, then our argument would indeed be invalidated. This raises the question of how far we can generalize from the human case to the properties of “mind space” in general. While there are some features of human minds that are plausibly general features of minds tout court (the capacity for Bayesian inference, the capacity to self-model, the possession of goals, etc.), there are also features that plausibly are contingently human (the separation into motivational and higher functions, the predominance of three-dimensional spatial representations, massively parallel processing, etc.).
Evaluating whether the impossibility of self-prediction belongs to the former or the latter camp is hard. However, we think that the plausibility of the self-prediction assumption is likely grounded in the logical structure of deliberation and reasoning rather than the structure of human psychology. In order to arrive at a conclusion about whether or not action A should be carried out, an intelligent agent will weigh reasons for or against A. Whether or not A is done will irreducibly depend on these reasons in the sense that there is no way of circumventing the weighing of pros and cons in order to arrive at a result. Crucially, at no point does this reasoning appeal to any particularly human characteristics. Rather, it appeals to the fact that intelligent beings are moved by reasons. It is possible that the whole apparatus of “reasons” and “deliberation” is built on the contingent structure of human psychology. But we are not willing to bet the future of terrestrial life on such a radical anti-rationalism. In summary, the self-prediction assumption is neither in conflict with determinism and naturalism nor is it anthropocentric.
Premise two, the assumption that the existence of a prediction algorithm would enable self-prediction, can be defended by the following simple argument.
(2.1) A prediction algorithm for an AI system can be used to predict the results of its deliberative activity that changes the probabilities of its behavior towards actions with outcomes outside Ω.
(2.2) A sufficiently intelligent agent can make use of any computational mechanism sufficiently simpler than itself.
(2.3) A safety-prediction algorithm has to be simpler than the intelligent system to which it applies.
(2) If there is a safety-prediction algorithm for a sufficiently intelligent agent, then the agent can use it to predict with certainty the results of its deliberations which alter the probabilities of its behavior towards actions with outcomes outside Ω.
Our discussion of premise two will consist of a discussion of the three sub-premises. The first sub-premise may be attacked using a probabilistic strategy. We defined a safety algorithm as whatever kind of prediction mechanism is powerful enough to predict that an AI system is safe. It may be argued that such safety does not require there to be logical certainty about the future behavior of the relevant system. On this view, it would be sufficient to know that an AI system with high probability will not e.g. hurt humans.
Accepting this contention results in a weakened form of the second premise according to which a sufficiently intelligent system could predict its own decision regarding whether to hurt humans with high probability. In our view, this version of the second premise is too weak to sustain the self-prediction argument. For to reach the original conclusion one would have to replace premise one with a suitably weakened thesis. Such a weakened premise would hold that no intelligent system can predict the outcomes of its own deliberation with high probability. But this weakened self-prediction assumption strikes us as implausible, for humans regularly engage in such tasks of probabilistic self-prediction. I can predict with relatively high confidence, for instance, that I will not have ice cream for dinner even without engaging in detailed reflection on this issue. The upshot of the self-prediction argument is therefore that provably safe, provably beneficial, provably truthful AI, and so on seems impossible, while the argument remains silent on safety or beneficence prediction methods that give probabilistic outputs.
Still, not all is well. First and foremost, as the stakes are high, an appeal to probabilistic arguments in the context of AI safety is inherently problematic. For instance, showing that it is merely plausible that some general AI is safe would be insufficient to justify its deployment. We suggest that any suitable probabilistic prediction method has to result in predictions that provably fall within some predefined error margin. As far as we can see, there are no reliable methods for producing predictions of this kind for the kinds of systems that are likely to exhibit general intelligence.
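One way of making the error-margin requirement precise, in our own notation and purely for illustration: let p be the probability that the outcomes of running the system lie in Ω, and let \(\hat{p}\) be the estimate produced by the prediction method. For predefined margins ε, δ > 0 we would demand a provable guarantee of the form

\[
\Pr\big(\,|\hat{p} - p| \le \varepsilon\,\big) \;\ge\; 1 - \delta .
\]

Our claim is that, for the kinds of systems that are candidates for general intelligence, no known method delivers guarantees of this form.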
Secondly, there are independent reasons to suspect that there are fundamental obstacles to the probabilistic prediction of systems more intelligent than oneself. Any system trying to control an AI system without relying on some provably sound prediction algorithm will fall prey to the good regulator theorem. This theorem says that every system that tries to control some other system that is subject to random perturbations will have to be structurally isomorphic to that system (Conant and Ashby 1970). This is sometimes expressed as the proposition that every good regulator of some system must possess a model of that system. In the context of AI safety, such a controlling system would defeat its purpose for two reasons. First, it would need to be similarly complex to or more complex than the AI which is to be investigated. This is likely somewhere between immensely costly and infeasible in this context. But even more importantly, since the controlling system is structurally isomorphic to the AI which we want to investigate, it is potentially just as much in need of prediction as the original AI system we are trying to make safe. We therefore could not use such a safety test without recreating the very risks we are trying to avoid in the first place. While further discussion is necessary here, we do think that the good regulator theorem makes a good initial case against the existence of viable probabilistic prediction algorithms. In general, we think that the good regulator theorem deserves the close attention of AI safety researchers. In summary, the probabilistic strategy would be viable only if there were reliable methods for estimating error margins and if there were a convincing argument why the strategy does not fall prey to the good regulator theorem.
A prima facie reasonable strategy for circumventing the oddities of self-prediction while still retaining the possibility of prediction or control is to insist that AI systems could be designed in such a way as to be constitutively incapable of applying their own safety algorithm. This would be a way to invalidate the second sub-premise. Such a design task may be achieved either by prohibiting the AI system from running the safety algorithm itself or by prohibiting the AI system from knowing its own source code, which the safety algorithm takes as input. Thus there is some subset of the set of conceivable AIs to which the self-prediction argument cannot be applied. Unfortunately, this strategy seems like an ad hoc answer rather than a solution, since we have no clue how any of these restrictions could be achieved practically. We have no idea how one may even start to build an agent capable of the flexible learning required for general intelligence that is constitutively incapable of learning and doing some relatively simple things. While this may be a mere limitation of current AI, it is also possible that it will turn out to be a general limitation.[4] At any rate, this restrictionist strategy does not seem promising.
Finally, one may challenge the third sub-premise by holding that there may be viable safety algorithms that are more computationally complex than the intelligent system to which they apply. While the implementation of such an algorithm may then seriously increase the computational demands of running an AI system, the idea is certainly not out of the question. The obvious problem for this approach is that one needs to ensure that the relevant prediction algorithm is not itself a candidate for general intelligence. Otherwise, we would face a vicious regress of prediction systems. As the existence of such a “stupid but complex” prediction system is purely stipulative we think that the third sub-premise is plausible.
We conclude that the self-prediction argument is rather solid. If one thinks that the self-prediction assumption is intuitively plausible then one has three options left for solving the prediction problem. The first of these is the probabilistic strategy, which tries to make probabilistic predictions about the AI system’s behavior. Absent any provably sound algorithmic approaches this strategy will have to find some way of making predictions with precise error margins and maybe also a way of circumventing the good regulator theorem, the discussion of which goes beyond the scope of our article. The second possible strategy for circumventing the self-prediction argument is the restrictionist strategy of building AIs that are incapable of self-prediction. The problem here is that it is far from clear whether this can be done. Finally, there may be a prediction algorithm that is more complex than the AI without being similarly intelligent itself, thereby avoiding a regress of the prediction problem. None of these are obviously promising.
In the introduction, we mentioned that we suspect that the self-prediction argument is merely an instance of the more fundamental fact that there is a trade-off between predictability and intelligence. If we are right, this would make it unlikely that any simple workaround is available here. In particular, it would make it unlikely that any of the three aforementioned strategies will bear fruit. Until a more rigorous argument or proof is on the table, we cannot, however, be certain of this point.
Importantly, the burden of proof lies on the side of those who want to employ AI systems likely to result in anything resembling AGI. If it turns out that our argument remains plausible under sufficient scrutiny, then it offers a decisive reason not to employ any such AI system, even if the argument cannot be strengthened into a formal proof.
Pessimism about AI Safety
If the issues we raised cannot suitably be addressed, then research on advanced AI will be inherently unsafe. Even if it turns out after deployment that advanced AI systems do not pose any threat to humanity, perhaps because it is discovered that there is some yet unrecognized connection between intelligence and moral virtue, a system whose harmlessness was not known before deployment was not safe in the sense required.[5] In the following, we will thus refer to AI research which we do not know to be safe as potentially dangerous, i.e. the relevant sense of “potentially” is epistemic. This raises a crucial question: What if there will always remain strong reasons to doubt the safety of deploying candidates for AGI, so that AGI will always be a potentially dangerous technology? What if the deployment of such systems will always involve a gamble? We call this view pessimism about safe AI.[6] Our treatment here will be superficial, as this topic is well worth a book-length treatment, or rather many book-length treatments.
First, one may ask whether one should restrict AI research at all, even if one did not know it to be safe. There are a number of possible motivations here. First, one may think that the evolution of technology is essentially a natural process that cannot be effectively controlled by policymakers. But this viewpoint is myopic. Nuclear weapons are an interesting case in point here. The development of nuclear weapons is pretty much impossible for private actors, and politically costly for most state actors due to the perceived dangers of nuclear proliferation. While we do not live in a world free of nuclear weapons, it is safe to say that nuclear proliferation has been greatly slowed by sheer political will. There is no reason to think that something similar could not be done in the case of potentially dangerous AI technology. Certainly, such policies would not work perfectly, but it is equally certain that they could greatly decrease the risk of the development of dangerous AI.
A second reason for not infringing on dangerous AI research may be the view that progress always entails a certain risk. However, as the stakes are existential, we do not consider this position worthy of discussion. If AI technology bears the risk of ending human life, then its development motivated by some vague intuition of progress is plainly mad.
In the pessimist scenario, it seems that the goals of AI safety research are bound to change substantially. Rather than investigating how AI may be made safe, AI safety research should then focus on minimizing the risk that true AGI is ever achieved. This involves two challenges. First, the demarcation challenge is the challenge of distinguishing potentially dangerous (i.e. potentially resulting in AGI) from benign AI research. Second, the proliferation challenge is the challenge of suppressing AI technology that is deemed potentially dangerous.
Different schools of AI research will arrive at different views on the demarcation challenge. Hawkins (2021) has suggested that the problem-solving aspect of intelligence can effectively be disentangled from its motivational aspects.[7] On this view, virtually all AI research can be done in such a way that the chance of the emergence of a truly autonomous intelligent agent is minimal. At the other extreme, more cybernetically inclined approaches like active inference would suggest that any system for controlling “physiological” parameters may result in super-intelligent agency if the control process becomes sophisticated enough.[8]
In our view, it is unlikely that the demarcation challenge will be solved in any straightforward way. Rather, we suspect that there will be some degree of plausible reasoning involved in deciding whether some approach to AI is benign or not. Still, the development of general guidelines and principles is a central goal of AI safety research on the pessimist’s agenda.
We will only offer some short comments on the proliferation challenge, as it is primarily a political issue. Here too, there will be two broadly different strategies: a liberal path, creating disincentives only for those AI technologies deemed most dangerous, and a path of strong regulation, banning all kinds of modestly dangerous research.
While one may think that strong regulation will always offer maximum security, this is not obvious. For the stronger the political measures, the greater the incentive to work against or circumvent them. On the other hand, even the liberal path, if it is to be effective at all, would have to involve massive regulation, presumably by state actors. For instance, considering the progress in computer technology, the personal computers of 2035 may have the computing power that only supercomputers achieve today. It may thus be feasible to implement a superhuman AI on such a machine. The logical consequence is that one would have to either put legal limits on the computational power of personal computers, implement some kind of surveillance state, or use some hardware-based enforcement that prevents the implementation of potentially dangerous AI software. The latter strategy would not be without precedent, though on a different scale. Enforcing limits on the usage of dangerous technology is well-known in the world of bio-tech. For instance, there is the SecureDNA project, which aims to screen DNA synthesis orders for sequences that are too dangerous to be publicly available. As providers of DNA synthesis screen for dangerous requests today, we can imagine a future where providers of processing power screen for potentially dangerous AI applications.
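To make the analogy concrete, here is a purely hypothetical sketch of what provider-side screening might look like; the request fields, threshold, and categories are invented for illustration and do not describe any existing system.

```python
# Purely hypothetical sketch of provider-side compute screening; all values are invented.
from dataclasses import dataclass

@dataclass
class ComputeRequest:
    customer: str
    total_training_flop: float   # estimated compute for the requested training run
    model_class: str             # e.g. "narrow_vision", "frontier_language_model"

FLOP_THRESHOLD = 1e25            # invented cutoff above which extra review is required
RESTRICTED_CLASSES = {"frontier_language_model", "autonomous_agent"}

def screen(request: ComputeRequest) -> str:
    """Flag requests that exceed a compute budget or target restricted model classes."""
    if request.total_training_flop > FLOP_THRESHOLD:
        return "escalate: above compute threshold"
    if request.model_class in RESTRICTED_CLASSES:
        return "escalate: restricted model class"
    return "approve"

print(screen(ComputeRequest("lab-a", 3e23, "narrow_vision")))             # approve
print(screen(ComputeRequest("lab-b", 5e26, "frontier_language_model")))   # escalate
```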
We admit that none of the available political options sound particularly appealing. This is why, in our view, given the progress in AI capabilities and the state of current AI safety techniques, the question of how AI technology may be controlled effectively at minimal cost to liberty should be an active area of research.
All these strategies involve massive political challenges. Any ban on AI research would be useless unless agreed upon by all the powerful political actors, and even hardware approaches would require a substantial amount of control, coordination, and enforcement. In a pessimistic scenario, the worst thing that may happen is rapid progress of AI technology in the context of some kind of arms race, be the actors companies striving for market dominance, state actors in some kind of (hot or cold) war, or both. Collaboration on this issue between all the relevant actors, at national and international levels, would be imperative. On the bright side, the threat of malicious or misaligned AGI may be a sufficiently motivating force to bring about increased cooperation if it is sufficiently tangible. Unlike the rather abstract danger of global warming, the threat of a rogue AI can be intuitively grasped and already has a firm foothold in the public imagination. This is clearly shown by a recent poll in which 55% of the general public in the US reported being somewhat or very worried about existential risk from AI. This is one reason we see for expecting a political reaction.
A second reason for a constrained form of optimism is the following. A strong societal and political response against dangerous AI research after a moment of realization on the issue is likely, unless the further deployment and takeoff of AGI happens in a very specific way. In our view, one should expect at least a wake-up moment akin to the one we had with Covid, unless one believes in extremely fast takeoff speeds and thus does not expect many more misaligned AI models of increasing capability to be released before this takeoff. If that assumption does not hold, non-existential demonstrations of the dangers of unconstrained AI research are likely to occur, thereby making voters and policymakers aware of the issue. We are uncertain about how society would respond to such an event, but it seems likely that what is politically and socially possible would change as a result. Finally, we want to emphasize that, as soon as we leave behind the standard AI safety paradigm and take the pessimist scenario seriously, it quickly becomes evident that all solutions to the problems of AI will have to involve political action.
Some Conclusions
One of the tasks of the Open Philanthropy AI Worldviews Contest is to estimate the probability of doom scenarios due to loss of control over an AGI system, given the development of AGI by 2070. The strategy of our paper was to go meta on this question. Our reasoning suggests that any such estimate would be dubious since the behavior of systems that are more intelligent than us lies behind an epistemic event horizon. More abstractly, we want to suggest that the lead question is predicated on a paradigm of AI safety that constrains the field. According to this paradigm, the task of AI safety research consists primarily in finding technical solutions for aligning AI systems. As we argued, such an approach presupposes that the alignment problem is indeed solvable, which in turn presupposes that the prediction problem is solvable. But as we have shown, there are good reasons to be skeptical of this assumption.
Leaving this paradigm behind, AI safety researchers should start to seriously ask the question: What if AI is inherently unsafe? Once this question is on the table, it quickly becomes evident that appeals to technological optimism or the inevitability of technological progress simply will not do. So far the dominant paradigm has been that AGI is essentially unavoidable and only has to be built in the right way or controlled well enough. We want to consider a different paradigm. In this paradigm, intelligence may be a little bit like fire. Just because you have learned to make it does not mean it is wise to make as much of it as possible. And its responsible use does not consist in figuring out how to make it inherently safe, since that is impossible, but in figuring out how to employ it to achieve specific tasks without burning the house down.
We now want to be a little more specific about the implications of our considerations for the question of the probability of an AI doomsday scenario, given that we reach AGI. In light of what we have said, we have to abstain from assigning a probability to an existential catastrophe caused by AGI. We do, however, see our argument as making a case that such a catastrophe is more likely. Since we see the prediction problem as underlying the alignment problem and the prediction problem as potentially insoluble, the argument entails that current efforts to achieve AI alignment have relatively little impact on the probability of AI doomsday scenarios. Additionally, we think that the currently in vogue approach of reinforcement learning from human feedback (RLHF, and variations thereof) should not update one’s probability of an existential catastrophe caused by AGI. This is because RLHF and its variations precisely do not render the systems in question any more predictable. There is no reason to believe that the human feedback provided in the training process will generalize in the right manner to the point of deployment, making the approach essentially worthless as a response to the prediction problem and thereby the alignment problem.
Before concluding, we want to be absolutely clear about what our argument is not. It is not that since AI alignment is clearly doomed, there is no point in working on it. The argument should not at this point, and not in its current form, discourage any AI safety research. Neither the argument from computability, nor the self-prediction argument would justify this conclusion. The argument from computability simply does not entail that AI alignment research is doomed to failure, but merely that there are no solutions that apply to AI systems generally. The self-prediction argument relies on intuitive assumptions and might as such be wrong or misguided. But since these intuitions seem rather stable to us, we think that we should at the very least take their implications seriously. Our argument is also not that AI is inherently unsafe and we should thus implement an anti-technological global surveillance state that suppresses the progress of computer technology. Rather, our point is that while solving the prediction problem is necessary for solving the alignment problem, it has received little attention in AI safety work, and despite several recent expressions of pessimism on AI safety, the possibility that there may be no way to make AGI safe has rarely been discussed seriously. The question of how to reasonably address this possibility has at the same time been discussed even less. Sufficient time, money, and energy should be allocated to this task.
When we started writing this essay, we thought that our opinion, namely supporting strong political measures against the further increase of AI capabilities, was likely to be a fringe position. We see it as a positive sign that this position has become more widespread.
References
Alfonseca, Manuel et al. (2021). “Superintelligence Cannot Be Contained: Lessons from Computability Theory”. In: Journal of Artificial Intelligence Research 70, pp. 65–76.
Ashby, W. Ross (1947). “Principles of the Self-Organizing Dynamic System”. In: The Journal of General Psychology 37.2, pp. 125–128.
Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
– (2019). “The Vulnerable World Hypothesis”. In: Global Policy 10.4, pp. 455–476.
Conant, Roger C. and W. Ross Ashby (1970). “Every good regulator of a system must be a model of that system”. In: International Journal of Systems Science 1.2, pp. 89–97.
Hawkins, Jeff (2021). A Thousand Brains: A New Theory of Intelligence. Basic Books.
Hubinger, Evan et al. (2019). “Risks from Learned Optimization in Advanced Machine Learning Systems”. arXiv preprint, pp. 1–39.
Parr, Thomas, Giovanni Pezzulo, and Karl Friston (2022). Active Inference: The Free Energy Principle in Mind, Brain, and Behavior. MIT Press.
Schuurmans, Dale (2023). “Memory Augmented Large Language Models are Computationally Universal”. arXiv preprint.
Sevilla, Jaime and John Burden (2021). Response to ’Superintelligence cannot be contained: Lessons from Computability Theory’. https://www.cser.ac.uk/news/response-superintelligence-contained/. Accessed: 2023-01-04.
Soon, Chun Siong et al. (2008). “Unconscious determinants of free decisions in the human brain”. In: Nature Neuroscience 11.5, pp. 543–545.
Footnotes

Though note that programs that really behave as intended under all circumstances are very rare in practice, as evidenced by the immense difficulty of computer security.
Think of something like the Libet experiment, but perhaps using a brain scanner as in Soon et al. (2008), pushed to some sci-fi limit of the technology.
The challenge here is similar to the halting challenge mentioned above. Here too the idea is to restrict the usage of arbitrary Turing machines. However, the current problem seems even more difficult, as it does not merely require that we figure out how to make an intelligent system non-general, but non-general in some very specific sense.
Perhaps even more plausibly, one may speculate that there may be an intrinsic connection between intelligence and a certain form of nihilism in the sense that intelligent systems tend towards wireheading.
Living in a world where AI research is potentially dangerous would be a case in which we have an unknown but considerable probability of living in a vulnerable world due to AI technology, where a vulnerable world means that there is some technology that, if deployed, almost certainly devastates civilization by default, as discussed in Bostrom (2019).
This does not contradict the self-prediction argument. The self-prediction argument assumes intelligence to be action-oriented. But if the mentioned paradigm were correct, the intellectual capacities of minds may be separated from the action-oriented ones.
Pessimism about AI Safety
TL;DR
The alignment problem is less fundamental for AI safety than the problem of predicting actions of AI systems, especially if they are more intelligent than oneself. We dub this the prediction problem.
The prediction problem may be insoluble.
If the prediction problem is insoluble, predicting the results of the advent of AGI would be impossible. If this is so we have to radically rethink our approach to AI safety and its relation to politics.
Introduction
It is plausible that AI research may result in the creation of systems that are much more intelligent than their human creators. Starting from this assumption it is imperative that we ensure that such a creation would be, on the whole, an event that is beneficial to humanity. Research into how this may be ensured is generally done under the label of AI safety. Our aim in this paper is threefold. First, we argue that the most fundamental problem of AI safety research is that of predicting the actions of AI systems, particularly ones that are more intelligent than us. Secondly, we argue that there are some reasons to suspect that this problem may be insoluble. Finally, we sketch what AI safety research should look like if we were forced to the conclusion that AGI is inherently unsafe.
The most fundamental problem in AI safety lies not in choosing the right kind of goals that align with ours, or in how to control an agent more intelligent than oneself, although these are indeed important problems and formidable challenges. We think that the most fundamental problem in AI safety research is instead that of predicting the behavior of AI systems, which becomes more difficult and pressing the more intelligent these systems become. On the most abstract level, all AI safety research is about trying to make AI systems that produce outcomes that belong to some target set Ω. The abstract nature of what we are trying to show here does not require us to specify Ω. It may be that Ω is the set of outcomes where humanity does not end up extinct, resulting in minimally safe AI. It may be that Ω is the set of outcomes where all humans end up in a state of perpetual utopia, resulting in maximally beneficial AI. Or it may be that Ω is defined more narrowly, as when we require that the AI system in question does not lie to its operators, resulting in truthful AI. Evidently, to ensure that some AI system will achieve or work towards goals in Ω, we first will have to predict whether its actions will achieve states in Ω or at least make such states more likely. We will call any method for predicting whether the outcomes of running some AI system are in Ω a prediction algorithm for that AI system and for that particular target set. We will further call the problem of creating viable prediction algorithms for future AI systems the prediction problem. We will later argue in more detail that the prediction problem is more fundamental than the problem of AI value alignment.
To cut straight to the chase, we fear that there is a paradox implicit in the prediction problem. Here is our first pass at this paradox. We will later show in more detail why the paradox cannot be easily avoided. The paradox, we suppose, arises because there is an inherent connection between intelligence and prediction. Generally, intelligence is closely tied to predictive ability. Imagine two systems S1 and S2. Predicting S1 will get increasingly difficult as the system becomes more intelligent, since with increasing intelligence, S1 makes itself use of more efficient methods for prediction. It seems that intelligence would then be precisely the kind of property that makes a system hard to predict. Or put another way, if the actions of system S1 are easily predictable by a supposedly “dummer” system S2, then system S1 arguably is not so intelligent after all. The idea is thus that there is a trade-off between the intelligence of a system and how predictable it is. Should it turn out that we can make this intuition more precise, and if we are right to claim that the problem of prediction is more fundamental than the problem of value alignment, then most current approaches to AI safety will be in doubt.
We will proceed as follows. Section two will discuss the connection between prediction and AI safety in more detail. Here, we will show that the prediction problem is the most fundamental challenge in AI safety research. Section three will briefly discuss some previous work on the prediction problem. Section four will develop what we call the self-prediction argument against the possibility of viable prediction algorithms and thus the insolubility of the prediction problem. Section five will discuss possible strategies for circumventing the self-prediction argument. Section six will discuss how AI safety research should look like in a world where the prediction problem is indeed insoluble. Section seven ends with some reflections on the light our discussion sheds on the second questions asked by the Open Philanthropy AI Worldviews Contest.
The Centrality of Prediction
The most widely discussed problem of the AI safety discussed in the literature is the alignment problem. Here we try to show that solutions to the alignment problem require that one first solve the prediction problem, the problem of predicting the behavior of AI systems, especially ones that are more intelligent than us. The alignment problem is the problem of building AI systems whose values align with our own (Bostrom 2014). Importantly, it seems that any solution to the alignment problem presupposes a solution to the prediction problem, for knowing whether a system is aligned with our values or not entails that one knows what kinds of states it is going to work towards, what it is going to do.
The centrality of prediction in AI safety research can be obscured by an intriguing but fallacious argument. According to this argument, we do not need to be able to make predictions about the behavior of our AI systems because it is sufficient to know that their goals or, their utility functions, are aligned with our own. For instance, I do not need to be able to predict the exact next move that AlphaGo is going to do. It is sufficient to know that its next move is going to bring it closer to winning the game. Knowledge about goals or utility functions should then be sufficient to know that some AI system is in fact safe. No detailed prediction of the system’s behavior is necessary.
While it is correct that ensuring safety does not require that we predict every output of an AI system, which would defeat its purpose, we do have to predict whether its behavior tends to bring about states in Ω. We are still faced with the prediction problem. Solutions to the prediction problem that make use of utility functions are based on a confusion around the ontology of these functions. There are broadly three views on the ontology of utility functions and the above strategy turns out to be question-begging for each of them. First, one may hold that the utility function of a system is a generalization about the system’s actual behavior. Such a view is taken by popular active inference theories of cognition and behavior. Here, the utility function of a system is ultimately defined by that system’s dynamical attractor. In effect, the utility function of a system is defined by the kinds of states the systems tends to inhabit (Parr, Pezzulo, and Friston 2022): Rabbits desire carrots in the sense that they tend to eat them. Evidently, on this view, presupposing knowledge of the utility function of a system to circumvent the prediction problem would be circular. For on this view, facts about whether some system wants to hurt humans are predicated on whether the system actually does tend to hurt humans.
A second view on the ontology of utility functions is that they are learned features that AI systems acquire in the process of training. The utility function of the AI evolves in its training process where it minimizes its loss function in an iterative process in which its performance is judged and optimized with respect to training data. Suppose we train a robot to find the red square within a labyrinth. Once the robot becomes good at the task we may suppose that it has learned that finding the red square by moving around is good, i.e., is assigned high utility. But how can we know whether this is actually what the system has learned? It may turn out that, once we release the robot into the real world and give it access to red paint, it starts painting every centimeter of the ground red because it hasn’t learned to solve labyrinths but to prefer red as a color of the ground. We have no guarantee that the utility function a system learns during training is in fact the one we intended. And again, figuring this out seems to already require that we are able to predict what the system is going to do. Indeed, the problem of ensuring that the learned utility function of an AI aligns with intended utility function is well-recognized and has come to be known as the inner alignment problem (Hubinger et al. 2019). As there currently seems to be no agreed-upon solution to this problem, one cannot circumvent the prediction problem by appealing to utility functions, conceived of as something obtained during learning.
A third and final view on utility functions conceives them as hard-wired features of AI systems that are engineered into them. Now we grant that if we knew the hard-wired utility function of some AI system, this would solve the prediction problem. However, as far as we can see, hard-wired utility functions of the kind required are a technological fantasy. The systems most promising in the quest for AGI are trained by minimizing some known but uninteresting (as far as prediction is concerned) error function defined over its input and output. They do not possess hard-coded utility functions over and above these error functions. And we see no reason to suppose that progress in AI will result in systems that do any time soon.
We conclude that presupposing significant knowledge of utility functions in solving the prediction problem is either circular or based on entirely stipulative technological innovations. Were we able, be it by learning or hard coding, to reliably specify the utility function of an AI system, this would bring us close to solving the prediction problem. But for the moment, it seems to us that the prediction problem needs to be solved in order to make any headway towards AI safety. In particular, it is more fundamental than the alignment problem because solving the latter presupposes solving the former. In the following section, we briefly discuss some previous work on the prediction problem.
The Prediction Problem and Computability
We are not the first to suspect that the prediction problem may present a deep challenge to the effort of producing safe AI. Alfonseca et al. (2021) have argued that there cannot be a single algorithm that is capable of deciding, for any specified algorithm and any input, whether it is safe to run for its human creators. Let’s call such a putative algorithm a general (safety-)prediction algorithm. Note that Alfonseca et al.’s focus on safety rather than some other target set Ω is unimportant to their argument. The reason there cannot be a general safety-prediction algorithm is grounded in computability theory, and the halting problem specifically. The undecidability of the halting problem is a well-known result of computability theory: there cannot be an algorithm that decides, for any given algorithm and input, whether the algorithm will eventually halt. From this, the mentioned result is easy to deduce. For there may be an AI algorithm that, for every given input, feeds this input into a Turing machine and, if the machine halts, starts hurting humans. A safety algorithm that applies to this AI system would be capable of solving the halting problem, which, as mentioned, cannot be done. Thus there cannot be a general safety-prediction algorithm.[1]
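The reduction can be made concrete with a schematic sketch (our own illustration, not code from Alfonseca et al.; `program_source` is assumed to define a function `P`, and the "harmful" return value is a stand-in for arbitrary harmful behavior):

```python
# A wrapper whose harmful branch is reached exactly when the embedded program halts.
# A general safety predictor for `wrapped_ai` would therefore decide, for every
# (program_source, x), whether P halts on x -- i.e., it would solve the halting problem.

def wrapped_ai(program_source: str, x):
    namespace = {}
    exec(program_source, namespace)  # load the arbitrary program; assumed to define P
    namespace["P"](x)                # run P on x; this call may never return
    return "start hurting humans"    # reached if and only if P(x) halts
```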
The main weakness of the argument provided by Alfonseca et al. lies in its generality of scope. No particular assumptions are made about the algorithm whose safety is to be predicted, except that intelligent systems are able to implement (or “use”) Turing machines. The general safety-prediction algorithm is impossible because it is impossible to have a procedure that decides for every program whether it halts, and, as set up in the paper, thereby whether it hurts humans. Less ambitious procedures remain possible. If the argument did indeed block research into the predictability of AI systems, then parallel arguments would rule out the predictability of all kinds of software. For instance, there is no general method for deciding whether any given program will behave like a calculator on any given input, because we can construct a program that behaves like a calculator on an input if a given Turing machine halts on it, and misbehaves otherwise. This impossibility result hardly infringes on the development of reliable calculators, or on the development of reliable and predictable software in general for that matter.[2]
So while this impossibility result entails that there can be no general safety algorithm that determines the safety of all AI systems, it does not follow that making AI safe is impossible. For there may still be a particular safety-prediction algorithm that applies to many, but not all, algorithms and thereby AI systems. We only need to require that an AI system’s harmful behavior does not depend on running arbitrary Turing machines on arbitrary input, and the result provided by Alfonseca et al. does not apply to this restricted class of AIs (Sevilla and Burden 2021).
Still, the argument of Alfonseca et al. is not without consequences. The burden of proof, in the case of AI safety, is on those who hold that some system is safe. The argument under discussion shows, however, that strictly speaking this can only be done for AI systems that are not able to implement every Turing machine. It is a plausible feature of highly evolved intelligence, as we usually conceive of it, that it entails the capacity to simulate any Turing machine (at least when the environment can be used for data storage). Notably, this is true for LLMs (Schuurmans 2023). It seems that, where complex computational processes are concerned, unpredictability is the default. Thus, it is the burden of AI engineers and AI safety researchers to show that there are reliable methods of building highly intelligent systems that are not computationally universal in the relevant way. This is a requirement for safety and an important challenge, for which we currently lack both solutions and promising strategies. This is what makes the case of AI systems different from, say, calculators.
That being said, we believe that, given some natural assumptions about the nature of intelligence, a more general, though less rigorous, argument for the insolubility of the prediction problem can be made. What we will now try to establish is that not only can there be no general prediction algorithm, but there also cannot be a specialized prediction algorithm that decides for some particular highly intelligent system whether its behavior will fall within some target set Ω.
The Self-Prediction Argument
It seems to us that an essential feature of intelligence is the ability to engage in what we call deliberation. Deliberation is the process of making up one’s mind on some issue, be it theoretical or practical. Abstractly, this is a way of processing input to arrive at either different behavioral outputs or doxastic states (beliefs). As deliberation alters the probabilities of different behaviors, it also alters the probabilities of behaviors that tend to result in outcomes outside of Ω.
We may think of this type of processing as weighing reasons for or against some action or belief. Any future AI system that truly deserves the label of intelligence will be able to deliberate on a huge number of such theoretical and practical issues. For the moment we will assume that a viable prediction algorithm would have to be a relatively simple procedure that predicts the probability of some system’s behavior bringing about results outside of Ω.
So far these are rather innocent assumptions. The more substantial assumption of our argument is that no intelligent system can predict the results of its own deliberation with certainty. We call this the self-prediction assumption. It is important here that “to predict” the results of a deliberation means to know the results in advance, before engaging in the deliberation itself. The intuition here is that it is necessary to go through the relevant deliberative steps to know the result of the deliberation. The results depend irreducibly on the relevant weighing of reasons. Even when one has some clue about what one will do or think after making up one’s mind about some issue, before one has actually engaged with the topic in detail, with the reasons for and against particular beliefs, it is always possible that, after deliberation, one’s prior assessment turns out to be incomplete or false.
But now there is an evident problem. For if there exists a safety-prediction algorithm for some AI system, the AI system will itself be able to use that algorithm. Thus the system will be able to predict with certainty the results of those of its deliberations that alter the probabilities of its behaviors towards outcomes outside Ω. However, we just argued that predicting one’s own deliberation is impossible in general. We can thus formulate the following simple argument:
(1) No intelligent system can predict the results of its deliberation with certainty.
(2) If there is a safety-prediction algorithm for a sufficiently intelligent agent, then the agent can use it to predict with certainty the results of its deliberations which alter the probabilities of its behavior towards actions with outcomes outside Ω.
(3) There can be no such safety-prediction algorithm.
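Schematically, and merely to display the argument’s validity, the argument has the following form, where the predicate letters are our own shorthand: I(a) for “a is a sufficiently intelligent agent”, P(a) for “a can predict the results of its own deliberation with certainty”, and S(a) for “there is a safety-prediction algorithm for a”:

```latex
\begin{align*}
  &(1)\quad \forall a\,\bigl(I(a) \rightarrow \neg P(a)\bigr)\\
  &(2)\quad \forall a\,\bigl(I(a) \wedge S(a) \rightarrow P(a)\bigr)\\
  &(3)\quad \therefore\ \forall a\,\bigl(I(a) \rightarrow \neg S(a)\bigr)
\end{align*}
```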
The upshot of the argument is that intelligence is unlike other computational activities, like that of a calculator. Deliberation, which we argued we should expect in intelligent agents in general, is essentially something that irreducibly depends on the weighing of reasons. The reasons of a highly intelligent entity for or against engaging in behaviors that we see as harmful may be complex. But this conflicts with the existence of a safety algorithm, which we argued has to be a relatively simple procedure. If there were such a simple procedure, however, an AI system could cut its own deliberation short by applying its own safety algorithm. And this seems to conflict with the nature of deliberation. Thus such a prediction algorithm cannot exist. In a nutshell, the argument suggests that complex thought cannot be predicted by simple means. If this were correct, then AI research likely to result in anything resembling AGI would be intrinsically unpredictable and thus intrinsically unsafe.
Discussion of the Argument
Our discussion of the self-prediction argument will proceed by examining its premises. Maybe the most interesting question is whether the first premise, the self-prediction assumption, is correct. We do not have an argument from first principles on offer here. Our defense will thus assume that the principle is indeed intuitively plausible, and our discussion will consist in clearing up two possible misconceptions.
Firstly, resistance to the self-prediction assumption may stem from the impression that it puts tight limits on the predictability of deliberation. But, one may hold, as naturalists we should believe that deliberation is no more unpredictable than any other kind of process, given that we can figure out the mechanism behind it. Thus, given knowledge of the relevant mechanisms, deliberation should be rendered predictable, even for the agent that is deliberating.
In fact, the self-prediction assumption does not conflict with the fact that intelligent behavior is predictable. For instance, an external agent may well be able to predict in principle whether I will decide to have pasta or ratatouille for dinner on the basis of knowing my brain states and dynamics in sufficient detail.[3] But this will not help me to engage in self-prediction. For the process of figuring out how I will decide on some issue on the basis of my brain states will be much more computationally demanding than the deliberative activity itself. This harks back to the view that safety-prediction algorithms need to be simpler than the algorithms to which they apply. The implausibility of self-prediction is precisely grounded in the idea that deliberation cannot be cut short, even if it is inherently deterministic.
Secondly, one may hold that the self-prediction assumption seems plausible precisely because we possess human minds and not superhuman artificial ones, i.e., it may be anthropocentric. If the impossibility of self-prediction were a feature merely of human intelligence rather than one intrinsic to the structure of intelligence itself, then our argument would indeed be invalidated. This raises the question of how far we can generalize from the human case to the properties of “mind space” in general. While there are some features of human minds that are plausibly general features of minds tout court (the capacity for Bayesian inference, the capacity to self-model, the possession of goals, etc.), there are also features that are plausibly contingently human (the separation into motivational and higher functions, the predominance of three-dimensional spatial representations, massive parallel processing, etc.).
Evaluating whether the impossibility of self-prediction belongs to the former or the latter camp is hard. However, we think that the plausibility of the self-prediction assumption is likely grounded in the logical structure of deliberation and reasoning rather than in the structure of human psychology. In order to arrive at a conclusion about whether or not action A should be carried out, an intelligent agent will weigh reasons for or against A. Whether or not A is done will irreducibly depend on these reasons in the sense that there is no way of circumventing the weighing of pros and cons in order to arrive at a result. Crucially, at no point does this reasoning appeal to any particularly human characteristics. Rather, it appeals to the fact that intelligent beings are moved by reasons. It is possible that the whole apparatus of “reasons” and “deliberation” is built on the contingent structure of human psychology. But we are not willing to bet the future of terrestrial life on such a radical anti-rationalism. In summary, the self-prediction assumption is neither in conflict with determinism and naturalism, nor is it anthropocentric.
Premise two, the assumption that the existence of a prediction algorithm would enable self-prediction, can be defended by the following simple argument.
(2.1) A prediction algorithm for an AI system can be used to predict the results of its deliberative activity that changes the probabilities of its behavior towards actions with outcomes outside Ω.
(2.2) A sufficiently intelligent agent can make use of any computational mechanism sufficiently simpler than itself.
(2.3) A safety-prediction algorithm has to be simpler than the intelligent system to which it applies.
(2) If there is a safety-prediction algorithm for a sufficiently intelligent agent, then the agent can use it to predict with certainty the results of its deliberations which alter the probabilities of its behavior towards actions with outcomes outside Ω.
Our discussion of premise two will consist of a discussion of the three sub-premises. The first sub-premise may be attacked using a probabilistic strategy. We defined a safety algorithm as whatever kind of prediction mechanism is powerful enough to predict that an AI system is safe. It may be argued that such safety does not require there to be logical certainty about the future behavior of the relevant system. On this view, it would be sufficient to know that an AI system with high probability will not e.g. hurt humans.
Accepting this contention results in a weakened form of the second premise, according to which a sufficiently intelligent system could predict its own decision regarding whether to hurt humans with high probability. In our view, this version of the second premise is too weak to sustain the self-prediction argument. For to reach the original conclusion, one would have to replace premise one with a suitably weakened thesis. Such a weakened premise would hold that no intelligent system can predict the outcomes of its own deliberation with high probability. But this weakened self-prediction assumption strikes us as implausible, for humans regularly engage in such tasks of probabilistic self-prediction. I can predict with relatively high confidence, for instance, that I will not have ice cream for dinner even without engaging in detailed reflection on the issue. The upshot of the self-prediction argument is therefore that provably safe, provably beneficial, provably truthful AI, and so on, seems impossible, while the argument remains silent on safety- or beneficence-prediction methods that give merely probabilistic outputs.
Still, not all is well. First and foremost, as the stakes are high, an appeal to probabilistic arguments in the context of AI safety is inherently problematic. For instance, showing that it is merely plausible that some general AI is safe would be insufficient to justify its deployment. We suggest that any suitable probabilistic prediction method has to result in predictions that provably fall within some predefined error margin. As far as we can see, there are no reliable methods for producing predictions of this kind for the kinds of systems that are likely to exhibit general intelligence.
Secondly, there are independent reasons to suspect that there are fundamental obstacles to the probabilistic prediction of systems more intelligent than oneself. Any system trying to control an AI system without relying on some provably sound prediction algorithm will fall prey to the good regulator theorem. This theorem says that any system that successfully controls some other system subject to random perturbations has to be structurally isomorphic to that system (Conant and Ashby 1970). This is sometimes expressed as the proposition that every good regulator of some system must possess a model of that system. In the context of AI safety, such a controlling system would defeat its purpose for two reasons. First, it would need to be similarly complex as, or more complex than, the AI which is to be investigated. This is likely somewhere between immensely costly and infeasible in this context. But even more importantly, since the controlling system is structurally isomorphic to the AI we want to investigate, it is potentially just as much in need of prediction as the original AI system we are trying to make safe. We therefore could not use such a safety test without recreating the very risks we are trying to avoid in the first place. While further discussion is necessary here, we do think that the good regulator theorem makes a good initial case against the existence of viable probabilistic prediction algorithms. In general, we think that the good regulator theorem deserves the close attention of AI safety researchers. In summary, the probabilistic strategy would be viable only if there were reliable methods for estimating error margins and a convincing argument as to why the strategy does not fall prey to the good regulator theorem.
A prima facie reasonable strategy for circumventing the oddities of self-prediction while retaining the possibility of prediction or control is to insist that AI systems could be designed in such a way as to be constitutively incapable of applying their own safety algorithm. This would be a way to invalidate the second sub-premise. Such a design task may be achieved either by preventing the AI system from running the safety algorithm itself or by preventing it from knowing its own source code, which the safety algorithm takes as input. Thus there is some subset of the set of conceivable AIs to which the self-prediction argument cannot be applied. Unfortunately, this strategy seems like an ad hoc answer rather than a solution, since we have no clue how any of these restrictions could be achieved practically. We have no idea how one might even start to build an agent capable of the flexible learning required for general intelligence that is constitutively incapable of learning and doing some relatively simple things. While this may be a mere limitation of current AI, it may also turn out to be a general limitation.[4] At any rate, this restrictionist strategy does not seem promising.
Finally, one may challenge the third sub-premise by holding that there may be viable safety algorithms that are more computationally complex than the intelligent system to which they apply. While the implementation of such an algorithm may seriously increase the computational demands of running an AI system, the idea is certainly not out of the question. The obvious problem for this approach is that one needs to ensure that the relevant prediction algorithm is not itself a candidate for general intelligence. Otherwise, we would face a vicious regress of prediction systems. As the existence of such a “stupid but complex” prediction system is purely stipulative, we think that the third sub-premise is plausible.
We conclude that the self-prediction argument is rather solid. If one thinks that the self-prediction assumption is intuitively plausible then one has three options left for solving the prediction problem. The first of these is the probabilistic strategy, which tries to make probabilistic predictions about the AI system’s behavior. Absent any provably sound algorithmic approaches this strategy will have to find some way of making predictions with precise error margins and maybe also a way of circumventing the good regulator theorem, the discussion of which goes beyond the scope of our article. The second possible strategy for circumventing the self-prediction argument is the restrictionist strategy of building AIs that are incapable of self-prediction. The problem here is that it is far from clear whether this can be done. Finally, there may be a prediction algorithm that is more complex than the AI without being similarly intelligent itself, thereby avoiding a regress of the prediction problem. None of these are obviously promising.
In the introduction, we mentioned that we suspect that the self-prediction argument is merely an instance of the more fundamental fact that there is a trade-off between predictability and intelligence. If we are right, this would make it unlikely that any simple workaround is available here. In particular, it would make it unlikely that any of the three aforementioned strategies will bear fruit. Until a more rigorous argument or proof is on the table, however, we cannot be certain of this point.
Importantly, the burden of proof lies on the side of those who want to employ AI systems likely to result in anything resembling AGI. If it turns out that our argument remains plausible under sufficient scrutiny, then it offers a decisive reason not to employ any such AI system, even if the argument cannot be strengthened into a formal proof.
Pessimism about AI Safety
If the issues we raised cannot be suitably addressed, then research on advanced AI will be inherently unsafe. Even if it turns out after deployment that advanced AI systems do not pose any threat to humanity, maybe because it is discovered that there is some as yet unrecognized connection between intelligence and moral virtue, if this wasn’t known before deployment, the relevant system wasn’t safe in the sense required.[5] In the following, we will thus refer to AI research which we do not know to be safe as potentially dangerous, i.e., the relevant sense of “potentially” is epistemic. This raises a crucial question: What if there will always remain strong reasons to doubt the safety of deploying candidates for AGI, so that it will always be a potentially dangerous technology? What if the deployment of such systems will always involve a gamble? We call this view pessimism about safe AI.[6] Our treatment here will be superficial, as this topic is well worth a book-length treatment, or rather many book-length treatments.
First, one may ask whether one should restrict AI research at all, even if one did not know it to be safe. There are a number of possible motivations here. First, one may think that the evolution of technology is essentially a natural process that cannot be effectively controlled by policymakers. But this viewpoint is myopic. Nuclear weapons are an interesting case in point. The development of nuclear weapons is pretty much impossible for private actors, and politically costly for most state actors due to the perceived dangers of nuclear proliferation. While we do not live in a world free of nuclear weapons, it is safe to say that nuclear proliferation has been greatly slowed by sheer political will. There is no reason to think that something similar could not be done in the case of potentially dangerous AI technology. Certainly, such policies would not work perfectly, but it is equally certain that they could greatly decrease the risk of the development of dangerous AI.
A second reason for not infringing on dangerous AI research may be the view that progress always entails a certain risk. However, as the stakes are existential, we do not consider this position worthy of discussion. If AI technology bears the risk of ending human life, then its development motivated by some vague intuition of progress is plainly mad.
On the pessimist scenario, it seems that the goals of AI safety research are bound to change substantially. Rather than investigating how AI may be made safe, AI safety research should then focus on minimizing the risk that true AGI is ever achieved. This involves two challenges. First, the demarcation challenge is the challenge of differentiating potentially dangerous (i.e., potentially resulting in AGI) from benign AI research. Second, the proliferation challenge is the challenge of suppressing AI technology that is deemed potentially dangerous.
Different schools of AI will result in different views on the demarcation challenge. Hawkins (2021) has suggested that the problem-solving aspect of intelligence can effectively be disentangled from its motivational aspects.[7] On such a view, virtually all AI research can be done in such a way that the chance of the emergence of a truly autonomous intelligent agent is minimal. On the other extreme, more cybernetically inclined approaches like active inference suggest that any system for controlling “physiological” parameters may result in super-intelligent agency if the control process becomes sophisticated enough.[8]
In our view, it is unlikely that the demarcation challenge will be solved in any straightforward way. Rather, we suspect that there will be some degree of plausible reasoning involved in deciding whether some approach to AI is benign or not. Still, the development of general guidelines and principles is a central goal of AI safety research on the pessimist’s agenda.
We will offer only some short comments on the proliferation challenge, as it is primarily a political issue. Here too, there are two broadly different strategies: a liberal path, which disincentivizes only those AI technologies deemed most dangerous, and a path of strong regulation, which bans all kinds of even modestly dangerous research.
While one may think that strong regulation will always offer maximum security, this is not obvious. For the stronger the political measures, the greater the incentive to work against or circumvent them. On the other hand, even the liberal path, if it is to be effective at all, would have to involve massive regulation, presumably by state actors. For instance, considering the progress in computer technology, the personal computers of 2035 may have the computing power that only supercomputers achieve today. It may thus be feasible to implement a superhuman AI on such a machine. The logical consequence is that one would have to either put legal limits on the computational power of personal computers, implement some kind of surveillance state, or use some hardware-based enforcement that prevents the implementation of potentially dangerous AI software. The latter strategy would not be without precedent, even though on a different scale. Enforcing limits on the usage of dangerous technology is well-known in the world of bio-tech. For instance, there is the SecureDNA project, which aims to screen DNA-synthesis orders for sequences that are too dangerous to be publicly available. Just as bio-tech providers screen for requests that are too dangerous today, we can imagine a future where providers of processing power screen for potentially dangerous AI applications.
We admit that none of the available political options sound particularly appealing. This is why, in our view, given the progress in AI capabilities and the state of current AI safety techniques, the question of how AI technology may be controlled effectively at minimal cost to liberty should be an active area of research.
All these strategies involve massive political challenges. Any ban on AI research would be useless unless agreed upon by all the powerful political actors, and even hardware approaches would require a substantial amount of control, coordination, and enforcement. In a pessimistic scenario, the worst thing that could happen is rapid progress of AI technology in the context of some kind of arms race, be the actors companies striving for market dominance, state actors in some kind of (hot or cold) war, or both. Collaboration between all the relevant actors on national and international levels would be imperative. On the bright side, the threat of malicious or misaligned AGI may be a sufficiently motivating force to bring about increased cooperation if it is sufficiently tangible. Unlike the rather abstract danger of global warming, the threat of a rogue AI can be intuitively grasped and already has a firm foothold in the public imagination. This is clearly shown by a recent poll in which 55% of the general public in the US was somewhat or very worried about existential risk from AI. This is one reason we see for expecting a political reaction.
A second reason for a constrained form of optimism is the following. A strong societal and political response against dangerous AI research after a moment of realization on the issue is likely, unless the further deployment and takeoff of AGI happens in a very specific way. In our view, one should expect at least a wake-up moment akin to the one we had with Covid, unless one believes in extremely fast takeoff speeds and thus does not expect many more misaligned AI models of increasing capability to be released before this takeoff. If either of these conditions fails to hold, non-existential demonstrations of the dangers of unconstrained AI research are likely to occur, thereby making voters and policymakers aware of the issue.
We are uncertain about how society would respond to such an event, but it seems likely that what is politically and socially possible would change as a result. Finally, we want to emphasize that, as soon as we leave behind the standard AI safety paradigm and take the pessimist scenario seriously, it quickly becomes evident that all solutions to the problems of AI will have to involve political action.
Some Conclusions
One of the tasks of the Open Philanthropy AI Worldviews Contest is to estimate the probability of doom scenarios due to loss of control over an AGI system, given the development of AGI by 2070. The strategy of our paper was to go meta on this question. Our reasoning suggests that any such estimate would be dubious since the behavior of systems that are more intelligent than us lies behind an epistemic event horizon. More abstractly, we want to suggest that the lead question is predicated on a paradigm of AI safety that constrains the field. According to this paradigm, the task of AI safety research consists primarily in finding technical solutions for aligning AI systems. As we argued, such an approach presupposes that the alignment problem is indeed solvable, which in turn presupposes that the prediction problem is solvable. But as we have shown, there are good reasons to be skeptical of this assumption.
Leaving this paradigm behind, AI safety researchers should start to seriously ask the question: What if AI is inherently unsafe? Once this question is on the table, it quickly becomes evident that appeals to technological optimism or the inevitability of technological progress simply will not do. So far, the dominant paradigm has been that AGI is essentially unavoidable and only has to be built in the right way or controlled well enough. We want to consider a different paradigm. In this paradigm, intelligence may be a little bit like fire. Just because you have learned to make it does not mean it is wise to make as much of it as possible. And its responsible use does not consist in figuring out how to make it inherently safe, since that is impossible, but in figuring out how to employ it to achieve specific tasks without burning the house down.
We now want to be a little more specific about the implications of our considerations for the question of the probability of an AI doomsday scenario, given that we reach AGI. In light of what we have said, we have to abstain from assigning a probability to an existential catastrophe caused by AGI.
We do, however, see our argument as making a case that existential catastrophe is more likely. Since we see the prediction problem as underlying the alignment problem, and the prediction problem as potentially insoluble, the argument entails that current efforts to achieve AI alignment have relatively little impact on the probability of AI doomsday scenarios. Additionally, we think that the currently in vogue approach of reinforcement learning from human feedback (RLHF, and variations thereof) should not update one’s probability of an existential catastrophe by AGI. This is because RLHF and its variations precisely do not render the systems in question any more predictable. There is no reason to believe that the human feedback provided in the training process will generalize in the right manner to the point of deployment, making the approach essentially worthless as a response to the prediction problem and thereby the alignment problem.
Before concluding, we want to be absolutely clear about what our argument is not. It is not that since AI alignment is clearly doomed, there is no point in working on it. The argument should not at this point, and not in its current form, discourage any AI safety research. Neither the argument from computability, nor the self-prediction argument would justify this conclusion. The argument from computability simply does not entail that AI alignment research is doomed to failure, but merely that there are no solutions that apply to AI systems generally. The self-prediction argument relies on intuitive assumptions and might as such be wrong or misguided. But since these intuitions seem rather stable to us, we think that we should at the very least take their implications seriously. Our argument is also not that AI is inherently unsafe and we should thus implement an anti-technological global surveillance state that suppresses the progress of computer technology. Rather, our point is that while solving the prediction problem is necessary for solving the alignment problem, it has received little attention in AI safety work, and despite several recent expressions of pessimism on AI safety, the possibility that there may be no way to make AGI safe has rarely been discussed seriously. The question of how to reasonably address this possibility has at the same time been discussed even less. Sufficient time, money, and energy should be allocated to this task.
When we started writing this essay, we thought that our opinion, namely support for strong political measures against the further increase of AI capabilities, was likely to be a fringe position. We see it as a positive sign that this view has become more widespread.
References
Alfonseca, Manuel et al. (2021). “Superintelligence Cannot Be Contained: Lessons from Computability Theory”. In: Journal of Artificial Intelligence Research 70, pp. 65–76.
Ashby, W. Ross (1947). “Principles of the Self-Organizing Dynamic System”. In: The Journal of General Psychology 37.2, pp. 125–128.
Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
Bostrom, Nick (2019). “The Vulnerable World Hypothesis”. In: Global Policy 10.4, pp. 455–476.
Conant, Roger C. and W. Ross Ashby (1970). “Every good regulator of a system must be a model of that system”. In: International Journal of Systems Science 1.2, pp. 89–97.
Hawkins, Jeff (2021). A Thousand Brains: A New Theory of Intelligence. Basic Books.
Hubinger, Evan et al. (2019). “Risks from Learned Optimization in Advanced Machine Learning Systems”. arXiv preprint, pp. 1–39.
Parr, Thomas, Giovanni Pezzulo, and Karl Friston (2022). Active Inference: The Free Energy Principle in Mind, Brain and Behavior. MIT Press.
Schuurmans, Dale (2023). Memory Augmented Large Language Models are Computationally Universal. https://arxiv.org/pdf/2301.04589.
Sevilla, Jaime and John Burden (2021). Response to ’Superintelligence cannot be contained: Lessons from Computability Theory’. https://www.cser.ac.uk/news/response-superintelligence-contained/. Accessed: 2023-01-04.
Soon, Chun Siong et al. (2008). “Unconscious determinants of free decisions in the human brain”. In: Nature Neuroscience 11.5, pp. 543–545.
Notes
[1] As mentioned by Alfonseca et al. (2021), this result also follows directly from Rice’s theorem.
[2] Though note that programs that really behave as intended under all circumstances are very rare in practice, as evidenced by the immense difficulty of computer security.
[3] Think of something like the Libet experiment, but perhaps using a brain scanner as in Soon et al. (2008), though at some sci-fi limit of the technology.
[4] The challenge here is similar to the halting challenge mentioned above. Here too the idea is to restrict the usage of arbitrary Turing machines. However, the current problem seems even more difficult, as it does not merely require that we figure out how to make an intelligent system non-general, but non-general in some very specific sense.
[5] Perhaps even more plausibly, one may speculate that there may be an intrinsic connection between intelligence and a certain form of nihilism, in the sense that intelligent systems tend towards wireheading.
[6] Living in a world where AI research is potentially dangerous would be a case in which we have an unknown but considerable probability of living in a vulnerable world due to AI technology, where a vulnerable world means that there is some technology that, if deployed, almost certainly devastates civilization by default, as discussed in Bostrom (2019).
[7] This does not contradict the self-prediction argument. The self-prediction argument assumes intelligence to be action-oriented. But if the mentioned paradigm were correct, the intellectual capacities of minds may be separated from the action-oriented ones.
[8] Some interesting pessimistic comments on AI safety from a cybernetic viewpoint may be found in as early a source as Ashby (1947).