tl;dr: “lack of rigorous arguments for P is evidence against P” is typically valid, but not in the case of P = AI X-risk.
A high-level reaction to your point about unfalsifiability:
There seems to be a general sentiment that “AI X-risk arguments are unfalsifiable ==> the arguments are incorrect” and “AI X-risk arguments are unfalsifiable ==> AI X-risk is low”.[1] I am very sympathetic to this sentiment—but I also think that in the particular case of AI X-risk, it is not justified.[2] For quite non-obvious reasons.
Why do I believe this?
Take this simplified argument for AI X-risk:
Some important future AIs will be goal-oriented, or will at least sometimes behave in a goal-oriented way[3]. (Read: If you think of them as trying to maximise some goal, you will make pretty good predictions.[4])
The “AI-progress tech-tree” is such that discontinuous jumps in impact are possible. In particular, we will one day go from “an AI that is trying to maximise some goal, but not doing a very good job of it” to “an AI that is able to treat humans and other existing AIs as ‘environment’, and is going to do a very good job at maximising some goal”.
For virtually any[5] goal specification, doing a sufficiently[6] good job at maximising that goal specification leads to an outcome where every human is dead.
FWIW, I think that having a strong opinion on (1) and (2), in either direction, is not justified.[7] But in this comment I only want to focus on (3), so let’s please pretend, for the sake of this discussion, that we find (1) and (2) at least plausible. What I claim is that even if we lived in a universe where (3) is true, we should still expect even the best arguments for (3) (that we might realistically identify) to be unfalsifiable, at least given realistic constraints on falsification effort, and assuming that we use rigorous standards for what counts as solid evidence, like people do in mathematics, physics, or CS.
What is my argument for “even the best arguments for (3) will be unfalsifiable”?
Suppose you have an environment E that contains a Cartesian agent (a thing that takes actions in the environment and, let’s assume for simplicity, has perfect information about the environment, but whose decision-making computation happens outside of the environment). And suppose that this agent acts in a way that maximises[8] some goal specification[9] over E. Now, E might or might not contain humans, or representations of humans. We can now ask the following question: Is it true that, unless we spend an extremely high amount of effort (e.g., more than 5 civilisation-years), any (non-degenerate[10]) goal specification we come up with will result in human extinction[11] in E when maximised by the agent? I refer to this as “Extinction-level Goodhart’s Law”.
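One way to write the question down a bit more precisely (the notation here is my own shorthand, not something from the post): let G(E, B) denote the set of goal specifications over E that we could produce with an effort budget of at most B (say, B = 5 civilisation-years), and let D denote the degenerate ones (constant rewards, “shut down immediately” rewards, and similar). The question is then whether

for every G in G(E, B) \ D: an agent that does a sufficiently good job of maximising G over E ends up in a state where the (representations of) humans in E are dead.

Extinction-level Goodhart’s Law for E is the claim that the answer is “yes”.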
I claim that:
(A) Extinction-level Goodhart’s Law plausibly holds in the real world. (At least the thought experiments I know of, e.g. here or here, suggest it does.)
(B) Even if Extinction-level Goodhart’s Law were true in the real world, it would still be false in environments where we could verify it experimentally (today, or soon) or mathematically (by proofs, given realistic amounts of effort).
==> And (B) implies that if we want “solid arguments”, rather than just thought experiments, we might be kinda screwed when it comes to Extinction-level Goodhart’s Law.
And why do I believe (B)? The long story is that I try to gesture at this in my sequence on “Formalising Catastrophic Goodhart”. The short story is that there are many strategies for finding “safe to optimise” goal specifications that work in simpler environments, but not in the real world (examples below). So to even start gaining evidence on whether the law holds in our world, we need to investigate environments where those simpler strategies don’t work, and it seems to me that those are always too complex for us to analyse mathematically, or to run an AI in them which could “do a sufficiently good job at trying to maximise the goal specification”.
Some examples of the above-mentioned strategies for finding safe-to-optimise goal specifications:
(i) The environment contains no (representations of) humans, or those “humans” can’t “die”, so it doesn’t matter. (E.g., most gridworlds.)
(ii) The environment doesn’t have any resources or similar things that would give rise to convergent instrumental goals, so it doesn’t matter. (E.g., most gridworlds.)
(iii) The environment allows for a simple formula that checks whether “humans” are “extinct”, so just add a huge penalty if that formula holds. (E.g., most gridworlds where you added “humans”; sketched in the code after this list.)
(iv) There is a limited set of actions that result in “killing” “humans”, so just add a huge penalty to those.
(v) There is a simple formula for expressing a criterion that limits the agent’s impact. (E.g., “don’t go past these coordinates” in a gridworld.)
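To make strategy (iii) concrete, here is a minimal Python sketch of such a “bolted-on extinction penalty”. This is my illustration rather than anything from the post, and the names (GridState, human_cells, EXTINCTION_PENALTY) are hypothetical:

```python
from __future__ import annotations

from dataclasses import dataclass, field

EXTINCTION_PENALTY = -1_000_000  # huge penalty added when the extinction check fires


@dataclass
class GridState:
    agent_pos: tuple[int, int]
    human_cells: set[tuple[int, int]] = field(default_factory=set)  # cells with "alive humans"


def humans_extinct(state: GridState) -> bool:
    # In a gridworld, "are the humans extinct?" is a one-line formula over the state...
    return len(state.human_cells) == 0


def reward(state: GridState, base_reward: float) -> float:
    # ...so it is easy to bolt a huge penalty onto any base reward, making the resulting
    # specification "safe to optimise" in this environment. No analogous simple formula
    # is available for the real world, which is why such environments give little
    # evidence either way about Extinction-level Goodhart's Law.
    if humans_extinct(state):
        return base_reward + EXTINCTION_PENALTY
    return base_reward


# Example usage:
print(reward(GridState((0, 0), human_cells={(2, 3), (4, 1)}), base_reward=1.0))  # 1.0
print(reward(GridState((0, 0)), base_reward=1.0))  # -999999.0 ("humans" gone)
```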
Taken together, this should explain why the “unfalsifiability” counter-argument does not hold as much weight, in the case of AI X-risk, as one might intuitively expect.
If I understand you correctly, you would endorse something like this? Quite possibly with some disclaimers, ofc. (Certainly I feel that many other people endorse something like this.)
I acknowledge that the general heuristic “argument for X is unfalsifiable ==> the argument is wrong” holds in most cases. And I am aware we should be sceptical whenever somebody goes “but my case is an exception!”. Despite this, I still believe that AI X-risk genuinely is different from invisible dragons in your garage and conspiracy theories.
That said, I feel there should be a bunch of other examples where the heuristic doesn’t apply. If you have some that are good, please share!
An example of this would be if GPT-4 acted like a chatbot most of the time, but tried to take over the world when prompted with “act as a paperclipper”.
And this way of thinking about them is easier (in terms of description length, etc.) than other options. E.g., no “water bottles maximising being a water bottle”.
By “virtually any” goal specification (leading to extinction when maximised), I mean that finding a goal specification for which extinction does not happen (when maximised) is extremely difficult. One example of operationalising “extremely difficult” would be “if our civilisation spent all its efforts on trying to find some goal specification, for 5 years from today, we would still fail”. In particular, the claim (3) is meant to imply that if you do anything like “do RLHF for a year, then optimise the result extremely hard”, then everybody dies.
For the purposes of this simplified AI X-risk argument, the AIs from (2), which are “very good at maximising a goal”, are meant to qualify for the “sufficiently good job at maximising a goal” from (3). In practice, this is of course more complicated—see e.g. my post on Weak vs Quantitative Extinction-level Goodhart’s Law.
Or at least there are no publicly available writings, known to me, which could justify claims like “It’s >=80% likely that (1) (or (2)) holds (or doesn’t hold)”. Of course, (1) and (2) are too vague for this to even make sense, but imagine replacing (1) and (2) by more serious attempts at operationalising the ideas that they gesture at.
(or does a sufficiently good job of maximising)
Most reasonable ways of defining what “goal specification” means should work for the argument. As a simple example, we can think of having a reward function R : states → ℝ and maximising the sum of R(s) over any long time horizon.
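Spelled out slightly more (just elaborating this footnote, not adding anything new): with a reward function R : states → ℝ and a long horizon T, the quantity being maximised would be G(s_0, …, s_T) = R(s_0) + R(s_1) + … + R(s_T), or its discounted variant Σ_t γ^t · R(s_t) with γ close to 1. The argument is not meant to hinge on which of these variants one picks.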
To be clear, there are some trivial ways of avoiding Extinction-level Goodhart’s Law. One is to consider a constant utility function, which means that the agent might as well take random actions. Another would be to use reward functions in the spirit of “shut down now, or get a huge penalty”. And there might be other weird edge cases.
I acknowledge that this part should be better developed. But in the meantime, hopefully it is clear—at least somewhat—what I am trying to gesture at.
Most environments won’t contain actual humans. So by “human extinction”, I mean the “metaphorical humans being metaphorically dead”. EG, if your environment was pacman, then the natural thing would be to view the pacman as representing a “human”, and being eaten by the ghosts as representing “extinction”. (Not that this would be a good model for studying X-risk.)
An illustrative example, describing a scenario that is similar to our world, but where “Extinction-level Goodhart’s law” would be false & falsifiable (hat tip Vincent Conitzer):
Suppose that we somehow only start working on AGI many years from now, after we have already discovered a way to colonise the universe at close to the speed of light. And some of the colonies are already unreachable, outside of our future lightcone. But suppose we still understand “humanity” as the collection of all humans, including those in the unreachable colonies. Then any AI that we build, no matter how smart, would be unable to harm these portions of humanity. And thus full-blown human extinction, from AI we build here on Earth, would be impossible. And you could “prove” this using a simple, yet quite rigorous, physics argument.[1]
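(A sketch of the kind of physics argument this could be; this gloss is mine, not necessarily the version the example has in mind.) Anything the Earth-based AI does at some time t_0 propagates at speed at most c, so by a later time t it can have affected only points within distance c·(t − t_0) of Earth. A colony that lies outside the future lightcone of (Earth, t_0), i.e. one that no signal sent from Earth at or after t_0 can ever reach, therefore cannot be harmed by the AI, no matter how capable it is, and so the humans there survive.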
(To be clear, I am not saying that “AI X-risk’s unfalsifiability is justifiable ==> we should update in favour of AI X-risk compared to our priors”. I am just saying that the justifiability means we should not update against it compared to our priors. Though I guess that in practice, it means that some people should undo some of their updates against AI X-risk… )
And sure, maybe some weird magic is actually possible, and the AI could actually beat the speed of light. But whatever, I am ignoring this, and an argument like this would count as falsification as far as I am concerned.
FWIW, I acknowledge that my presentation of the argument isn’t ironclad, but I hope that it makes my position a bit clearer. If anybody has ideas for how to present it better, or has some nice illustrative examples, I would be extremely grateful.