Thank you for writing this!
One argument for the “playbook” rather than the “plan” view is that there is a big difference between DISASTER (something very bad happening) and DOOM (an irrecoverable, extinction-level catastrophe). Consider the case of nuclear weapons. Arguably the disaster of the Hiroshima and Nagasaki bombs led us to better arms control, which has so far helped prevent the catastrophe (even if not quite an existential one) of an all-out nuclear war. In all but extremely fast take-off scenarios, we should see disasters as warning signs before doom.
The good thing is that avoiding disasters makes good business. In fact, I don’t expect AI labs to require any “altruism” to focus their attention on alignment and safety. This survey by Timothy Lee on self-driving cars notes that after a single tragic incident in which an Uber self-driving car killed a pedestrian, “Uber’s self-driving division never really recovered from the crash, and Uber sold it off in 2020. The rest of the industry vowed not to repeat Uber’s mistake.” Given that a single disaster can be extremely hard to recover from, smart leaders of AI labs should focus on safety, even if it means being a little slower to market.
While the initial push is to get AI to match human capabilities, as these tools become more than impressive demos and need to be deployed in the field, customers will care much more about reliability and safety than about capabilities. If I am a software company using an AI system as a programmer, it’s more useful to me if it can reliably deliver bug-free 100-line subroutines than if it writes 10K-line programs that might contain subtle bugs. There is a reason why much of the programming infrastructure for real-world projects, including pull requests, code reviews, and unit tests, is not aimed at getting something that kind of works out the door as quickly as possible, but rather at making sure the codebase grows in a reliable and maintainable fashion.
This doesn’t mean that the free market can take care of everything, or that regulations are not needed to ensure that some companies don’t make a quick profit by deploying unsafe products and pushing off externalities onto their users and the broader environment. (Indeed, some would say that this was done in the self-driving domain...) But I do think there is a big commercial incentive for AI labs to invest in research on how to ensure that the systems they push out behave in a predictable manner, and don’t start maximizing paperclips.
p.s. The nuclear setting also gives another lesson (TW: grim calculations follow). It is much more than a factor of two harder to extinguish 100% of the population than to kill the ~50% or so that live in large metropolitan areas. Generally, the ratio between the effort needed to kill a 1−p fraction of the population and the effort needed to kill 50% should scale at least as 1/p.
Thanks for the thoughts! I agree that there will likely be commercial incentives for some amount of risk reduction, though I worry that the incentives will trail off before the needs trail off—more on that here and here.
These are interesting! And deserve more discussion than just a comment.
But one high-level point regarding “deception” is that, at least at the moment, AI systems have the feature of not being very reliable. GPT-4 can do amazing things but with some probability will stumble on things like multiplying not-too-big numbers (e.g. see this—second pair I tried).
While in other areas of computing technology we talk about “five nines reliability”, in AI systems the scaling works out such that we need to spend huge effort to move from 95% to 99% to 99.9% reliability, which is part of why self-driving cars are not deployed yet.
If we cannot even make AIs perfect at the task they were explicitly built to perform, there is no reason to imagine they would be anywhere close to perfect at deception either.
I agree that today’s AI systems aren’t highly reliable at pretty much anything, including deception. But I think we should expect more reliability in the future, partly for reasons you give above, and I think that’s a double-edged sword.
Under the picture you sketch out above, companies will try to train AIs to be capable of being much more reliable (while also, presumably, being intelligent and even creative). I also think reliability is likely to increase without necessarily having big reliability-focused efforts: just continuing to train systems at larger scale and with more/better data is likely to make them more capable in a way that makes them more reliable. (E.g., I think current language models have generally gotten more reliable partly via pure scaling up, though things like RLHF are also part of the picture.) For both reasons, I expect progress on reliability, with the pace of progress very hard to forecast. If AI systems become capable of being intelligent and creative in useful ways while making only extraordinarily rare mistakes, then it seems like we should be worrying about their having developed reliable deception capabilities as well. Thoughts on that?
At the moment at least, progress on reliability is very slow compared to what we would want. To get a sense of what I mean, consider the case of randomized algorithms. If you have an algorithm $A$ that for every input $x$ computes some function $f$ with probability at least $2/3$ (i.e. $\Pr[A(x)=f(x)] \geq 2/3$), then if we spend $k$ times more computation, we can do majority voting and use standard bounds to show that the probability of error drops exponentially with $k$ (i.e. $\Pr[A_k(x)=f(x)] \geq 1-\exp(-k/10)$ or something like that, where $A_k$ is the algorithm obtained by scaling up $A$ to run it $k$ times and output the plurality value).
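To make this concrete, here is a minimal simulation sketch (in Python, with a made-up base algorithm that is correct with probability 2/3) of how majority voting amplifies reliability:

```python
import random
from collections import Counter

def base_algorithm(x, p_correct=2/3):
    # Hypothetical randomized algorithm: returns the correct answer (here just x,
    # standing in for f(x)) with probability 2/3, and a wrong answer otherwise.
    return x if random.random() < p_correct else x + 1

def amplified(x, k):
    # Run the base algorithm k times and output the plurality value.
    votes = Counter(base_algorithm(x) for _ in range(k))
    return votes.most_common(1)[0][0]

def error_rate(k, trials=100_000):
    return sum(amplified(0, k) != 0 for _ in range(trials)) / trials

for k in (1, 5, 15, 45):
    print(k, error_rate(k))
# The error drops roughly like exp(-ck): each constant-factor increase in
# compute buys an exponential improvement in reliability.
```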
This is not special to randomized algorithms: the same holds in the context of noisy communication and error-correcting codes, and many other settings. Often we can get to $1-\delta$ success at a price of $O(\log(1/\delta))$, which is why we can get things like “five nines reliability” in several engineering fields.
In contrast, so far all our scaling laws show that when we scale our neural networks by spending a factor of $k$ more computation, we only get a reduction in the error that looks like $k^{-\alpha}$, so it’s polynomial rather than exponential, and even the exponent of the polynomial is not that great (and in particular smaller than one).
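A back-of-the-envelope comparison of the two regimes (the exponent $\alpha = 0.5$ here is an assumed illustrative value, not a measured one):

```python
import math

err_start, err_target = 0.05, 0.001  # going from 95% to 99.9% reliability
alpha = 0.5                          # assumed illustrative scaling-law exponent

# Polynomial regime: err ~ k^(-alpha), so k = (err_start / err_target)^(1/alpha)
poly_factor = (err_start / err_target) ** (1 / alpha)

# Exponential regime: err ~ exp(-c*k), so the extra factor ~ log(err_start / err_target)
exp_factor = math.log(err_start / err_target)

print(poly_factor)  # ~2500  -- 2500x more compute
print(exp_factor)   # ~3.9   -- a few times more compute (up to constants)
```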
So while I agree that scaling up will yield progress on reliability as well, at least with our current methods it seems that we would be doing things that are 10 or 100 times more impressive than what we do now before we get to the type of 99.9% or better reliability on the things that we currently do. Getting to something that is both super-human in capability and has such a tiny probability of failure that it would not be detected seems much further off.
That’s interesting, thanks!
In addition to some generalized concern about “unknown unknowns” leading to faster progress on reliability than expected by default (especially in the presence of commercial incentives for reliability), I also want to point out that there may be some level of capabilities where AIs become good at doing things like:
Assessing the reliability of their own thoughts, and putting more effort into things that have the right combination of uncertainty and importance.
Being able to use that effort productively, via things like “trying multiple angles on a question” and “setting up systems for error checking.”
I think that in some sense humans are quite unreliable, and use a lot of scaffolding (variable effort at reliability, consulting with each other and trying to catch things each other missed, using systems and procedures, etc.) to achieve high reliability, when we do so. Because of this, I think AIs could have pretty low baseline reliability (like humans) while finding ways to be effectively highly reliable (like humans). And I think this applies to deception as much as anything else (if a human thinks it’s really important to deceive someone, they’re going to make a lot of use of things like this).
I agree that there is much to do to improve AI reliability, and there are a lot of good reasons (in particular to make AI more useful for us) to do so. So I agree reliability will improve. In fact, I very much hope this happens! I believe faster progress on reliability would go a long way toward enabling positive applications of AI.
I also agree that a likely path to do so is by adjusting the effort based on estimates of reliability and the stakes involved. At the moment, systems such as ChatGPT spend the same computational effort whether someone asks them for a joke or for medical advice. I suspect this will change, and variable inference-time computation will become more standard. (Things like “chain of thought” already spend more inference compute to get better performance, but they don’t really have a “knob” we can turn to control the computation/reliability tradeoff.)
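A rough sketch of what such a knob could look like (the `query_model` call and the stakes-to-samples mapping are hypothetical placeholders, not any existing API):

```python
from collections import Counter

def query_model(prompt: str, temperature: float = 0.7) -> str:
    # Hypothetical stand-in for whatever interface actually serves the model.
    raise NotImplementedError

def answer(prompt: str, stakes: str = "low") -> str:
    # Crude stakes-to-compute mapping; a real system would presumably estimate
    # the stakes and its own uncertainty rather than take them as an argument.
    n_samples = {"low": 1, "medium": 5, "high": 25}[stakes]
    # Sample several independent reasoning chains and take a plurality vote,
    # in the spirit of self-consistency decoding.
    votes = Counter(query_model(prompt) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# answer("Tell me a joke.", stakes="low")                   # one sample, cheap
# answer("Summarize my treatment options.", stakes="high")  # ~25x more compute
```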
Regarding the deception issue, it might still be the case that such extra effort is observable, and also that it comes at the cost of solving the original task. (So your performance per unit of compute is worse if you are not merely trying to solve the task but also to deceive in a highly reliable way.)
In particular, even if we apply inference-time compute, unfortunately I don’t think we know of a path to get an $O(\log(1/\delta))$ overhead in inference time to achieve a failure probability of $\delta$. It seems that we are still stuck in the $\delta^{-\mathrm{constant}}$ regime, i.e. an overhead polynomial in $1/\delta$. So if you wanted to get a 99.9% probability of not getting caught, then you would incur a very noticeable effort.
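To put a rough number on that gap (taking the polynomial exponent to be 1, purely as an illustrative assumption): driving the failure probability down to $\delta = 10^{-3}$ costs roughly

$$\log(1/\delta) = \ln(1000) \approx 7 \quad \text{vs.} \quad \delta^{-1} = 1000$$

times the baseline effort in the logarithmic and polynomial regimes respectively, and in the latter regime that extra inference-time compute would be hard to hide.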
I’m not convinced it requires a huge compute tax to reliably avoid being caught. (If I were, I would in fact probably be feeling a lot more chill than I am.)
The analogy to humans seems important. Humans are capable of things like going undercover, and pulling off coups, and also things like “working every day with people they’d fire if they could, without clearly revealing this.” I think they mostly pull this off with:
Simple heuristics like “Be nice, unless you’re in the very unusual situation where hostile action would work well.” (I think the analogy to how AIs might behave is straightforward.)
The fact that they don’t need to be perfect—lots of behaviors and situations are just ambiguous. (I think this also applies to AI, and maybe more so: there could be lots of weird little incidents that are chalked up to bugs without causing concern sufficient to address the problem reliably.)
Also, humans generally need to do a lot of reasoning along the lines of “X usually works, but I do need to notice the rare situations when something radically different is called for.” So if this is expensive, they just need to be doing that expensive thing a lot.