These are interesting! And deserve more discussion than just a comment.
But one high-level point regarding “deception” is that, at least at the moment, AI systems have the feature of not being very reliable. GPT-4 can do amazing things, but with some probability will stumble on things like multiplying not-too-big numbers (e.g. see this—second pair I tried). While in other areas of computing technology we talk about “five nines reliability”, in AI systems the scaling works out such that we need to spend huge effort to move from 95% to 99% to 99.9%, which is part of why self-driving cars are not deployed yet.
If we cannot even make AIs perfect at the tasks they were explicitly made to perform, there is no reason to imagine they would be anywhere close to perfect at deception either.
I agree that today’s AI systems aren’t highly reliable at pretty much anything, including deception. But I think we should expect more reliability in the future, partly for reasons you give above, and I think that’s a double-edged sword.
Under the picture you sketch out above, companies will try to train AIs to be capable of being much more reliable (while also, presumably, being intelligent and even creative). I also think reliability is likely to increase without necessarily having big reliability-focused efforts: just continuing to train systems at larger scale and with more/better data is likely to make them more capable in a way that makes them more reliable. (E.g., I think current language models have generally gotten more reliable partly via pure scaling up, though things like RLHF are also part of the picture.) For both reasons, I expect progress on reliability, though the pace of that progress is very hard to forecast. If AI systems become capable of being intelligent and creative in useful ways while making only extraordinarily rare mistakes, then it seems like we should be worrying about their having developed reliable deception capabilities as well. Thoughts on that?
At the moment at least, progress on reliability is very slow compared to what we would want. To get a sense of what I mean, consider the case of randomized algorithms. If you have an algorithm A that for every input x computes some function f with probability at least 2/3 (i.e., Pr[A(x)=f(x)] ≥ 2/3), then by spending k times more computation we can do majority voting, and standard concentration bounds show that the probability of error drops exponentially with k (i.e., Pr[A_k(x)=f(x)] ≥ 1 − exp(−k/10) or something like that, where A_k is the algorithm obtained by running A k times and outputting the plurality value).
This is not special to randomized algorithms. The same holds in the context of noisy communication and error-correcting codes, and many other settings. Often we can get to 1−δ success at a price of O(log(1/δ)), which is why we can get things like “five nines reliability” in several engineering fields.
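To make this concrete, here is a minimal sketch (in Python) of the amplification argument above; the toy noisy_parity task, the 2/3 base success probability, and the constants in the bound are just illustrative placeholders, not anything from a real system:

```python
import math
import random

def amplify(base_algorithm, x, k):
    """Run a randomized algorithm k times and return the plurality answer."""
    votes = {}
    for _ in range(k):
        answer = base_algorithm(x)
        votes[answer] = votes.get(answer, 0) + 1
    return max(votes, key=votes.get)

def noisy_parity(x, p_correct=2/3):
    """Toy 'algorithm': returns the parity of x, but is only correct with probability 2/3."""
    correct = x % 2
    return correct if random.random() < p_correct else 1 - correct

def repetitions_needed(delta, gap=1/6):
    """Hoeffding-style bound: k = O(log(1/delta)) repetitions suffice when the
    base algorithm is correct with probability 1/2 + gap."""
    return math.ceil(math.log(1 / delta) / (2 * gap ** 2))

if __name__ == "__main__":
    x = 12345
    for delta in (1e-2, 1e-3, 1e-9):
        k = repetitions_needed(delta)
        errors = sum(amplify(noisy_parity, x, k) != x % 2 for _ in range(1000))
        print(f"target error {delta:.0e}: k = {k} repetitions, "
              f"errors observed in 1000 trials: {errors}")
```

The point is just that the number of repetitions grows logarithmically in 1/δ, so driving the error down toward “five nines” territory is cheap.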
In contrast, so far all our scaling laws show that when we scale our neural networks by spending a factor of k more computation, we only get a reduction in error that looks like k^(−α). The error decays polynomially rather than exponentially, and even the exponent α is not that great (in particular, it is smaller than one).
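A back-of-the-envelope comparison of the two regimes, with α = 0.1 as a placeholder exponent (not a measured scaling-law value), shows how different they are:

```python
import math

def compute_multiplier_power_law(err_from, err_to, alpha=0.1):
    """If error ~ k^(-alpha), the extra compute factor needed to go from
    err_from to err_to is (err_from / err_to)^(1/alpha)."""
    return (err_from / err_to) ** (1 / alpha)

def repetitions_exponential(err_from, err_to, c=10):
    """If error ~ exp(-k/c), only about c * ln(err_from / err_to) extra
    repetitions are needed."""
    return c * math.log(err_from / err_to)

# Going from 5% error (95% reliability) to 0.1% error (99.9% reliability):
print(compute_multiplier_power_law(0.05, 0.001))  # ~1e17 times more compute
print(repetitions_exponential(0.05, 0.001))       # ~39 extra repetitions
```

Under a power law with a small exponent, each extra “nine” of reliability costs an enormous factor of compute; under exponential decay it costs a modest additive amount.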
So while I agree that scaling up will yield progress on reliability as well, at least with our current methods it seems we would be doing things 10 or 100 times more impressive than what we do now before we get to 99.9%-or-better reliability on the things we currently do. Getting to something that is both super-human in capability and has such a tiny probability of failure that it would not be detected seems much further off.
That’s interesting, thanks!

In addition to some generalized concern about “unknown unknowns” leading to faster progress on reliability than expected by default (especially in the presence of commercial incentives for reliability), I also want to point out that there may be some level of capability at which AIs become good at doing things like:
Assessing the reliability of their own thoughts, and putting more effort into things that have the right combination of uncertainty and importance.
Being able to use that effort productively, via things like “trying multiple angles on a question” and “setting up systems for error checking.”
I think that in some sense humans are quite unreliable, and use a lot of scaffolding (variable effort at reliability, consulting with each other and trying to catch things each other missed, using systems and procedures, etc.) to achieve high reliability, when we do so. Because of this, I think AIs could have pretty low baseline reliability (like humans) while finding ways to be effectively highly reliable (like humans). And I think this applies to deception as much as anything else (if a human thinks it’s really important to deceive someone, they’re going to make a lot of use of things like this).
I agree that there is much to do to improve AI reliability, and there are a lot of good reasons (in particular to make AI more useful for us) to do so. So I agree reliability will improve. In fact, I very much hope this happens! I believe faster progress on reliability would go a long way toward enabling positive applications of AI.
I also agree that a likely path to do so is by adjusting the effort based on estimates of reliability and the stakes involved. At the moment, systems such as ChatGPT spend the same computational effort whether someone asks them to tell a joke or asks them for medical advice. I suspect this will change, and variable inference-time computation will become more standard. (Things like “chain of thought” already spend more inference-time compute to get better performance, but they don’t really have a “knob” we can turn to control the computation/reliability tradeoff.)
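As a rough illustration of the kind of “knob” I have in mind, here is a sketch that samples more answers when the stakes are higher; generate_answer is a hypothetical stand-in for a model call, and the low/medium/high budgets are arbitrary choices for the example:

```python
import collections
import random

def generate_answer(prompt):
    """Hypothetical stand-in for sampling one answer from a language model;
    here it just simulates a model that is right about 80% of the time."""
    return "42" if random.random() < 0.8 else str(random.randint(0, 99))

def answer_with_variable_effort(prompt, stakes, max_samples=32):
    """Sketch of a computation/reliability knob: sample more answers
    (self-consistency style) when the stakes are high, stopping early
    once one answer has a clear majority."""
    budget = {"low": 1, "medium": 8, "high": max_samples}[stakes]
    votes = collections.Counter()
    for i in range(1, budget + 1):
        votes[generate_answer(prompt)] += 1
        best, count = votes.most_common(1)[0]
        if count > budget / 2:   # clear majority already reached
            return best, i       # answer, plus compute actually spent
    return votes.most_common(1)[0][0], budget

print(answer_with_variable_effort("tell me a joke", stakes="low"))
print(answer_with_variable_effort("summarize this medical report", stakes="high"))
```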
Regarding the deception issue, it might still be the case that such extra effort is observable, and also that it comes at the cost of solving the original task. (So your performance per unit of compute is worse if you are not just trying to solve the task but also trying to deceive in a highly reliable way.)
In particular, even if we apply inference-time compute, unfortunately I don’t think we know of a path to get an O(log(1/δ)) overhead in inference time to achieve a failure probability of δ. It seems that we are still stuck in a regime where the overhead scales like δ^(−c) for some constant c, i.e., polynomially in 1/δ. So if you wanted a 99.9% probability of not getting caught, you would incur a very noticeable effort.
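To put rough numbers on that (with c = 0.5 and c = 1 as purely illustrative exponents):

```python
import math

delta = 0.001  # target: 99.9% probability of not getting caught

# If overhead scaled like O(log(1/delta)), the cost factor would be modest:
print(math.log(1 / delta))        # ~6.9

# If overhead scales like delta^(-c) for some constant c, it is very noticeable:
for c in (0.5, 1.0):
    print(c, (1 / delta) ** c)    # ~31.6x and 1000x
```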
I’m not convinced it requires a huge compute tax to reliably avoid being caught. (If I were, I would in fact probably be feeling a lot more chill than I am.)
The analogy to humans seems important. Humans are capable of things like going undercover, and pulling off coups, and also things like “working every day with people they’d fire if they could, without clearly revealing this.” I think they mostly pull this off with:
Simple heuristics like “Be nice, unless you’re in the very unusual situation where hostile action would work well.” (I think the analogy to how AIs might behave is straightforward.)
The fact that they don’t need to be perfect—lots of behaviors and situations are just ambiguous. (I think this also applies to AI, and maybe more so: there could be lots of weird little incidents that are chalked up to bugs without causing concern sufficient to address the problem reliably.)
Also, humans generally need to do a lot of reasoning along the lines of “X usually works, but I do need to notice the rare situations when something radically different is called for.” So if this is expensive, they just need to be doing that expensive thing a lot.