I agree that there is much to do to improve AI reliability, and there are a lot of good reasons (in particular to make AI more useful for us) to do so. So I agree reliability will improve. In fact, I very much hope this happens! I believe faster progress on reliability would go a long way toward enabling positive applications of AI.
I also agree that a likely path to do so is by adjusting the effort based on estimates of reliability and the stakes involved. At the moment, systems such as ChatGPT spend the same computational effort whether someone asks them to tell a joke or asks them for medical advice. I suspect this will change, and variable inference-time computation will become more standard. (Techniques like “chain of thought” already spend more inference-time compute to get better performance, but they don’t really give us a “knob” we can turn to control the computation/reliability tradeoff.)
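For concreteness, here is a minimal sketch of what such a knob could look like: best-of-n sampling where the number of samples scales with the stakes of the query. The `generate` and `score` functions are hypothetical stand-ins (in a real system they would be calls to a model and to some verifier or reward model); this is an illustration of the idea, not how any deployed system actually works.

```python
import random

# Hypothetical stand-ins for a model call and a quality estimate.
def generate(prompt: str) -> str:
    return f"candidate-{random.randint(0, 9)} for: {prompt}"

def score(prompt: str, answer: str) -> float:
    return random.random()  # placeholder quality/verifier score

def answer_with_budget(prompt: str, stakes: float) -> str:
    """Crude 'knob': spend more samples (inference compute) on higher-stakes queries."""
    n_samples = max(1, int(1 + 15 * stakes))  # stakes in [0, 1] -> roughly 1..16 samples
    candidates = [generate(prompt) for _ in range(n_samples)]
    return max(candidates, key=lambda a: score(prompt, a))

print(answer_with_budget("Tell me a joke", stakes=0.05))      # cheap: ~1 sample
print(answer_with_budget("Dosage question", stakes=0.95))      # expensive: ~15 samples
```

The point is just that the sample count becomes a dial: more compute buys more reliability on the queries where it matters.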
Regarding the deception issue, it might still be the case that such extra effort is observable, and also that it comes at the cost of solving the original task. (So your performance per unit of compute is worse if you are not just trying to solve the task but also trying to deceive in a highly reliable way.)
In particular, even if we apply inference-time compute, unfortunately I don’t think we know of a path to get an O(log(1/δ)) overhead in inference time to achieve a failure probability of δ. It seems that we are still stuck in the δ-constant regime. So if you wanted a 99.9% probability of not getting caught, you would incur a very noticeable effort.
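To make the gap between the two regimes concrete, here is a toy calculation with made-up numbers (the per-attempt failure probability p and the “hard input” fraction q are assumptions purely for illustration, and correlated failures are just one way to fall outside the logarithmic regime). With independent retries, the overhead to reach failure probability δ grows only like log(1/δ); with correlated failures, extra compute stops helping past a constant floor.

```python
import math

# Toy numbers, purely for illustration.
p = 0.2   # assumed failure probability of a single attempt
q = 0.05  # assumed fraction of inputs on which every attempt fails (correlated errors)

# Regime 1: if attempts fail independently, k attempts all fail with probability p**k,
# so reaching failure probability delta needs only k ~ log(1/delta) / log(1/p) attempts.
for delta in (1e-1, 1e-2, 1e-3):
    k = math.ceil(math.log(1 / delta) / math.log(1 / p))
    print(f"independent attempts needed for delta={delta:g}: {k}")

# Regime 2: with correlated failures, the overall failure probability has a floor of q,
# so no amount of extra inference compute pushes it below that constant.
for k in (1, 10, 100):
    delta_k = q + (1 - q) * p**k
    print(f"failure probability after {k} attempts with correlated errors: {delta_k:.4f}")
```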
I’m not convinced it requires a huge compute tax to reliably avoid being caught. (If I were, I would in fact probably be feeling a lot more chill than I am.)
The analogy to humans seems important. Humans are capable of things like going undercover and pulling off coups, as well as things like “working every day with people they’d fire if they could, without clearly revealing this.” I think they mostly pull this off with:
Simple heuristics like “Be nice, unless you’re in the very unusual situation where hostile action would work well.” (I think the analogy to how AIs might behave is straightforward.)
The fact that they don’t need to be perfect—lots of behaviors and situations are just ambiguous. (I think this also applies to AI, and maybe more so: there could be lots of weird little incidents that are chalked up to bugs without causing concern sufficient to address the problem reliably.)
Also, humans generally need to do a lot of reasoning along the lines of “X usually works, but I do need to notice the rare situations when something radically different is called for.” So if this is expensive, it’s an expensive thing they already need to be doing a lot of anyway.