For example, we could simulate a bunch of human-level scientists trying to build nanobots and also checking each-other’s work.
That is not passively safe, and therefore not weak. Forget the inner workings of the idea for now: at the end of the process you get a design for nanobots that you have to build and deploy in order to do the pivotal act. That means you are giving a system designed by your AI the ability to act in the real world. If you have not fully solved the alignment problem for this AI, you can’t be sure that the nanobot design is safe unless you are capable enough to understand the nanobots yourself, without relying on explanations from the scientists.
And even if we look into the inner details of the idea: presumably each individual scientist-simulation is not aligned (if it is, you must already have solved the alignment problem). So you have a bunch of unaligned human-level agents who want to escape and who can communicate among themselves (at the very least, they need to be able to share the nanobot designs with each other for criticism).
You’d need to be extremely paranoid and scrutinize each communication between the scientist-simulations to prevent them from coordinating against you and bypassing the review system. That means putting actual humans between the scientists, which, even if it works, must slow things down so much that the simulated scientists probably can’t design the nanobots in time.
Nope. I think that you could build a useful AI (e.g. the hive of scientists) without doing any out-of-distribution stuff.
I guess this is true, but only because the individual scientist AI that you train is only human-level (so the training is safe), and then you amplify it to superhuman level with many copies. If you train a powerful AI directly then there must be such a distributional shift (unless you just don’t care about making the training safe, in which case you die during the training).
Roll to disbelief. Cooperation is a natural equilibrium in many games.
Cooperation and corrigibility are very different things. Arguably, corrigibility means being indifferent to operators defecting against you. It forces the agent to behave like CooperateBot with the operators, even when the operators visibly want to destroy it. This strategy does not arise as a natural equilibrium in multi-agent games.
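To make the distinction concrete, here is a toy sketch in Python. It assumes a symmetric two-player stag-hunt game with made-up payoffs (the numbers, and modeling the operator as just another player, are my own illustrative assumptions, not anything from the original argument). In this game mutual cooperation is an equilibrium, but cooperating against a defector is not a best response, which is exactly what CooperateBot does.

```python
# Toy symmetric 2x2 game (stag-hunt style). Payoffs are hypothetical,
# chosen only so that mutual cooperation is an equilibrium.
# PAYOFF[(my_action, their_action)] is the payoff to "me".
PAYOFF = {
    ("C", "C"): 4,
    ("C", "D"): 0,
    ("D", "C"): 3,
    ("D", "D"): 2,
}
ACTIONS = ("C", "D")


def best_responses(their_action):
    """Return every action that maximizes my payoff against their_action."""
    best = max(PAYOFF[(a, their_action)] for a in ACTIONS)
    return [a for a in ACTIONS if PAYOFF[(a, their_action)] == best]


def is_nash(agent_action, operator_action):
    """A profile is a Nash equilibrium if each action is a best response."""
    return (agent_action in best_responses(operator_action)
            and operator_action in best_responses(agent_action))


def cooperate_bot(their_action):
    """CooperateBot ignores what the other player does."""
    return "C"


if __name__ == "__main__":
    print("C,C is an equilibrium:", is_nash("C", "C"))
    print("CooperateBot vs defecting operator is an equilibrium:",
          is_nash(cooperate_bot("D"), "D"))
```

The point of the sketch is only that "cooperation can be an equilibrium" and "unconditional cooperation is an equilibrium strategy" are separate claims: the first can hold in a game where the second fails.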
Sure you can. Just train an AI that “wants” to be honest. This probably means training an AI with the objective function “accurately predict reality”
If we knew how to do this then it would indeed solve point 31 for this specific AI and actually be pretty useful. But the reason we have ELK as an unsolved problem going around is precisely that we don’t know any way of doing that.
How do you know that an AI trained to accurately predict reality actually does that, instead of “accurately predict reality if it’s less than 99% sure it can take over the world, and take over the world otherwise”? If you have to rely on behavioral inspection and can’t directly read the AI’s mind, then your only chance of distinguishing between the two is misleading the AI into thinking that it can take over the world and observing it as it attempts to do so, which doesn’t scale as the AI becomes more powerful.
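A minimal sketch of why behavioral inspection fails here, in Python. Everything in it is a hypothetical stand-in: the two policies, the scalar "takeover confidence" input, and the test sets are illustrative assumptions, not a model of any real system. The two policies return the same output on every ordinary test and differ only on an input where the AI believes takeover is nearly certain.

```python
# Hypothetical threshold taken from the "99% sure" example in the text.
THRESHOLD = 0.99


def honest_policy(takeover_confidence):
    """Always just predicts reality, whatever its chances of takeover."""
    return "predict"


def conditional_policy(takeover_confidence):
    """Predicts reality unless it is at least 99% sure it can take over."""
    return "predict" if takeover_confidence < THRESHOLD else "take_over"


def behaviorally_distinguishable(test_confidences):
    """True if some test input makes the two policies act differently."""
    return any(honest_policy(c) != conditional_policy(c)
               for c in test_confidences)


# Ordinary evaluation: the AI never believes takeover is near-certain.
ordinary_tests = [i / 100 for i in range(0, 99)]  # 0.00 .. 0.98

# Evaluation with one "honeypot" that convinces the AI it can win.
honeypot_tests = ordinary_tests + [0.995]

if __name__ == "__main__":
    print(behaviorally_distinguishable(ordinary_tests))
    print(behaviorally_distinguishable(honeypot_tests))
```

The only test that separates the two policies is the honeypot, and building it requires actually convincing the AI that it can take over, which is the part that stops working as the AI gets more capable.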
I’m virtually certain I could explain to Aristotle or DaVinci how an air-conditioner works.
Yes, but this is not the point. The point is that if you just show them the design, they would not by themselves understand or predict beforehand that cold air will come out. You’d have to also provide them with an explanation of thermodynamics and how the air conditioner exploits its laws. And I’m quite confident that you could also convince Aristotle or DaVinci that the air conditioner works by concentrating and releasing phlogiston, and therefore the air will come out hot.
I think I mostly agree with you on the other points.