Hi! Thank you for writing this and suggesting solutions. I have a number of points to discuss. Apologies in advance for all the references to Arbital; it’s a really nice resource.
> The AI will hack the system and produce outputs that it’s not theoretically meant to be able to produce at all.
In the first paragraphs following this, you describe this first kind of misalignment as an engineering problem: you try to guarantee that the instructions actually executed on the hardware correspond exactly to the code you wrote, i.e. that the system is robust against hardware tampering.
I argue that this is actually a subset of the second kind of misalignment. You may have solved the initial engineering problem, so that at the start the hardware does exactly what the software says, but the agent’s own hardware is part of the world, and so can plausibly be influenced by whatever the agent outputs.
You can attempt to specifically bar the agent from taking actions that target its own hardware, but that is no longer a hardware problem; it is your second kind of misalignment. Against a sufficiently advanced agent, one that may find cleverer strategies than the cleverest human hacker, no hardware is safe.
Moreover, as long as the agent can interact with the outside world, its effective hardware may have more parts than you expect (the machines it talks to, the channels it communicates over). We still have a long way to go before we can be confident about that part.
> Of course the problem with this oracle is that it’s far too inefficient. On every single run we can get at most 1 bit of information, but for that one bit of information we’re running a superhuman artificial intelligence. By the time it becomes practical, ordinary superhuman AIs will have been developed by someone else and destroyed the world.
There are other problems too: for instance, how can you be sure that the agent hasn’t figured out how to game the automated theorem prover into validating its proofs? Your conclusion seems to be that if we manage to make the setup safe enough, it will also become impractical. And if you try to get more than one bit of information per run, you run into other issues.
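To make that attack surface concrete, here is a minimal sketch of the one-bit interface as I understand it; `one_bit_oracle` and `toy_checker` are hypothetical stand-ins of my own invention (the toy checker is nothing like a real theorem prover), not anything from your post:

```python
from typing import Callable

def one_bit_oracle(ai_output: str, check_proof: Callable[[str], bool]) -> bool:
    """Only the checker's verdict leaves the box; the raw output never reaches a human."""
    try:
        return bool(check_proof(ai_output))
    except Exception:
        # Malformed output counts as "no proof". Note that this wrapper and the
        # checker are themselves code the agent's output flows through, i.e. the
        # surface a sufficiently capable agent would try to game.
        return False

# Toy stand-in for the automated theorem prover: the "theorem" is just
# "there exist two integers summing to 10", and the "proof" is the pair itself.
def toy_checker(proof: str) -> bool:
    x, y = (int(v) for v in proof.split(","))
    return x + y == 10

print(one_bit_oracle("3,7", toy_checker))          # True  -> we learn exactly one bit
print(one_bit_oracle("3,8", toy_checker))          # False -> we learn exactly one bit
print(one_bit_oracle("not a proof", toy_checker))  # False (parse error swallowed)
```

Even in this maximally restricted setup, the agent’s output still flows through the checker and the wrapper code around it, and that is exactly where it would look for an exploit.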
> This satisfies our second requirement—we can verify the AIs solution, so we can tell if it’s lying. There’s also some FNP problems which satisfy the first requirement—there’s only one right answer. For example, finding the prime factors of an integer.
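For concreteness, the verify-versus-solve asymmetry this relies on looks like the following minimal sketch (trial-division primality testing and tiny numbers, purely for illustration):

```python
from math import prod

def is_prime(n: int) -> bool:
    """Trial division; fine for the tiny illustrative inputs here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def verify_factorization(n: int, factors: list[int]) -> bool:
    """The cheap side of the FNP asymmetry: check a claimed answer without having to find it."""
    return bool(factors) and all(is_prime(f) for f in factors) and prod(factors) == n

# The untrusted solver does the hard part (finding the factors);
# we keep only the easy verification step on our side.
print(verify_factorization(2021, [43, 47]))   # True
print(verify_factorization(2021, [3, 673]))   # False: 3 * 673 = 2019
```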
Here the verification process is no longer automated; it’s us. You correctly point out that most useful problems have various possible solutions, and the more information we feed the agent, the more likely it is to find some solution that exploits our flaws and… starts a nuclear war, in your example.
I am confused by your setup, which seems to be trying to make it harder for the agent to harm us, when it shouldn’t even be trying to harm us in the first place. In other words, I’m asking: is there a hidden assumption that, in the process of solving FNP problems, the agent will need to explore dangerous plans?
> A superintelligent FNP problem solver would be a huge boon towards building AIs that provably had properties which are useful for alignment. Maybe it’s possible to reduce the question “build an aligned AI” to an FNP problem, and even if not, some sub-parts of that problem definitely should be reducible.
I would say that building a safe superintelligent FNP solver requires solving AI alignment in the first place. A less powerful FNP solver might help with sub-parts of the problem… but which ones?