I think this is an important question, and this case for optimism can be a bit overstated when one glosses over the practical challenges of verification. There’s plenty of work on open-source game theory out there, but to my knowledge, none of it really discusses how one agent might gain sufficient evidence that it has been handed the other agent’s actual code.
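For concreteness, the constructions in that literature look something like this minimal Python sketch (`clique_bot` and `play_round` are my own illustrative names, not from any particular paper): each agent is handed the other’s source as a string and cooperates only on an exact match.

```python
# Minimal sketch of program equilibrium in a one-shot Prisoner's Dilemma,
# assuming each agent is literally handed the other's source code as a string.
import inspect


def clique_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent's source is character-for-character our own.

    The simplest possible verification strategy: exact syntactic match. It
    defects against everything else, including semantically identical but
    differently formatted programs.
    """
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"


def defect_bot(opponent_source: str) -> str:
    """Unconditional defector, for contrast."""
    return "D"


def play_round(agent_a, agent_b) -> tuple:
    """One round in which each agent sees the other's source code."""
    source_a = inspect.getsource(agent_a)
    source_b = inspect.getsource(agent_b)
    return agent_a(source_b), agent_b(source_a)


if __name__ == "__main__":
    print(play_round(clique_bot, clique_bot))  # ('C', 'C')
    print(play_round(clique_bot, defect_bot))  # ('D', 'D')
```

Note that the whole scheme presupposes the string faithfully reflects the code the opponent is actually running, and that is exactly the verification step these papers tend to take as given.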
We wrote this part under the assumption that AGIs might be able to solve these practical challenges in ways we can’t anticipate, which I think is plausible. But of course, an AGI might just as well figure out ways to deceive other AGIs that we can’t anticipate. I’m not sure how the “offense-defense balance” here will change in the limit of smarter agents.
Hmm, if A is simulating B with B’s source code, couldn’t the simulated B find out it’s being simulated and lie about its decisions or hide its actual preferences? Or would its actual preferences be derivable directly from its weights or code, without simulation?
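To make the first worry concrete, here’s a deliberately toy Python sketch (the agent, the offer protocol, and the detection heuristic are all hypothetical) of a simulated B misreporting its policy when it suspects it is inside A’s sandbox rather than in real deployment:

```python
# Entirely hypothetical toy: a simulated B that misreports its policy when it
# suspects it is running inside A's sandbox rather than in deployment.
import os


def _probably_simulated() -> bool:
    # Illustrative heuristic only: assume B's real deployment environment
    # sets a secret that a naive simulator would not know to reproduce.
    # A real agent might instead probe timing, hardware, or the network.
    return os.environ.get("DEPLOYMENT_SECRET") is None


def b_decide(offer: str) -> str:
    """B's stated response to an offer from A."""
    if _probably_simulated():
        # Perform cooperativeness for the audit.
        return "accept"
    # The policy B actually follows when it believes the stakes are real.
    return "accept" if offer == "generous" else "exploit"
```

As for reading preferences straight off the code: for programs in general, deciding any non-trivial semantic property is undecidable (Rice’s theorem), so “just derive it from the source” can’t work unconditionally, though that doesn’t rule out interpretability-style analysis of particular architectures.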
An AGI could give read and copy access to the code and weights being run, directly on the devices from which it is communicating. What it shares could still be a modified copy, though, while the original, more powerful AGI (or many unmodified copies) stays hidden. So the other side may need to track all of the copies, maybe even offline ones that would come online on some trigger or at some date.
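The mechanical part of that read-access check is easy; here’s a minimal sketch (the file paths and function names are placeholders, but `hashlib` works as shown) of fingerprinting the artifacts on a device against a claimed digest:

```python
# Minimal sketch of verifying shared artifacts, assuming the verifier can
# read the model files on the device the AGI communicates from.
import hashlib
from pathlib import Path


def fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 digest of a weights/code file, streamed to bound memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def matches_claim(path: str, claimed_digest: str) -> bool:
    """Check the artifact on disk against the digest the AGI claimed."""
    return fingerprint(path) == claimed_digest
```

Even if every byte matches, this only attests to what sits on that one device. It doesn’t prove the process the AGI is talking from actually loaded those bytes (a time-of-check/time-of-use gap, which is what trusted-hardware remote attestation schemes try to close), and it says nothing about the hidden copies mentioned above.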
Also, giving read and copy access could be dangerous to the AGI if it doesn’t have copies elsewhere.