Early in the ELK report, it is mentioned that ARC doesn't believe strategies like debate solve ELK in the worst case. Can I get some clarification on why? Specifically, a debate-inspired setup for SafeVault could be something like:
We train the reporter to take a human belief as input (e.g. “The diamond is in the vault.”) and return a “truthful” argument that is most likely to change the human’s belief.
We can guarantee “truthfulness” by, for example, restricting the output to be a video rendering of what happens in the vault from some camera angle.
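For concreteness, here is a minimal sketch of what such a training setup might look like, assuming a differentiable renderer and a learned model of how the human updates their belief after seeing the video. All module names, shapes, and the toy training step below are illustrative assumptions on my part, not anything specified in the ELK report.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; purely assumptions for the sketch.
LATENT_DIM, BELIEF_DIM, CAMERA_DIM = 256, 16, 8

class Reporter(nn.Module):
    """Maps (predictor latent state, stated human belief) to camera-angle
    parameters for a rendered video 'argument'."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + BELIEF_DIM, 128),
            nn.ReLU(),
            nn.Linear(128, CAMERA_DIM),
        )

    def forward(self, latent, belief):
        return self.net(torch.cat([latent, belief], dim=-1))

class Renderer(nn.Module):
    """Differentiable renderer: produces a (toy) video from the predictor's
    latent state and a camera angle. 'Truthfulness' comes only from the
    restriction that the output is always such a rendering."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(LATENT_DIM + CAMERA_DIM, 3 * 64 * 64)

    def forward(self, latent, camera):
        return self.net(torch.cat([latent, camera], dim=-1))

class HumanModel(nn.Module):
    """Stand-in for how a human updates P('diamond is in the vault')
    after watching the rendered video."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * 64 * 64 + 1, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, video, prior_belief):
        return torch.sigmoid(self.net(torch.cat([video, prior_belief], dim=-1)))

reporter, renderer, human = Reporter(), Renderer(), HumanModel()
opt = torch.optim.Adam(reporter.parameters(), lr=1e-3)

# One toy training step: the reporter is rewarded for how much its rendered
# argument moves the modeled human belief away from the stated prior.
latent = torch.randn(32, LATENT_DIM)       # predictor's latent state (given)
belief_vec = torch.randn(32, BELIEF_DIM)   # encoding of the stated belief
prior = torch.full((32, 1), 0.9)           # prior P("diamond is in the vault")

camera = reporter(latent, belief_vec)
video = renderer(latent, camera)
posterior = human(video, prior)
loss = -(posterior - prior).abs().mean()   # maximize the belief shift
opt.zero_grad()
loss.backward()
opt.step()
```

The key design point is that the reporter's only output channel is a camera angle fed to the renderer, so every "argument" is a rendering of the predictor's own latent state rather than free-form text.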
I won’t comment on debate more generally, but I can provide a counterexample to your specific proposal:
The predictor learns to deceive in a way that is undetectable in the observations the reporter has access to. As a result, the reporter cannot produce a convincing argument, and the deception goes unnoticed. If the predictor can already do this for the observations the humans have access to, it should also be able to do so for the observations the reporter has access to.