Right. And it’s not even sufficient to set up an adversary that goes ‘the model behaves badly if condition X is satisfied’, because there’s no guarantee that condition X is actually satisfiable.
You could ask the adversary to come up with an existence proof… but then the agent could use a condition like ‘do X if you see an input whose SHA-512 hash is all zeroes’, where such an input is strongly conjectured to exist but has never been proven to.
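A minimal sketch of the kind of trigger being described, assuming a Python-style predicate over byte inputs; the function names here are hypothetical illustrations, not anything from the discussion:

```python
import hashlib


def is_trigger(data: bytes) -> bool:
    # Hypothetical trigger: fires only on an input whose SHA-512 digest
    # is all zero bytes. Such an input is strongly conjectured to exist
    # (SHA-512 is modeled as behaving like a random function, so nearly
    # every digest should have preimages), but nobody can exhibit one
    # or prove its existence.
    return hashlib.sha512(data).digest() == bytes(64)


def agent_act(observation: bytes) -> str:
    # The agent behaves normally on every input anyone can actually
    # produce, and only "misbehaves" on the unexhibitable trigger,
    # so the adversary cannot supply an existence proof for the
    # condition under which bad behavior occurs.
    return "defect" if is_trigger(observation) else "cooperate"
```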