the existence of predicates on the world that are easier to evaluate than generate examples of (in the same way that verifying the answer to a problem in NP can be easier than generating it) guarantees that the model should be better at distinguishing between evaluation and deployment than any evaluator can be at tricking it into thinking it’s in deployment
Where does the guarantee come from? Why do we know that for this specific problem (generating vs. evaluating whether the model is deployed) it’s easier to evaluate? For many problems it’s equally difficult, right?
We know that there exist predicates which are easier to evaluate than generate examples of (e.g. RSA-2048), such that, given that the model is actively looking for such predicates, it will be able to use them to make evaluation easier than generation.
Where does the guarantee come from? Why do we know that for this specific problem (generating vs. evaluating whether the model is deployed) it’s easier to evaluate? For many problems it’s equally difficult, right?
We know that there exist predicates which are easier to evaluate than generate examples of (e.g. RSA-2048), such that, given that the model is actively looking for such predicates, it will be able to use them to make evaluation easier than generation.