OpenAI has a golden opportunity with o3 (and o4) to collect a large number of samples of the type of deceptive behavior that is actually useful for increasing performance in the benchmarks they’re hill-climbing on.
There is at least one happy way and one sad way they could use such a dataset.
On the sad side, they could of course just build a reward hacking classifier and then do RL based on the output of that classifier. I expect this leads reward hacky behavior to become more subtle, but doesn’t eliminate it and so later versions of the model still can’t be trusted to competently execute hard-to-verify tasks. I doubt this is x-risky, because “a model that is super smart but can’t be trusted to do anything useful and also can’t delegate subtasks to copies of itself” just doesn’t seem that scary, but it’s still a sad path.
On the happy side, they could build a classifier and test various RL configurations to determine how quickly each configuration introduces deceptive reward hacky behavior. In other words, instead of using the classifier to train the model to exhibit less deceptive/reward-hacky behavior, they could use it to train their engineers to build less deceptive/reward-hacky models.
At some point gradient hacking is likely to become a concern, but for the time span we find ourselves in a situation where sandbagging is obvious and we have a clear signal of deceptiveness, it would be great if we could take full advantage of that situation.
OpenAI has a golden opportunity with o3 (and o4) to collect a large number of samples of the type of deceptive behavior that is actually useful for increasing performance in the benchmarks they’re hill-climbing on.
There is at least one happy way and one sad way they could use such a dataset.
On the sad side, they could of course just build a reward hacking classifier and then do RL based on the output of that classifier. I expect this leads reward hacky behavior to become more subtle, but doesn’t eliminate it and so later versions of the model still can’t be trusted to competently execute hard-to-verify tasks. I doubt this is x-risky, because “a model that is super smart but can’t be trusted to do anything useful and also can’t delegate subtasks to copies of itself” just doesn’t seem that scary, but it’s still a sad path.
On the happy side, they could build a classifier and test various RL configurations to determine how quickly each configuration introduces deceptive reward hacky behavior. In other words, instead of using the classifier to train the model to exhibit less deceptive/reward-hacky behavior, they could use it to train their engineers to build less deceptive/reward-hacky models.
At some point gradient hacking is likely to become a concern, but for the time span we find ourselves in a situation where sandbagging is obvious and we have a clear signal of deceptiveness, it would be great if we could take full advantage of that situation.