In a recent post, Wei Dai mentions a similar distinction (italics added by me):
Supervised training—This is safer than reinforcement learning because we don’t have to worry about reward hacking (i.e., reward gaming and reward tampering), and it eliminates the problem of self-confirming predictions (which can be seen as a form of reward hacking). In other words, if the only thing that ever sees the Oracle’s output during a training episode is an automated system that computes the Oracle’s reward/loss, and that system is secure because it’s just computing a simple distance metric (comparing the Oracle’s output to the training label), then reward hacking and self-confirming predictions can’t happen.
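To make the quoted setup concrete, here is a minimal sketch (my own, not from the post) of a supervised training episode in which the only thing that ever consumes the model's output is an automated loss computation, namely a simple distance metric against the training label. It assumes a PyTorch-style setup; the specific model and variable names are placeholders.

```python
# Sketch, assuming PyTorch: the Oracle's output never leaves this function;
# it is seen only by loss_fn, a simple distance metric against the label.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                 # stand-in for the Oracle model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                     # the "simple distance metric"

def supervised_step(x, label):
    """One training episode: only the automated loss computation sees the output."""
    prediction = model(x)                  # Oracle's output stays inside this scope
    loss = loss_fn(prediction, label)      # compare output to the training label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()                     # only a scalar loss leaves the episode
```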
I think I’ve updated a bit from when I wrote this (due to this discussion). (ETA: I’ve now added a link from that paragraph to this comment.) Now I would say that the safety-relevant differences between SL and RL are:
The loss computation for SL is typically simpler than the reward computation for RL, and therefore more secure / harder to hack, but maybe we shouldn’t depend on that for safety.
SL doesn’t explore, so it can’t “stumble onto” a way to hack the reward/loss computation like RL can. But it can still learn to hack the loss computation or the training label if the model becomes a mesa optimizer that cares about minimizing “loss” (e.g., the output of the physical loss computation) as either a terminal or instrumental goal. In other words, if reward/loss hacking happens with SL, the optimization power for it seemingly has to come from a mesa optimizer, whereas for RL it could come from either the base or mesa optimizer.
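The exploration point in the second bullet can be made concrete with a small sketch (mine, not Wei Dai's) contrasting where the base optimizer's pressure can reach the reward/loss computation. The names `dataset`, `env`, `policy`, `reward_fn`, and `update` are hypothetical placeholders.

```python
# Illustrative contrast: in SL the base optimizer only ever fits fixed
# (input, label) pairs; in RL the policy's own actions determine which
# states the reward computation is evaluated on.

def supervised_training(model, dataset, loss_fn, update):
    # SL: inputs and labels are fixed in advance, so the base optimizer has
    # no channel through which to discover inputs that exploit loss_fn.
    for x, label in dataset:
        update(model, loss_fn(model(x), label))

def rl_training(policy, env, reward_fn, update):
    # RL: exploration chooses the actions, so the training process itself can
    # stumble onto states or actions that exploit reward_fn, and the base
    # optimizer will then reinforce whatever produced the high reward.
    state, done = env.reset(), False
    while not done:
        action = policy(state)             # exploration happens here
        state, done = env.step(action)
        update(policy, reward_fn(state, action))
```

On this picture, hacking the loss under SL would require the model itself to become a mesa optimizer that targets the physical loss computation, whereas under RL the same failure can also arise from the base optimizer's own search.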