In a recent post, Wei Dai mentions a similar distinction (italics added by me):
Supervised training—This is safer than reinforcement learning because we don’t have to worry about reward hacking (i.e., reward gaming and reward tampering), and it eliminates the problem of self-confirming predictions (which can be seen as a form of reward hacking). In other words, if the only thing that ever sees the Oracle’s output during a training episode is an automated system that computes the Oracle’s reward/loss, and that system is secure because it’s just computing a simple distance metric (comparing the Oracle’s output to the training label), then reward hacking and self-confirming predictions can’t happen.
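To make the quoted setup concrete, here is a minimal sketch (my own, not from the post) of a supervised training episode in which the only thing that ever consumes the model's output is an automated loss computation, namely a simple distance metric against the training label. It assumes a PyTorch-style setup; the specific model and variable names are placeholders.

```python
# Sketch, assuming PyTorch: the Oracle's output never leaves this function;
# it is seen only by loss_fn, a simple distance metric against the label.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                 # stand-in for the Oracle model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                     # the "simple distance metric"

def supervised_step(x, label):
    """One training episode: only the automated loss computation sees the output."""
    prediction = model(x)                  # Oracle's output stays inside this scope
    loss = loss_fn(prediction, label)      # compare output to the training label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()                     # only a scalar loss leaves the episode
```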
I think I’ve updated a bit from when I wrote this (due to this discussion). (ETA: I’ve now added a link from that paragraph to this comment.) Now I would say that the safety-relevant differences between SL and RL are:
The loss computation for SL is typically simpler than the reward computation for RL, and therefore more secure / harder to hack, but maybe we shouldn’t depend on that for safety.
SL doesn’t explore, so it can’t “stumble onto” a way to hack the reward/loss computation like RL can. But it can still learn to hack the loss computation or the training label if the model becomes a mesa optimizer that cares about minimizing “loss” (e.g., the output of the physical loss computation) as either a terminal or instrumental goal. In other words, if reward/loss hacking happens with SL, the optimization power for it seemingly has to come from a mesa optimizer, whereas for RL it could come from either the base or mesa optimizer.
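The exploration point in the second bullet can be made concrete with a small sketch (mine, not Wei Dai's) contrasting where the base optimizer's pressure can reach the reward/loss computation. The names `dataset`, `env`, `policy`, `reward_fn`, and `update` are hypothetical placeholders.

```python
# Illustrative contrast: in SL the base optimizer only ever fits fixed
# (input, label) pairs; in RL the policy's own actions determine which
# states the reward computation is evaluated on.

def supervised_training(model, dataset, loss_fn, update):
    # SL: inputs and labels are fixed in advance, so the base optimizer has
    # no channel through which to discover inputs that exploit loss_fn.
    for x, label in dataset:
        update(model, loss_fn(model(x), label))

def rl_training(policy, env, reward_fn, update):
    # RL: exploration chooses the actions, so the training process itself can
    # stumble onto states or actions that exploit reward_fn, and the base
    # optimizer will then reinforce whatever produced the high reward.
    state, done = env.reset(), False
    while not done:
        action = policy(state)             # exploration happens here
        state, done = env.step(action)
        update(policy, reward_fn(state, action))
```

On this picture, hacking the loss under SL would require the model itself to become a mesa optimizer that targets the physical loss computation, whereas under RL the same failure can also arise from the base optimizer's own search.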