I still feel confused about “distill ≈ RL”. In RL+Imitation (which I assume is also talking about distillation, and which was written after Semi-supervised reinforcement learning), Paul says things like “In the same way that we can reason about AI control by taking as given a powerful RL system or powerful generative modeling, we could take as given a powerful solution to RL+imitation. I think that this is probably a better assumption to work with” and “Going forward, I’ll preferentially design AI control schemes using imitation+RL rather than imitation, episodic RL, or some other assumption”.
Was there a later place where Paul went back to just RL? Or is RL+Imitation about something other than distillation? Or is the imitation part such a small contribution that writing “distill ≈ RL” is still accurate?
1.2.2: OK, so given this amplified aligned agent, how do you get the distilled agent?
Train a new agent via some combination of imitation learning (predicting the actions of the amplified aligned agent), semi-supervised reinforcement learning (where the amplified aligned agent helps specify the reward), and techniques for optimizing robustness (e.g. creating red teams that generate scenarios that incentivize subversion).
The imitation learning is more about getting this new agent off the ground than about ensuring alignment. The bulk of the alignment guarantee comes from the semi-supervised reinforcement learning, where we train it to work on a wide range of tasks and answer questions about its cognition.
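To make the quoted recipe concrete, here is a toy sketch of how an imitation term and an RL term might be blended into one training loss. Everything in it is an assumption for illustration and not something the FAQ or Paul's posts specify: the names `distill_loss` and `imitation_weight`, the linear annealing schedule, and the REINFORCE-style surrogate are all hypothetical. The reward is passed in as a plain number standing in for whatever the amplified agent helps specify, and the robustness techniques are left out entirely.

```python
import math


def softmax(logits):
    """Turn raw action scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def imitation_loss(logits, expert_action):
    """Cross-entropy against the action the amplified agent chose."""
    return -math.log(softmax(logits)[expert_action])


def rl_surrogate_loss(logits, taken_action, reward):
    """REINFORCE-style surrogate: -reward * log pi(taken_action).

    `reward` stands in for a signal the amplified agent helps specify.
    """
    return -reward * math.log(softmax(logits)[taken_action])


def imitation_weight(step, warmup_steps):
    """Assumed schedule: imitation weight decays linearly from 1 to 0."""
    return max(0.0, 1.0 - step / warmup_steps)


def distill_loss(logits, expert_action, taken_action, reward,
                 step, warmup_steps=1000):
    """Blend the two terms; imitation dominates early, RL dominates later."""
    w = imitation_weight(step, warmup_steps)
    return (w * imitation_loss(logits, expert_action)
            + (1.0 - w) * rl_surrogate_loss(logits, taken_action, reward))
```

Under this assumed schedule the loss is pure imitation at step 0 and pure RL once `step` reaches `warmup_steps`, which is one way to read "getting this new agent off the ground". Whether the imitation term should ever vanish completely is exactly the kind of thing the quoted text leaves open.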
ETA: From the FAQ for Paul’s agenda: