riceissa comments on List of resolved confusions about IDA

riceissa 1 Oct 2019 6:23 UTC
I still feel confused about “distill ≈ RL”. In RL+Imitation (which I assume is also talking about distillation, and which was written after Semi-supervised reinforcement learning), Paul says things like “In the same way that we can reason about AI control by taking as given a powerful RL system or powerful generative modeling, we could take as given a powerful solution to RL+imitation. I think that this is probably a better assumption to work with” and “Going forward, I’ll preferentially design AI control schemes using imitation+RL rather than imitation, episodic RL, or some other assumption”.

Was there a later place where Paul went back to just RL? Or is RL+Imitation about something other than distillation? Or is the imitation part such a small contribution that writing “distill ≈ RL” is still accurate?

ETA: From the FAQ for Paul’s agenda:

1.2.2: OK, so given this amplified aligned agent, how do you get the distilled agent?

Train a new agent via some combination of imitation learning (predicting the actions of the amplified aligned agent), semi-supervised reinforcement learning (where the amplified aligned agent helps specify the reward), and techniques for optimizing robustness (e.g. creating red teams that generate scenarios that incentivize subversion).

and:

The imitation learning is more about getting this new agent off the ground than about ensuring alignment. The bulk of the alignment guarantee comes from the semi-supervised reinforcement learning, where we train it to work on a wide range of tasks and answer questions about its cognition.