Well so you’re obviously pretraining using imitation learning, so I’ve got that part down.
If I understand your post right, the rest of the policy training is done by policy gradients on human-provided rewards? As I understand it, policy gradient is close to a maximally sample-hungry method, because it does not do any modelling. At one level I would class this as random exploration, but at another level the humans are allowed to provide reinforcement based on methods rather than results, so I suppose this also gives it an element of imitation learning.
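To make the "no modelling" point concrete, here is a minimal vanilla REINFORCE sketch on a toy two-armed bandit (not your actual setup; the environment, arm payoffs, and hyperparameters are illustrative assumptions). The update only ever touches sampled actions and their sampled rewards, so every bit of learning signal has to come from trial and error rather than from any model of the task:

```python
import numpy as np

# Minimal REINFORCE sketch on a hypothetical 2-armed bandit.
# The point: the update uses only sampled (action, reward) pairs
# (a score-function estimate); nothing about the environment is modelled.

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])   # illustrative arm payoffs (assumption)
theta = np.zeros(2)                  # policy logits
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)            # sample an action from the policy
    r = rng.normal(true_means[a], 1.0)    # sample a noisy reward
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                 # d/dtheta of log pi(a)
    theta += lr * r * grad_log_pi         # REINFORCE update: reward-weighted score

print("final action probabilities:", softmax(theta))
```

Because the gradient estimate is just reward times the score function of whatever action happened to be sampled, its variance (and hence the sample count needed) grows quickly with the size of the action space, which is what drives my worry below.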
So I guess my expectation is that your training method is too sample-inefficient to achieve much beyond human imitation.