TurnTrout comments on A shot at the diamond-alignment problem

TurnTrout 8 Oct 2022 0:05 UTC
LW: 2 AF: 2
What about this post?
- tailcalled 8 Oct 2022 7:13 UTC
  2 points
  Well so you’re obviously pretraining using imitation learning, so I’ve got that part down.
  
  If I understand your post right, the rest of the policy training is done by policy gradients on human-induced rewards? As I understand it, policy gradient is close to a maximally sample-hungry method, because it does not do any modelling. At one level I would class this as random exploration, but on another level the humans are allowed to provide reinforcement based on methods rather than results, so I suppose this also gives it an element of imitation learning.
  
  So I guess my expectation is that your training method is too sample-inefficient to achieve much beyond human imitation.
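
  To illustrate the "does not do any modelling" point, here is a minimal sketch of a REINFORCE-style policy gradient update on a toy two-armed bandit. Everything in it (the bandit, the function names `softmax` and `reinforce_bandit`, the learning rate and step count) is a hypothetical setup chosen for illustration, not the training procedure from the post. The only learning signal is the reward of the action that was actually sampled; nothing is learned about the arms that were not tried.

```python
import math
import random


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(reward_probs, steps=2000, lr=0.1, seed=0):
    """Train a softmax policy on a Bernoulli bandit with plain REINFORCE.

    reward_probs[i] is the (hypothetical) chance that arm i pays reward 1.
    Returns the final action probabilities.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(reward_probs)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one action from the current policy.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        # Observe a reward for that action only; no model of the arms is kept.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        # d/d logit_i of log pi(action) is onehot(action)_i - probs_i.
        for i in range(len(logits)):
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_pi
    return softmax(logits)


if __name__ == "__main__":
    print(reinforce_bandit([0.0, 1.0]))
```

  Each update here uses a single sampled action and its reward, which is one way to see why the method needs many samples: every piece of information about the environment has to arrive through the policy's own rollouts.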