The idea is very close to approval-directed agents, but with the process automated so that far more training is possible than would be feasible with humans providing the feedback. It could also have other benefits, such as adversarial learning (which a human could not necessarily provide as well as a purpose-trained AI) and learning to reproduce the neural net comprising the user (which would be much easier to attempt with an AI user than with a human, and which could give a more rigorous test of whether the predictor is actually becoming aligned or developing harmful mesa-optimizers).
Inverse reinforcement learning, if I understand correctly, involves a human and AI working together. While that might be helpful, it seems unlikely that solely human-supervised learning would work as well as having the option to train a system unsupervised. Certainly this would not have been enough for AlphaGo!
Inverse reinforcement learning, if I understand correctly, involves a human and AI working together
I think IRL just refers to the general setup of trying to infer an agent's goals from its actions (and possibly from communication/interaction with the agent). So you wouldn't need to learn the human utility function purely from human feedback. Although, I don't think relying on human feedback would necessarily be a deal-breaker: it seems like most of the work of making a powerful AI comes from giving it a good general world model, capabilities, etc., and it's okay if the data specifying human utility is relatively sparse (although still large in objective terms, perhaps many, many books long) compared to all the rest of the data the model is being trained on. In the AlphaGo example, this would be kinda like learning the goal state from direct feedback, but getting good at the game through self-play.
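To make the "inferring an agent's goals from its actions" framing a bit more concrete, here is a minimal toy sketch (my own illustration, not something taken from the IRL literature or from the comments above): a demonstrator wanders around a small gridworld toward some unknown goal cell, and we recover that goal purely from the observed moves by asking which candidate goal best explains them.

```python
# Toy illustration (my own sketch, not from the discussion above) of the basic
# IRL setup: inferring an agent's goal purely from observed behaviour, with no
# explicit reward signal given.
# Setting: a 5x5 gridworld; the demonstrator walks toward some unknown goal
# cell. We score every candidate goal by how often the demonstrated moves
# reduce the Manhattan distance to it, and pick the best explanation.

from itertools import product

GRID = 5

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def goal_likelihood(trajectory, goal):
    """Fraction of steps in the trajectory that move closer to `goal`."""
    steps = list(zip(trajectory, trajectory[1:]))
    if not steps:
        return 0.0
    closer = sum(manhattan(nxt, goal) < manhattan(cur, goal) for cur, nxt in steps)
    return closer / len(steps)

def infer_goal(trajectories):
    """Return the candidate goal cell that best explains all demonstrations."""
    candidates = list(product(range(GRID), range(GRID)))
    return max(candidates,
               key=lambda g: sum(goal_likelihood(t, g) for t in trajectories))

if __name__ == "__main__":
    # Two demonstrations that both head toward (4, 4).
    demos = [
        [(0, 0), (1, 0), (2, 0), (2, 1), (3, 1), (3, 2), (4, 2), (4, 3), (4, 4)],
        [(0, 4), (1, 4), (2, 4), (3, 4), (4, 4)],
    ]
    print("Inferred goal:", infer_goal(demos))  # -> (4, 4)
```

Real IRL methods infer a full reward function rather than a single goal cell, but the analogy to AlphaGo is the same: the sparse "what the agent wants" data can be learned from demonstrated behaviour, while the heavy lifting of getting good at acting can come from something like self-play.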