Inverse reinforcement learning, if I understand correctly, involves a human and AI working together
I think IRL just refers to the general setup of trying to infer an agent's goals from its actions (and possibly from communication/interaction with the agent), so you wouldn't need to learn the human utility function purely from human feedback. That said, I don't think relying on human feedback would necessarily be a deal-breaker: most of the work of making a powerful AI seems to come from giving it a good general world model, capabilities, etc., and it's okay if the data specifying human utility is relatively sparse (though still large in absolute terms, perhaps many books long) compared to all the rest of the data the model is being trained on. In the AlphaGo example, this would be kinda like learning the goal state from direct feedback, but getting good at the game through self-play.
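To make that division of labor concrete, here's a minimal toy sketch of my own (not taken from any particular IRL paper): a few noisily-rational "human" demonstrations are enough to infer which goal the demonstrator is pursuing, and then the bulk of the compute goes into ordinary rollouts against the inferred reward, loosely analogous to self-play. The gridworld, the Boltzmann-rationality model, the goal-indexed reward family, and all the numbers are illustrative assumptions, not a claim about how a real system would be built.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 5                                         # 5x5 gridworld (illustrative)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
GAMMA = 0.95


def step(state, a):
    """Deterministic moves, clamped at the walls."""
    r, c = state
    dr, dc = ACTIONS[a]
    return (min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1))


def q_values(goal, iters=100):
    """Value iteration: Q(s, a) under 'reward 1 for entering `goal`, 0 elsewhere'."""
    Q = np.zeros((N, N, len(ACTIONS)))
    for _ in range(iters):
        V = Q.max(axis=2)
        for r in range(N):
            for c in range(N):
                for a in range(len(ACTIONS)):
                    nr, nc = step((r, c), a)
                    Q[r, c, a] = (1.0 if (nr, nc) == goal else 0.0) + GAMMA * V[nr, nc]
    return Q


# --- Sparse "human" data: a few demos from a noisily-rational expert ---
TRUE_GOAL = (4, 4)
Q_expert = q_values(TRUE_GOAL)


def demonstrate(start, beta=5.0, horizon=12):
    """Boltzmann-rational expert: softmax over Q-values for the true goal."""
    s, traj = start, []
    for _ in range(horizon):
        logits = beta * Q_expert[s[0], s[1]]
        p = np.exp(logits - logits.max())
        a = rng.choice(len(ACTIONS), p=p / p.sum())
        traj.append((s, a))
        s = step(s, a)
    return traj


demos = [demonstrate((0, 0)), demonstrate((0, 4)), demonstrate((2, 1))]


# --- IRL step: which candidate goal best explains the sparse demos? ---
def log_likelihood(goal, beta=5.0):
    Q = q_values(goal)
    ll = 0.0
    for traj in demos:
        for s, a in traj:
            logits = beta * Q[s[0], s[1]]
            m = logits.max()
            ll += logits[a] - m - np.log(np.exp(logits - m).sum())
    return ll


candidates = [(r, c) for r in range(N) for c in range(N)]
inferred_goal = max(candidates, key=log_likelihood)
print("inferred goal:", inferred_goal)  # typically recovers TRUE_GOAL here

# --- The "self-play"-like part: lots of cheap rollouts against the inferred
# reward do the heavy lifting of actually becoming competent ---
Q_learned = np.zeros((N, N, len(ACTIONS)))
for _ in range(3000):
    s = (int(rng.integers(N)), int(rng.integers(N)))
    for _ in range(30):
        if rng.random() < 0.1:
            a = int(rng.integers(len(ACTIONS)))      # explore
        else:
            a = int(Q_learned[s[0], s[1]].argmax())  # exploit
        s2 = step(s, a)
        r = 1.0 if s2 == inferred_goal else 0.0
        Q_learned[s[0], s[1], a] += 0.2 * (r + GAMMA * Q_learned[s2[0], s2[1]].max()
                                           - Q_learned[s[0], s[1], a])
        s = s2
```

The point of the toy: the "human" data is three short trajectories, while the competence comes from tens of thousands of the agent's own steps against the inferred reward, which is the sparse-objective / heavy-self-training split I'm gesturing at above.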