Feedback: I clicked through to the provided answer and had a great deal of difficulty understanding how it was relevant—it makes a number of assumptions about agents and utility functions and I wasn’t able to connect it to why I should expect an agent trained using CIRL to kill me.
FWIW here’s my alternative answer:
CIRL agents are bottlenecked on the human overseer’s ability to provide them with a learning signal through demonstration or direct communication. This is unlikely to scale to superhuman abilities in the agent, so superintelligent agents simply will not be trained using CIRL.
In other words, it’s only a solution to “Learn from Teacher” in Paul’s 2019 decomposition of alignment, not to the whole alignment problem.
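To make the bottleneck point concrete, here’s a minimal sketch (a toy Bayesian reward-inference loop, not the actual CIRL game; the Boltzmann-rational human model and the names `candidate_thetas`, `beta`, etc. are assumptions I’m making up purely for illustration): the robot’s belief about what the human wants only ever moves through what the human manages to demonstrate.

```python
# Toy sketch, assuming a Boltzmann-rational human and a made-up two-action
# environment (illustrative Bayesian reward inference, not the full CIRL game):
# the robot's posterior over the reward only updates via human demonstrations.
import numpy as np

rng = np.random.default_rng(0)

candidate_thetas = np.array([-1.0, 0.0, 1.0])  # reward hypotheses the robot entertains
true_theta = 1.0                               # the human's actual preference
actions = np.array([-1.0, 1.0])                # reward of action a under theta is theta * a
beta = 2.0                                     # how reliably the human demonstrates well

def action_probs(theta):
    """Boltzmann-rational demonstration policy: P(a) proportional to exp(beta * theta * a)."""
    logits = beta * theta * actions
    p = np.exp(logits - logits.max())
    return p / p.sum()

posterior = np.full(len(candidate_thetas), 1.0 / len(candidate_thetas))

for _ in range(20):
    # The ONLY learning signal: an action the human demonstrates.
    a = rng.choice(actions, p=action_probs(true_theta))
    a_idx = int(np.argmax(actions == a))
    # Bayesian update of the robot's belief over reward hypotheses.
    likelihood = np.array([action_probs(th)[a_idx] for th in candidate_thetas])
    posterior = posterior * likelihood
    posterior = posterior / posterior.sum()

print(dict(zip(candidate_thetas.tolist(), posterior.round(3).tolist())))
# If the human can't demonstrate the behaviour they want (low beta, or a task
# beyond their ability), the posterior barely moves -- that's the bottleneck.
```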