I hadn’t realized this post was nominated, partially because of my comment, so here’s a late review. I basically continue to agree with everything I wrote then and continue to like this post for those reasons, so I support including it in the LW Review.
Since writing the comment, I’ve come across another argument for thinking about intent alignment: it seems like a “generalization” of assistance games / CIRL, which is itself a formalization of an aligned agent in a toy setting. In assistance games, the agent explicitly maintains a distribution over possible human reward functions and instrumentally gathers information about human preferences by interacting with the human. With intent alignment, since the agent is trying to help the human, we expect it to instrumentally maintain a belief over what the human cares about, and to gather information that refines this belief. We might hope that there are ways to achieve intent alignment that instrumentally incentivize all the nice behaviors of assistance games, without requiring the modeling assumptions that CIRL makes (e.g., that the human has a fixed reward function that she knows and acts on).
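To make the assistance-game picture concrete, here is a minimal sketch (my own illustration, not from the post or any CIRL implementation) of the belief maintenance involved: the agent keeps a posterior over a small set of candidate human reward functions and updates it after observing the human act, assuming a Boltzmann-rational human model. All names and numbers (candidate_rewards, boltzmann_likelihood, the beta temperature) are hypothetical.

```python
import math

# Candidate reward functions over a toy set of actions (values are illustrative).
candidate_rewards = {
    "likes_coffee": {"make_coffee": 1.0, "make_tea": 0.0},
    "likes_tea":    {"make_coffee": 0.0, "make_tea": 1.0},
}

# Agent's prior belief over which reward function the human actually has.
belief = {name: 0.5 for name in candidate_rewards}

def boltzmann_likelihood(rewards, action, beta=2.0):
    """P(human picks `action`) under a Boltzmann-rational model with `rewards`."""
    exps = {a: math.exp(beta * r) for a, r in rewards.items()}
    return exps[action] / sum(exps.values())

def update_belief(belief, observed_action):
    """Bayesian update of the belief after observing one human action."""
    posterior = {
        name: p * boltzmann_likelihood(candidate_rewards[name], observed_action)
        for name, p in belief.items()
    }
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

# Observing the human make tea shifts probability mass toward "likes_tea".
belief = update_belief(belief, "make_tea")
print(belief)  # roughly {'likes_coffee': 0.12, 'likes_tea': 0.88}
```

The hope described above is that an intent-aligned agent would end up doing something like this belief refinement instrumentally, because it helps it do what the human wants, rather than because the update rule was built in.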
Changes I’d make to my comment:
It isolates the major, urgent difficulty in a single subproblem. If we make an AI system that tries to do what we want, it could certainly make mistakes, but it seems much less likely to cause e.g. human extinction.
I still think that the intent alignment / motivation problem is the most urgent, but there are certainly other problems that matter as well, so I would probably remove or clarify that point.