This is really helpful, thanks. Perhaps the only disagreement here is pedagogical: I think it’s more useful to point people excited about utility uncertainty to “The easy goal inference problem is still hard” and to “Model Mis-specification and Inverse Reinforcement Learning”, because those posts engage directly with the premises of the approach. Arguing that it violates corrigibility, a concept that doesn’t fit cleanly into the CIRL framework, is more likely to produce confusion than an understanding of the problems (at least it did for me).
On the object level, I basically agree with Russell that a good enough solution to value learning seems very valuable since it expands the level of AI capabilities we can deploy safely in the world and buys us more time—basically the “stopgap” approach you mention. Composed with other agendas like automating AI alignment research, it might even prove decisive.
And framing CIRL in particular as a problem formalization rather than a solution approach seems right. I’ve found it very helpful to have a precise mathematical object like “CIRL” to point to when discussing the alignment problem with AI researchers, in contrast to the clusters of blog posts defining things like “alignment” and “corrigibility”.
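For readers who haven’t seen the formal object being pointed to: here is a minimal sketch of the CIRL game tuple, following Hadfield-Menell et al. (2016); the notation is my paraphrase rather than a quote from the paper.

```latex
% Sketch of the CIRL game (after Hadfield-Menell et al., 2016); my paraphrase.
M = \langle S,\ \{A^{\mathrm{H}}, A^{\mathrm{R}}\},\ T,\ \{\Theta, R\},\ P_0,\ \gamma \rangle
% S                        : world states
% A^H, A^R                 : action sets for the human H and the robot R
% T(s' \mid s, a^H, a^R)   : transition distribution
% \Theta                   : space of reward parameters
% R(s, a^H, a^R; \theta)   : reward function, shared by BOTH players
% P_0(s_0, \theta)         : prior over initial state and reward parameter
% \gamma                   : discount factor
%
% Only H observes \theta; R knows only the prior P_0, so it has to infer the
% reward from H's behaviour while both players maximise the same expected return.
```

Having that tuple on hand makes it much easier to ask a researcher “which component of this game does your objection live in?” than to argue about informally defined terms.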