I think the key point is that utility uncertainty does not in itself produce corrigibility after updating on all of the evidence. So you still need to write down a value learning procedure which produces the right answer in the limit of infinite data. Many people working on outer alignment think that’s a very difficult step, and are excited about something like corrigibility because it could provide an “out” that saves you from needing to solve that problem; they view fully updated deference as an argument that utility uncertainty can’t provide such an “out.” One way of putting this is that these researchers are mostly resigned to prior mis-specification and/or the unidentifiability of human values given the available data, such that they are unhappy “factoring out” that problem (e.g. see Jacob Steinhardt and Owain Evans’s post on mis-specification in value learning).
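(As a toy illustration of that first point, consider the following sketch; the numbers and the simple “human vetoes bad plans” model are made up purely for exposition, not taken from any of the posts referenced here. Under genuine uncertainty the agent strictly prefers to leave the human in control, but once its posterior collapses to a point estimate that preference evaporates.)

```python
# Toy illustration of "fully updated deference" (numbers and the human-veto
# model are invented for exposition only).

def values(posterior, act_value):
    """Compare acting unilaterally vs. deferring to a human veto.

    posterior: dict mapping reward hypothesis -> probability
    act_value: dict mapping hypothesis -> value (under that hypothesis) of the
        agent executing its plan without asking.
    Deferring is modeled as the human vetoing exactly the plans that are bad,
    so it yields max(value, 0) under each hypothesis.
    """
    act = sum(p * act_value[h] for h, p in posterior.items())
    defer = sum(p * max(act_value[h], 0.0) for h, p in posterior.items())
    return act, defer

act_value = {"humans_like_plan": 1.0, "humans_hate_plan": -10.0}

# Uncertain agent: leaving the human in control is strictly better.
print(values({"humans_like_plan": 0.6, "humans_hate_plan": 0.4}, act_value))
# acting: -3.4, deferring: 0.6

# Fully updated agent: a point-mass posterior makes deference worthless,
# and nothing here penalizes disabling the off switch.
print(values({"humans_like_plan": 1.0, "humans_hate_plan": 0.0}, act_value))
# acting: 1.0, deferring: 1.0
```

Nothing in this sketch gives the fully updated agent any reason to keep the human in the loop, which is the fully-updated-deference worry in miniature.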
Here is a related post I wrote, the easy goal inference problem is still hard, trying to argue for what I view as the “hard core” of a value-learning-based approach. I consider fully updated deference a good argument that in the limit, utility uncertainty is not a way of dodging the basic difficulties with such a value-learning-based approach. Some other papers (especially out of CHAI) try to directly engage with realistic models of human errors in a way that could yield a solution to the easy goal inference problem, though I’m currently not persuaded that any of these would meaningfully address the main difficulties in outer alignment. (For example, I think it’s instructive to imagine them as potential solutions to ELK.)
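(To gesture at why the assumed model of human error matters so much, here is a toy calculation; the numbers and the specific choice model are invented for illustration. The same observed behavior supports very different conclusions about the underlying reward depending on what you assume about how the human errs.)

```python
import math

# Toy calculation: the same choice data yields very different inferred rewards
# under different assumed error models (everything here is made up for
# illustration).

p_choose_A = 0.8  # observed: the human picks option A over option B 80% of the time

# Error model 1: Boltzmann-rational human with inverse temperature beta,
#   P(A) = exp(beta * r_A) / (exp(beta * r_A) + exp(beta * r_B)),
# so the inferred reward gap is r_A - r_B = logit(P(A)) / beta.
for beta in (0.5, 1.0, 2.0):
    gap = math.log(p_choose_A / (1 - p_choose_A)) / beta
    print(f"beta={beta}: inferred reward gap r_A - r_B = {gap:.2f}")

# Error model 2: the human actually prefers B but has an unmodeled habit or
# bias toward A. Under this story the same data says nothing about which
# option the human values more, and no amount of additional identical data
# distinguishes the two stories.
```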
One could still be optimistic about utility uncertainty if you either thought “the limit is far away” or were optimistic about confronting the other difficulties with value learning. This is obviously especially appealing if you are legitimately worried about failures caused by the AI’s lack of understanding of what humans want. (I’m less excited about that because I think failure modes like “the AI murders everyone” are very unlikely to emerge from realistic uncertainty about what humans want, since this is a pretty obvious fact about human preferences.)
From a discussion with Stuart Russell, my understanding is that he believes the easy goal inference problem may be hard “in the limit,” but that it may be possible for cognitive science to “keep up” with AI progress, so that we always have a good enough solution to value learning that we’d be happy with AIs optimizing our current best guess about human values, as defined by the best prior we can currently write down. I think this is mostly plausible if you imagine our AI alignment approaches as a stopgap for a brief period, or if you imagine highly automated AI cognitive science.
I discuss some other tangentially relevant issues in IRL and VOI, and in particular I contrast “corrigibility as preference” with “corrigibility as emergent phenomenon under reward uncertainty” as approaches to a basic problem for current RF optimization. This is closely related to ambitious vs. narrow value learning.
Overall I think it’s plausible that narrow value learning works well enough for learning corrigibility, such that fully updated deference wouldn’t be a big problem / you wouldn’t need any clever approach to corrigibility. But even in that case, I’m not convinced that reward uncertainty is addressing the major problems; I think the important problems are being addressed by other parts of the design.
It’s also worth briefly mentioning that CIRL doesn’t necessarily have to proceed through an explicit reward uncertainty approach, and so an Eliezer- or Richard-like objection to CIRL itself might be more like “this is a problem restatement; it may be reasonable as a way of communicating to AI researchers what the problem is without talking about robots killing you, but it’s not an approach to that problem and so should be compared to other problem statements rather than to other approaches.” (That said, I’m not sure whether this is actually the view of Eliezer and Richard, and my guess would be that they just don’t have a good understanding of how e.g. Dylan Hadfield-Menell thinks about what CIRL is.)
Note that this comment references my own writing mostly because it is primarily an expression of my own views; I’m not claiming to be the first or most important person to make any of these points.
This is really helpful, thanks. Perhaps the only disagreement here is pedagogical; I think it’s more useful to point people excited about utility uncertainty to the easy goal inference problem is still hard and to Model Mis-specification and Inverse Reinforcement Learning, because these engage directly with the premises of the approach. Arguing that it violates corrigibility, a concept that doesn’t fit cleanly into the CIRL framework, is more likely to lead to confusion than to an understanding of the problems (at least it did for me).
On the object level, I basically agree with Russell that a good enough solution to value learning seems very valuable since it expands the level of AI capabilities we can deploy safely in the world and buys us more time—basically the “stopgap” approach you mention. Composed with other agendas like automating AI alignment research, it might even prove decisive.
And framing CIRL in particular as a problem formalization rather than a solution approach seems right. I’ve found it very helpful to have a precise mathematical object like “CIRL” to point to when discussing the alignment problem with AI researchers, in contrast to the clusters of blog posts defining things like “alignment” and “corrigibility”.
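(For readers who haven’t encountered it, the object in question is, roughly and from my memory of the Hadfield-Menell et al. formulation, the following two-player game of partial information; see the original paper for the precise statement.)

```latex
% Sketch of the CIRL game, paraphrased from memory of Hadfield-Menell et al.
% (2016): a cooperative two-player game between a human H and a robot R.
M \;=\; \big\langle\, \mathcal{S},\ \{\mathcal{A}^{H}, \mathcal{A}^{R}\},\
    T(s' \mid s, a^{H}, a^{R}),\ \{\Theta,\ R(s, a^{H}, a^{R}; \theta)\},\
    P_0(s_0, \theta),\ \gamma \,\big\rangle
```

Both players receive the same reward R(·; θ); the human observes the reward parameter θ while the robot only knows the prior P₀, so the robot’s uncertainty about the reward, and its incentive to learn from human behavior, is part of the problem statement rather than a particular solution technique.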