Recently I came across a 2018 thesis by Tom Everitt which claims that utility uncertainty (referred to as CIRL in the thesis) does not circumvent the problem of reward corruption. I read through that section, and the examples it gives seem convincing, so I would like to hear your views on it. Also, do you know of a resource that gathers objections to using utility uncertainty as a framework for solving AI alignment? This seems especially relevant since at least two research papers now provide "solutions" to CIRL, indicating that it might soon be applicable to real-world problems.
Towards Safe Artificial General Intelligence (Tom Everitt):
(https://openresearch-repository.anu.edu.au/bitstream/1885/164227/1/Tom%20Everitt%20Thesis%202018.pdf)
An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning (D. Malik et al.):
(http://proceedings.mlr.press/v80/malik18a/malik18a.pdf)
Pragmatic-Pedagogic Value Alignment (J. F. Fisac et al.):
(https://arxiv.org/pdf/1707.06354.pdf)