Fun to see this is now being called ‘Holtman’s neglected result’. I am currently knee-deep in a project to support EU AI policy making, so I have no time to follow the latest agent foundations discussions on this forum any more, and I never follow Twitter, but briefly:
I can’t fully fault the world for neglecting ‘Corrigibility with Utility Preservation’ because it is full of a lot of dense math.
I wrote two follow-up papers to ‘Corrigibility with Utility Preservation’ which present the same results with more accessible math. For these I am a bit more upset that they have been somewhat neglected in the past, but if people are no longer neglecting them, great!
Does anyone have a technical summary?
The best technical summary of ‘Corrigibility with Utility Preservation’ may be my sequence on counterfactual planning which shows that the corrigible agents from ‘Corrigibility with Utility Preservation’ can also be understood as agents that do utility maximisation in a pretend/counterfactual world model.
For more references to the body of mathematical work on corrigibility, as written by me and others, see this comment.
In the end, the question of whether corrigibility is solved also depends on two counter-questions: what kind of corrigibility are you talking about, and what kind of ‘solved’ are you talking about? If you feel that certain kinds of corrigibility remain unsolved for certain values of ‘unsolved’, I might actually agree with you. See the discussion about universes containing an ‘Unstoppable Weasel’ in the ‘Corrigibility with Utility Preservation’ paper.