I don’t think we can write down any topology over behaviors or policies for which they are disconnected (otherwise we’d probably be done). My point is that there seems to be a difference in kind between the corrigible behaviors and the incorrigible behaviors, a fundamental structural difference in why they get rated highly. That’s not just some fuzzy and arbitrary line; it seems closer to a fact about the dynamics of the world.
If you are in the business of “trying to train corrigibility” or “trying to design corrigible systems,” I think understanding that distinction is what the game is about.
If you are trying to argue that corrigibility is unworkable, I think that debunking the intuitive distinction is what the game is about. The kind of thing people often say—like “there are so many ways to mess with you, how could a definition cover all of them?”—doesn’t make any progress on that, and so it doesn’t help reconcile the intuitions or convince most optimists to be more pessimistic.
(Obviously all of that is just a best guess though, and the game may well be about something totally different.)