I have a few questions about corrigibility. First, I will tentatively define corrigibility as the property of an agent that is willing to let humans shut it off or change its goals, without manipulating those humans. I have seen it argued that corrigibility can lead to VNM-incoherence (i.e. the agent can be Dutch-booked / money-pumped). Has this result been proven in general?
Also, what is the current state of corrigibility research? If the incoherence result above holds in general, are there any other tractable theoretical directions we could take towards corrigibility?
Is anyone trying to create corrigible agents in practice? (I suspect it is unwise to try this, as any poorly understood corrigibility we manage to implement in practice is liable to be wiped away if a sharp left turn occurs.)
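
To make concrete what I mean by "money-pumped" above, here is a minimal sketch. It assumes a toy agent with cyclic strict preferences over three hypothetical items A, B and C. It is not a model of a corrigible agent, only of the generic failure mode that VNM-incoherence permits, and every name in it is illustrative.

```python
# Hypothetical toy agent with cyclic strict preferences: A > B > C > A.
# Each pair (x, y) means "x is strictly preferred to y".
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}


def accepts_trade(held, offered):
    """The agent swaps (and pays a fee) whenever it strictly prefers the offer."""
    return (offered, held) in PREFERS


def run_money_pump(start, offers, fee=1.0):
    """Offer items in sequence; return the final item held and the total fees paid."""
    held, paid = start, 0.0
    for offered in offers:
        if accepts_trade(held, offered):
            held = offered
            paid += fee
    return held, paid


# Starting from A, the agent is offered C, then B, then A again.
final_item, total_paid = run_money_pump("A", ["C", "B", "A"])
print(final_item, total_paid)
```

The agent accepts every trade because each one looks like a strict improvement, yet it ends up holding the item it started with while having paid three fees. Whether a corrigible agent must have preferences of this shape is exactly what I am asking about.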