We should try and nail down the concept of corrigibility when I’m in the US—are you in San Francisco currently?
I have three thoughts on your example. First, it does seem like a better version of corrigibility than any I’ve seen. Second, it doesn’t help much in cases where the AI has to determine your preferences, like the “teach me about charities” example. And third, it puts a lot of weight on the AI successfully informing the human; it’s trivial to mislead the human with entirely truthful answers, especially when manipulating the human is an instrumental goal for the AI.
“What are you trying to achieve in this conversation?” “To allow you to write your will to the best of your ability, as specified in my programming.” That wouldn’t even be a lie...
(As I said at the end of the post, I have more hope on the accurate-answer front, so maybe we could get that to work?)