I’m looking for feedback on my understanding of Corrigibility.
I skimmed CHAI, Assistance Games, And Fully Updated Deferece . Is the key criticism that The Off-Switch Game ignores the embeddedness of the human? i.e. The agent will stochastically model the human and so will just try to predict what the human will say instead of asking it? This limitation is mentioned on Page 3.
Is this criticism correct or am I missing something else?
I’m looking for feedback on my understanding of Corrigibility.
I skimmed CHAI, Assistance Games, And Fully Updated Deferece . Is the key criticism that The Off-Switch Game ignores the embeddedness of the human? i.e. The agent will stochastically model the human and so will just try to predict what the human will say instead of asking it? This limitation is mentioned on Page 3.
Is this criticism correct or am I missing something else?