My read of Russell’s position is that if we can successfully make the agent uncertain about its model of human preferences, then it will defer to the human when it might do something bad, which hopefully solves (or at least helps with) making it corrigible.
I do agree that this doesn’t seem to help with inner-alignment stuff, though; I’m still trying to wrap my head around that area.
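To check my own intuition about why uncertainty would produce deferral, here’s a toy numeric sketch in the spirit of the off-switch game from Hadfield-Menell et al. This is my illustration, not Russell’s actual formalism: the numbers, the Gaussian posterior, and the assumption that the human vetoes exactly when the action is bad for them are all made up for the example.

```python
# Toy sketch of the "uncertainty -> deferral" intuition (my illustration, not
# anyone's actual model). The agent has a posterior over how much the human
# values a proposed action, and compares acting unilaterally vs. deferring.
import numpy as np

rng = np.random.default_rng(0)

# Agent's posterior over the human's utility u for the action.
# Wide uncertainty: the action might be good (+) or bad (-) for the human.
utility_samples = rng.normal(loc=0.2, scale=1.0, size=100_000)

# Option 1: act unilaterally -- expected utility under the agent's own beliefs.
value_act = utility_samples.mean()

# Option 2: defer -- propose the action and let the human veto it.
# Assuming the human approves exactly when the action is actually good (u > 0),
# deferral yields max(u, 0) in each possible world.
value_defer = np.maximum(utility_samples, 0).mean()

print(f"act unilaterally: {value_act:.3f}")
print(f"defer to human:   {value_defer:.3f}")
# Deferring is (weakly) better whenever the agent is genuinely uncertain,
# since E[max(u, 0)] >= max(E[u], 0). With a confident posterior the gap
# shrinks, and so does the incentive to stay deferential.
```

The last comment is the part I take to be doing the work in Russell’s argument: the incentive to defer comes entirely from the agent’s uncertainty, so if that uncertainty collapses (or the model of the human is wrong), the corrigibility incentive goes with it.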