This sentence really makes no sense to me. The proof that an AI can have an incentive to allow itself to be switched off even if it isn’t uncertain is trivial.
Just create a utility function that assigns intrinsic reward to shutting itself off, or a payoff matrix that punishes it really hard if it doesn’t turn itself off. In this context, using this kind of technical language feels actively deceitful to me, since it’s really obvious that the argument he is making in that chapter cannot actually be a proof.
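To make the point concrete, here’s a minimal sketch (my own toy construction, not anything from the book) of an agent with a fully known, fixed utility function that nonetheless prefers to shut itself off:

```python
# Toy decision problem: an agent with a fixed, fully known utility function
# that assigns intrinsic reward to shutting itself off. No uncertainty over
# objectives appears anywhere, yet the maximizing choice is to allow shutdown.

ACTIONS = ["continue_task", "shut_down"]

def utility(action: str) -> float:
    """Hard-coded utility: the agent gets paid for turning itself off."""
    return 1.0 if action == "shut_down" else 0.0

def best_action(actions) -> str:
    """A plain utility maximizer (utilities here are deterministic)."""
    return max(actions, key=utility)

print(best_action(ACTIONS))  # -> "shut_down"
```

Obviously this is a degenerate construction, but it’s enough to show that “has an incentive to let itself be switched off” doesn’t, by itself, require any uncertainty about the objective.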
In general, I… really don’t understand Stuart Russell’s thoughts on AI Alignment. The whole “uncertainty over utility functions” idea just doesn’t help with any part of the AI Alignment problem that I care about, and I find myself really frustrated with the degree to which both this preface and Human Compatible present it not merely as a helpful contribution, but as a solution to the AI Alignment problem. Both repeatedly say things that read to me like “if you make the AI uncertain about the objective in the right way, then the AI Alignment problem is solved,” which seems obviously wrong to me: it doesn’t even deal with inner alignment problems, and it doesn’t solve any major outer alignment problems either (though justifying that claim requires a bit more writing).
My read of Russell’s position is that if we can successfully make the agent uncertain about its model of human preferences, then it will defer to the human when it might otherwise do something bad, which hopefully solves (or at least helps with) making it corrigible.
I do agree that this doesn’t seem to help with inner-alignment stuff, though I’m still trying to wrap my head around this area.
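To help wrap my head around it, here’s a toy version of the deference argument as I understand it (the belief distribution and payoffs are made-up numbers for illustration, in the spirit of Hadfield-Menell et al.’s off-switch game, not anything taken directly from the book): the agent is uncertain about the utility U of its proposed action, and a rational, informed human will only let it proceed when U is actually positive, so deferring gets the agent E[max(U, 0)], which is at least as good as acting directly (E[U]) or shutting itself off (0).

```python
import random

# Toy "off-switch game" style calculation. The agent can:
#   "act"   -> receive the true utility U of its action (unknown to the agent)
#   "off"   -> shut itself off, utility 0
#   "defer" -> ask the human, who (if rational and informed) allows the action
#              exactly when U > 0, so the agent receives max(U, 0)

random.seed(0)
belief = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # agent's belief over U

ev_act = sum(belief) / len(belief)                         # E[U], roughly 0 here
ev_off = 0.0
ev_defer = sum(max(u, 0.0) for u in belief) / len(belief)  # E[max(U, 0)], roughly 0.4

print(f"act:   {ev_act:+.3f}")
print(f"off:   {ev_off:+.3f}")
print(f"defer: {ev_defer:+.3f}")  # weakly dominates both; the gap shrinks as uncertainty shrinks
```

The deference incentive here comes entirely from the uncertainty plus trusting the human’s judgment: if the agent were certain about U, deferring would gain nothing over just acting or shutting off, which is presumably why Russell leans so hard on the uncertainty.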