The human-off-button doesn’t help Russell’s argument with respect to the weakness under discussion.
It’s the equivalent of a Roomba with a zap-obstacle action. Again, the solution is to dial theta towards the target and, assuming zaps are free, hold the zap button. It still has a closed-form solution that couldn’t be described as instrumental convergence.
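To make that concrete, here’s a minimal sketch (the evaluation function and all names below are hypothetical, just to show the shape of the objective): with a single heading parameter theta and a free zap action, the optimum is simply “point at the target and hold zap,” and nothing about reaching it looks like instrumental convergence.

```python
import math

# Minimal sketch of the one-parameter "Roomba" (hypothetical evaluation
# function, purely to illustrate the structure of the claim).
# The agent's only decision variables are a heading theta and whether it
# holds the zap button; zaps are assumed to cost nothing.

def proximity_score(theta, zap, target_angle, obstacle_in_path):
    # Hypothetical evaluation: how directly the Roomba approaches the target.
    # An un-zapped obstacle blocks all progress.
    progress = math.cos(theta - target_angle)
    return 0.0 if (obstacle_in_path and not zap) else progress

# Closed-form optimum: point straight at the target and always zap.
target_angle = 1.2  # radians; arbitrary example value
best_theta, best_zap = target_angle, True
print(proximity_score(best_theta, best_zap, target_angle, obstacle_in_path=True))  # 1.0
```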
Russell’s argument requires a more complex agent in order to demonstrate the danger of instrumental convergence rather than simple industrial machinery operation.
Isnasene’s point above is closer to that, but that’s not the argument that Russell gives.
‘and the assumption that an agent can compute a farsighted optimal policy)’
That assumption is doing a lot of work; it’s not clear what is packed into it, and it may not be sufficient to prove the argument.
I guess I’m not clear what the theta is for (maybe I missed something, in which case I apologize). Is there one initial action, namely how close it goes? And is it trained to maximize an evaluation function for its proximity, with theta as the only parameter?
That assumption is doing a lot of work; it’s not clear what is packed into it, and it may not be sufficient to prove the argument.
Well, my reasoning isn’t publicly available yet, but this is in fact sufficient, and the assumption can be formalized. For any MDP, there is a discount rate γ, and for each reward function there exists an optimal policy π∗ for that discount rate. I’m claiming that given γ sufficiently close to 1, optimal policies likely end up gaining power as an instrumentally convergent subgoal within that MDP.
(All of this can be formally defined in the right way. If you want the proof, you’ll need to hold tight for a while)
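To give a sense of the flavor (this is a hypothetical toy example and a Monte Carlo check, not the forthcoming proof): in the small deterministic MDP sketched below, one first action commits to a single absorbing state, while the other goes through a hub from which three absorbing states remain reachable. Drawing reward functions uniformly at random, the optimal policy takes the option-preserving action for about half of reward functions when γ is small, rising toward roughly 3/4 as γ → 1.

```python
import random

# Hypothetical toy MDP (deterministic, for illustration only):
#   state 0 (start): "narrow" -> state 1 (absorbing); "broad" -> state 2 (hub)
#   state 2 (hub):   may move on to state 3, 4, or 5 (each absorbing)
# Reward r[s] is received on every step spent in state s after entering it.

def q_values(r, gamma):
    """Q-values of the two first actions available at the start state."""
    def stay(s):
        # Discounted return from entering absorbing state s and remaining there.
        return r[s] / (1 - gamma)
    q_narrow = stay(1)                                        # commit to the lone absorbing state
    q_broad = r[2] + gamma * max(stay(s) for s in (3, 4, 5))  # hub first, then the best option
    return q_narrow, q_broad

def fraction_keeping_options(gamma, trials=20_000, seed=0):
    """Fraction of uniformly random reward functions whose optimal policy picks 'broad'."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        r = [rng.random() for _ in range(6)]  # a reward function drawn uniformly at random
        q_narrow, q_broad = q_values(r, gamma)
        wins += q_broad > q_narrow
    return wins / trials

for gamma in (0.1, 0.5, 0.9, 0.99):
    print(f"gamma={gamma}: 'broad' is optimal for "
          f"{fraction_keeping_options(gamma):.0%} of reward functions")
```

In this toy, “power” is just the number of absorbing states still reachable; the formal claim uses a proper definition, but the qualitative pattern (farsighted optimal policies tending to preserve options for most reward functions) is the same.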
That assumption is doing a lot of work; it’s not clear what is packed into it, and it may not be sufficient to prove the argument.
The work is now public.