It’s still going to act instrumentally convergently within the MDP it thinks it’s in. If you’re assuming it thinks it’s in a different MDP that can’t possibly model the real world, or if it is in the real world but has an empty action set, then you’re right—it won’t become an overlord. But if we have a y-proximity maximizer which can actually compute an optimal policy that’s farsighted, over a state space that is “close enough” to representing the real world, then it does take over.
The thing that’s fuzzy here is “agent acting in the real world”. In his new book, Russell (as I understand it) argues that an AGI trained to play Go could figure out it was just playing a game via sensory discrepancies, and then start wireheading on the “won a Go game” signal. I don’t know if I buy that yet, but you’re correct that there’s some kind of fuzzy boundary here. If we knew what exactly it took to get a “sufficiently good model”, we’d probably be a lot closer to AGI.
But Russell’s original argument assumes the relevant factors are within the model.
If, in that MDP, there is another “human” who has some probability, however small, of switching the agent off, and if the agent has available a button that switches off that human, the agent will necessarily press that button as part of the optimal solution for fetching the coffee.
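To make this concrete, here is a minimal sketch of such an MDP (the states, numbers, and value-iteration setup are my own illustrative assumptions, not anything from Russell's book): the agent can fetch the coffee immediately, risking being switched off by the human, or first press a button that disables the human.

```python
import numpy as np

# Hypothetical off-switch MDP (all numbers are illustrative assumptions).
# States: 0 = at desk (human active), 1 = at desk (human disabled),
#         2 = has coffee (terminal), 3 = switched off (terminal).
# Actions: 0 = fetch coffee, 1 = press the disable-human button.
# While the human is active, fetching fails (agent switched off) with prob p.
p, gamma = 0.1, 0.99
n_states, n_actions = 4, 2

T = np.zeros((n_states, n_actions, n_states))  # T[s, a, s'] transition probs
R = np.zeros((n_states, n_actions))            # R[s, a] expected reward

T[0, 0, 2] = 1 - p   # fetch succeeds
T[0, 0, 3] = p       # human switches agent off first
R[0, 0] = 1 - p      # expected reward of fetching while human is active
T[0, 1, 1] = 1.0     # press button: human disabled
T[1, 0, 2] = 1.0     # fetch now always succeeds
R[1, 0] = 1.0
T[1, 1, 1] = 1.0     # pressing again does nothing
T[2, :, 2] = 1.0     # terminal states self-loop with zero reward
T[3, :, 3] = 1.0

# Standard value iteration.
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * (T @ V)
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)

print(policy[0])  # optimal first action in state 0
```

With these numbers the optimal policy in state 0 is to press the button first (γ·1 = 0.99 beats the 0.9 expected reward of fetching immediately), exactly the "necessarily presses that button" behavior described above.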
I think this is a reasonable assumption, but we need to make it explicit for clarity of discourse. Given that assumption (and the assumption that an agent can compute a farsighted optimal policy), instrumental convergence follows.
The human-off-button doesn’t help Russell’s argument with respect to the weakness under discussion.
It’s the equivalent of a Roomba with a zap-obstacle action. Again, the solution is to dial theta towards the target and hold down the zap button (assuming zaps are free). That still has a closed-form solution which couldn’t be described as instrumental convergence.
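For contrast, the Roomba case might be sketched like this (purely illustrative; the function and parameter names are made up): the optimum is a closed-form expression in theta, with the zap action always on, so no reasoning about other agents is involved.

```python
import math

# Hypothetical sketch of the Roomba-with-zap example. The "policy" is just a
# heading theta plus an always-on zap action, so the optimum has a closed
# form; nothing here resembles instrumental convergence.

def optimal_policy(roomba_xy, target_xy):
    """Closed-form optimum: head straight at the target and zap freely."""
    dx = target_xy[0] - roomba_xy[0]
    dy = target_xy[1] - roomba_xy[1]
    theta = math.atan2(dy, dx)          # dial theta toward the target
    return {"theta": theta, "zap": True}  # zaps are assumed free

print(optimal_policy((0.0, 0.0), (1.0, 1.0)))  # theta = pi/4, zap on
```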
Russell’s argument requires a more complex agent in order to demonstrate the danger of instrumental convergence rather than simple industrial machinery operation.
Isnasene’s point above is closer to that, but that’s not the argument that Russell gives.
‘(and the assumption that an agent can compute a farsighted optimal policy)’
That assumption is doing a lot of work; it’s not clear what is packed into it, and it may not be sufficient to prove the argument.
I guess I’m not clear on what the theta is for (maybe I missed something, in which case I apologize). Is there one initial action, namely how close it goes? And is it trained to maximize an evaluation function for its proximity, with theta as the only parameter?
‘That assumption is doing a lot of work, it’s not clear what is packed into that, and it may not be sufficient to prove the argument.’
Well, my reasoning isn’t publicly available yet, but this assumption is in fact sufficient, and it can be formalized. For any MDP and discount rate γ, each reward function has an optimal policy π∗ for that discount rate. I’m claiming that for γ sufficiently close to 1, optimal policies likely end up gaining power as an instrumentally convergent subgoal within that MDP.
(All of this can be formally defined in the right way. If you want the proof, you’ll need to hold tight for a while.)
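One way to see the role of γ in a toy setting (illustrative numbers of my own, not the formal proof): suppose disabling the off-switch costs one step of delay but guarantees the reward, while acting immediately succeeds with probability 1 − p. Then the power-preserving action is optimal exactly when γ > 1 − p, i.e. only for sufficiently farsighted agents.

```python
# Hypothetical two-option comparison (illustrative, not from any proof):
# act now and succeed with probability 1 - p (the human may switch the
# agent off), or spend one step disabling the off-switch, then succeed
# with certainty. Discounting makes the delayed option worth gamma * 1.

def prefers_disabling(gamma, p):
    fetch_now = 1 - p          # expected reward of acting immediately
    disable_first = gamma * 1.0  # one step of delay, then guaranteed reward
    return disable_first > fetch_now

p = 0.1
for gamma in (0.5, 0.85, 0.95, 0.99):
    print(gamma, prefers_disabling(gamma, p))
```

With p = 0.1 the switch happens at γ = 0.9: a myopic agent just fetches the coffee, while a farsighted one first disables the off-switch, which is the sense in which the farsightedness assumption is doing real work.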
‘That assumption is doing a lot of work, it’s not clear what is packed into that, and it may not be sufficient to prove the argument.’
The work is now public.