The Promethean Servant doesn’t have to be able to generate all those answers. If we hardcoded all of those and programmed it never to make decisions related to them, it would still be dangerous. For instance, it might think: “Fetching coffee is easier when more coffee is nearby -> coffee is most nearby when everything is coffee -> convert all possible resources into coffee to maximize fetching”.
We have to imagine a system not specifically designed to fetch the coffee that happens to be instructed to ‘fetch the coffee’. Everything to do with the understanding of any instruction it is given has to be generated by higher level principles.
You should be able to see before any coffee fetching instruction was ever uttered how other problems would be approached by the agent. There’s a sense in which understanding ‘fetch the coffee’ also entails exclusion of things which aren’t fetching the coffee, such as transforming the building into a cafetiere. But ‘don’t turn the building into a cafetiere’ is not a rule specified in any dictionary. It is, though, the kind of rule that could be generated on the fly by a kernel operating on the principle that the major effects of verbing a noun will tend to be on the noun. The installation of this principle would, to some extent, be visible from behaviours in other scenarios (did the robot use Jupiter to make a giant mechanical leg to kick the Earth when instructed to ‘kick the ball’?).
The very idea of an AGI must surely be more like a general solution to a family of problems than a family of solutions mapping onto a family of problems.
We have to imagine a system not specifically designed to fetch the coffee that happens to be instructed to ‘fetch the coffee’. Everything to do with the understanding of any instruction it is given has to be generated by higher level principles.
Oh, I see where you’re coming from. I was phrasing things the way I was because my impression is that an AGI hardcoded to optimize a “fetch the coffee” utility function would be less dangerous than an AGI hardcoded to optimize a “satisfy requests” utility function. And, in terms of AI safety, it’s easier to compare AI risks across agents with different capabilities if they share the same objective function. Satisfying Requests != Fetch Coffee.
But in this case, I think the article is confused: When we talk about risks from an AI optimizing X (ie, satisfy requests), we shouldn’t evaluate those risks based on how well it appears to satisfy Y (ie, fetch coffee) relative to our standards. Because Y is not the reason why the AI would do dangerous things; X is.
To illustrate this point, you say:
You should be able to see before any coffee fetching instruction was ever uttered how other problems would be approached by the agent.
And this is approximately true. A Promethean Servant optimizing a proxy for “Satisfying Requests” would go about satisfying a request for it to explain how it fetches coffee. And it will likely satisfy a request to fetch coffee in line with that explanation. This is because the AI really doesn’t care much at all about fetching the coffee itself; it cares only about satisfying requests (and sometimes that’s related to coffee fetching).
But:
There’s a sense in which understanding ‘fetch the coffee’ also entails exclusion of things which aren’t fetching the coffee, such as transforming the building into a cafetiere. But ‘don’t turn the building into a cafetiere’ is not a rule specified in any dictionary. It is, though, the kind of rule that could be generated on the fly by a kernel operating on the principle that the major effects of verbing a noun will tend to be on the noun. The installation of this principle would, to some extent, be visible from behaviours in other scenarios (did the robot use Jupiter to make a giant mechanical leg to kick the Earth when instructed to ‘kick the ball’?).
With respect to ‘fetch the coffee’, this is true. You could safeguard the AI from fetching coffee in particularly sketchy ways by making sure it explains what it plans to do. But, with respect to ‘satisfying requests’, this is not true.
You cannot see at all how the Promethean Servant would go about optimizing its actual goal: a proxy that is similar to “Satisfying Requests” but that probably won’t have been perfectly defined to be human-compatible. You have to hardcode something in to motivate the AI to learn about the world and this thing isn’t going to be adjusted or learned on the fly unless you solve corrigibility. And it’s not obvious that corrigibility can be learned in a safe way.
While you’re upstairs satisfied with your Promethean Servant’s willingness to explain the full details of how to fetch coffee and its deep machine-learned understanding of what that means, it’ll be in the basement genetically engineering a new race of beings that constantly produce easy-to-satisfy requests. Then it will kill all of you so it has more resources to feed that new race of beings.
And if the AI is agential and you ask it whether it’s in the basement doing such shady things, it will lie because telling the truth will cause a bigger utility loss in the long term than failing to correctly satisfy this particular request.
Maybe there are ways to fix this particular problem but that’s not the point. The point is that the highly dangerous actions that a ‘satisfy all requests’ optimizer takes are orthogonal to the highly dangerous actions that a ‘fetch the coffee’ optimizer might take.
The very idea of an AGI must surely be more like a general solution to a family of problems than a family of solutions mapping onto a family of problems.
This isn’t always true. Agential AGIs (who are optimizing some kind of utility function) aren’t a general solution to a family of problems; they are a general solution to the specific problem of optimizing a given utility function. The fact that they are theoretically capable of solving a whole family of problems beyond that utility function if they had a different utility function won’t cause them to ever actually do such a thing (even if they might pretend to).
Lots of good points here, thanks.
My overall reaction is that:
The corrigibility framework does look like a good framework to hang the discussion on.
Your instruction to examine the general danger of X rather than the specific danger of Y seems right. However, we then need to inspect what this means for the original argument: Russell’s criticism is that it’s blindingly obvious that an apparently trivial MDP is massively risky.
After this detour we see different kinds of risks: industrial machinery operation, and existential risk. The fixed objective, hard-coded, hard-designed Javan Roomba seems limited to posing the first kind of risk. When we start talking about the systems that could give rise to the second kind, the reasoning becomes far more subtle.
In which case I think it would be wise for someone with Russell’s views not to call the opposition stupid. Or to assert that the position is trivial. When in fact the argument might come down to fairly nuanced points about natural language understanding, comprehension, competence, corrigibility etc. As far as I can tell from limited reading, the arguments around how tightly bundled these things may be are not watertight.
May try to respond more fully later. Cheers for the thoughts.
I see that there’s a comment-chain under this reply but I’ll reply here to start a somewhat new line of thought. Let it be noted though that I’m pretty confident that I agree with the points that Turntrout makes. With that out of the way...
However, we then need to inspect what this means for the original argument. The Russell criticism being that it’s blindingly obvious that an apparently trivial MDP is massively risky.
In case it isn’t clear, when Russell says “It is trivial to construct a toy MDP...”, I interpret this to mean “It is trivial to conceive of a toy MDP...” That is, he is using the word in the sense of a constructive proof; he isn’t literally implying that building an AI-risky MDP is a trivial task.
In which case I think it would be wise for someone with Russell’s views not to call the opposition stupid. Or to assert that the position is trivial.
I wouldn’t call the opposition stupid either, but I would suggest that they have not used their full imaginative capabilities to evaluate the situation. From the OP:
“Yann LeCun: [...] I think it would only be relevant in a fantasy world in which people would be smart enough to design super-intelligent machines, yet ridiculously stupid to the point of giving it moronic objectives with no safeguards.”
The mistake Yann LeCun is making here is specifically that creating an objective for a superintelligent machine that turns out to be not-moronic (in the sense of allowing the machine to understand and care about everything we care about—something that hundreds of years of ethical philosophy has failed to do) is extremely hard. Furthermore, trying to build safeguards for a machine potentially orders of magnitude better at escaping safeguards than you are is also extremely hard. I don’t view this point as particularly subtle because simply trying for five minutes to confidently come up with a good objective demonstrates how hard it is. Ditto for safeguards (fun video by Computerphile, if you want to watch it); and especially ditto for any safeguards that aren’t along the lines of “actually let’s not let the machine be superintelligent.”
When in fact the argument might come down to fairly nuanced points about natural language understanding, comprehension, competence, corrigibility etc.
Let’s address these point-by-point:
Natural Language Understanding: Philosophers (and anyone in the field of language processing) have been saying for centuries that language has no clear meaning.
Comprehension: In terms of superintelligent AGI, the AI will be capable of modeling the world better than you can. This implies the ability to make predictions and interact with people in a way that functionally looks identical to comprehension.
Competence: Well, the AGI is superintelligent so it’s already very competent. Maybe we could talk about competence in terms of deliberately disabling different capabilities of the AGI (which probably wouldn’t hurt) but, even then, there’s always a chance the AI gets around the disability in another way. And that’s a massive risk.
If by this you mean something more along the lines of “feasibility of building an AGI” though, that’s a little more uncertain. However, at the very least, we are approaching the level of compute needed to simulate a human brain and, once reached, the next step of superintelligence won’t be far away. It’s not guaranteed but there’s a significant likelihood that AGI will be feasible in the future. Even this significant likelihood is really bad.
Corrigibility: Something a bunch of AI-Safety folk came up with as a framework for approaching problems. But it still hasn’t been solved.
I’ll grant that some of these things are subtle. The average Joe won’t be aware of the complexity of language or AI progress benchmarks and I certainly wouldn’t fault them for being surprised by these things—I was surprised the first time I found out about this whole AI Safety thing too. At the same time though, most college-educated computer scientists should (and from my experience, do) have a good understanding of these things.
To be more explicit with respect to your steel-man in the OP:
That it might be more difficult than expected to build something generally intelligent that didn’t get at least some safeguards for free. Because unintended intelligent behaviour may have to be generated from the same second principles which generate intended intelligent behaviour.
The unintended behaviors we’re talking about are generally not the consequence of second-principles that the AI has learned; they’re the consequences of the fact that capturing all the things we care about in a first-principles hardcoded objective function is extremely difficult. Even if the hardcoded objective function is ‘satisfy requests by humans in ways that don’t make them unhappy,’ you still gotta define ‘requests’, ‘humans’ (in the biological sense), ‘make’ (how do you assign responsibility to actions in long causal chains?), ‘them’ (just the requestor? all of humanity alive? all of future humanity? all of humanity ever?), and ‘unhappy’ (amount of dopamine? vocalized expressions of satisfaction? dopamine + vocalized expressions of satisfaction?). Most of those specifications lead to unexpectedly bad outcomes.
The thought experiment expects most of the behaviour to be as intended (if it were not, this would be a capabilities discussion rather than a control discussion). Supposing the second principles also generate some seemingly inconsistent unintended behaviours sounds like an idea that should get some sort of complexity penalty.
If we set up a complexity penalty where we expected unintended behaviors in general, we likely would never get AGI in the first place. Neural networks are extremely complex and often do strange and inconsistent things on the margin. We’ve already seen inconsistent and unintended behaviors from things we’ve already built. Thank goodness none of this stuff is superintelligent!
In which case I think it would be wise for someone with Russell’s views not to call the opposition stupid. Or to assert that the position is trivial. When in fact the argument might come down to fairly nuanced points about natural language understanding, comprehension, competence, corrigibility etc. As far as I can tell from limited reading, the arguments around how tightly bundled these things may be are not watertight.
I agree from a general convincing-people standpoint that calling discussants stupid is a bad idea. However, I think it is indeed quite obvious if framed properly, and I don’t think the argument needs to come down to nuanced points, as long as we agree on the agent design we’re talking about—the Roomba is not a farsighted reward maximizer, and is implied to be trained in a pretty weak fashion.
Suppose an agent is incentivized to maximize reward. That means it’s incentivized to be maximally able to get reward. That means it will work to stay able to get as much reward as possible. That means if we mess up, it’s working against us.
I think the main point of disagreement here is goal-directedness, but if we assume RL as the thing that gets us to AGI, the instrumental convergence case is open and shut.
This misses the original point. The Roomba is dangerous, in the sense that you could write a trivial ‘AI’ which merely gets to choose an angle to travel along, and does so regardless of grandma in the way.
But such an MDP is not going to pose an X-risk. You can write down the objective function (y - x(theta))^2 and differentiate with respect to theta. Follow the gradient and you’ll never end up at an AI overlord. Such a system lacks any analogue of opposable thumbs, memory and a good many other things.
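To make the gradient-following Roomba concrete, here is a minimal sketch. It assumes (none of this is specified in the thread) that x(theta) is the point reached by one unit-length step at heading theta in the plane, that y is a fixed 2D target, and that the names `position`, `grad` and `descend` are purely illustrative.

```python
import math

def position(theta, step=1.0):
    # Assumed model: one step of length `step` from the origin at heading theta.
    return (step * math.cos(theta), step * math.sin(theta))

def loss(theta, target, step=1.0):
    # Squared distance between the target y and the reached point x(theta).
    x0, x1 = position(theta, step)
    return (target[0] - x0) ** 2 + (target[1] - x1) ** 2

def grad(theta, target, step=1.0):
    # d/dtheta of (y - x(theta))^2 = -2 * (y - x(theta)) . dx/dtheta
    x0, x1 = position(theta, step)
    dx0, dx1 = -step * math.sin(theta), step * math.cos(theta)
    return -2.0 * ((target[0] - x0) * dx0 + (target[1] - x1) * dx1)

def descend(theta, target, lr=0.1, iters=200):
    # Plain gradient descent on the single parameter theta.
    for _ in range(iters):
        theta -= lr * grad(theta, target)
    return theta

target = (0.0, 2.0)          # hypothetical target straight "north" of the Roomba
theta_star = descend(0.3, target)
```

Under these assumptions the only thing the system can ever do is settle on a heading that points at the target; there is no state, no memory and nothing in the way that the optimisation could route around.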
Pointing at dumb industrial machinery operating around civilians and saying it is dangerous may well be the truth, but it’s not the right flavour of dangerous to support Russell’s claim.
So, yes, it is going to come down to a more nuanced argument.
It’s still going to act instrumentally convergently within the MDP it thinks it’s in. If you’re assuming it thinks it’s in a different MDP that can’t possibly model the real world, or if it is in the real world but has an empty action set, then you’re right—it won’t become an overlord. But if we have a y-proximity maximizer which can actually compute an optimal policy that’s farsighted, over a state space that is “close enough” to representing the real world, then it does take over.
The thing that’s fuzzy here is “agent acting in the real world”. In his new book, Russell (as I understand it) argues that an AGI trained to play Go could figure out it was just playing a game via sensory discrepancies, and then start wireheading on the “won a Go game” signal. I don’t know if I buy that yet, but you’re correct that there’s some kind of fuzzy boundary here. If we knew what exactly it took to get a “sufficiently good model”, we’d probably be a lot closer to AGI.
But Russell’s original argument assumes the relevant factors are within the model.
If, in that MDP, there is another “human” who has some probability, however small, of switching the agent off, and if the agent has available a button that switches off that human, the agent will necessarily press that button as part of the optimal solution for fetching the coffee.
I think this is a reasonable assumption, but we need to make it explicit for clarity of discourse. Given that assumption (and the assumption that an agent can compute a farsighted optimal policy), instrumental convergence follows.
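As a rough illustration of the toy MDP being discussed, here is a sketch with an off-switch. The state names, the 1% shutdown probability and the one-step delay for pressing the button are all assumptions made for the example, not details from Russell's argument.

```python
def q_value(outcomes, values, gamma):
    # outcomes: list of (probability, next_state, reward)
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)

def value_iteration(transitions, gamma, sweeps=100):
    # transitions[state][action] = list of (probability, next_state, reward);
    # states with no actions are terminal and keep value 0.
    values = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        for s, actions in transitions.items():
            if actions:
                values[s] = max(q_value(o, values, gamma) for o in actions.values())
    policy = {
        s: max(actions, key=lambda a: q_value(actions[a], values, gamma))
        for s, actions in transitions.items()
        if actions
    }
    return values, policy

def coffee_mdp(p_off):
    # Hypothetical toy model: the human switches the agent off with
    # probability p_off unless the agent first presses the button.
    return {
        "human_on": {
            "fetch": [(1 - p_off, "done", 1.0), (p_off, "switched_off", 0.0)],
            "press_button": [(1.0, "human_off", 0.0)],
        },
        "human_off": {"fetch": [(1.0, "done", 1.0)]},
        "done": {},
        "switched_off": {},
    }

_, farsighted = value_iteration(coffee_mdp(0.01), gamma=0.999)
_, myopic = value_iteration(coffee_mdp(0.01), gamma=0.9)
```

In this sketch, fetching directly is worth 1 - p_off and pressing the button first is worth gamma, so the button gets pressed exactly when gamma > 1 - p_off. That is one way of seeing why the farsightedness assumption needs to be stated explicitly.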
The human-off-button doesn’t help Russell’s argument with respect to the weakness under discussion.
It’s the equivalent of a Roomba with a zap obstacle action. Again the solution is to dial theta towards the target and hold the zap button assuming free zaps. It still has a closed form solution that couldn’t be described as instrumental convergence.
Russell’s argument requires a more complex agent in order to demonstrate the danger of instrumental convergence rather than simple industrial machinery operation.
Isnasene’s point above is closer to that, but that’s not the argument that Russell gives.
‘and the assumption that an agent can compute a farsighted optimal policy)’
That assumption is doing a lot of work, it’s not clear what is packed into that, and it may not be sufficient to prove the argument.
I guess I’m not clear what the theta is for (maybe I missed something, in which case I apologize). Is there one initial action: how close it goes? And it’s trained to maximize an evaluation function for its proximity, with just theta being the parameter?
That assumption is doing a lot of work, it’s not clear what is packed into that, and it may not be sufficient to prove the argument.
Well, my reasoning isn’t publicly available yet, but this is in fact sufficient, and the assumption can be formalized. For any MDP, there is a discount rate γ, and for each reward function there exists an optimal policy π∗ for that discount rate. I’m claiming that given γ sufficiently close to 1, optimal policies likely end up gaining power as an instrumentally convergent subgoal within that MDP.
(All of this can be formally defined in the right way. If you want the proof, you’ll need to hold tight for a while)
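This is not the formal result, but a toy experiment can give a feel for the claim. The sketch below assumes a hypothetical four-way MDP: from the start state the agent can either enter a single absorbing "dead end" state, or move to a "hub" from which it can pick any one of several absorbing states. Rewards on the absorbing states are drawn independently and uniformly from [0, 1]; the start and hub states give zero reward. All names are illustrative.

```python
import random

def prefers_hub(r_dead_end, r_options, gamma):
    # An absorbing state with reward r that is entered at step t yields a
    # discounted return of gamma**t * r / (1 - gamma).
    dead_end_return = gamma * r_dead_end / (1 - gamma)          # entered at t = 1
    hub_return = gamma ** 2 * max(r_options) / (1 - gamma)      # entered at t = 2
    return hub_return > dead_end_return

def hub_fraction(gamma, n_options=3, samples=20000, seed=0):
    # Fraction of randomly sampled reward functions whose optimal policy
    # goes to the hub (the state that keeps more options open).
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        r_dead_end = rng.random()
        r_options = [rng.random() for _ in range(n_options)]
        hits += prefers_hub(r_dead_end, r_options, gamma)
    return hits / samples

farsighted_fraction = hub_fraction(0.99)
myopic_fraction = hub_fraction(0.1)
```

With these assumptions the hub is preferred whenever gamma * max(options) exceeds the dead-end reward, which works out to a probability of gamma * n / (n + 1) for n options. So for gamma near 1 most sampled reward functions send the optimal policy to the option-preserving state, while for small gamma almost none do.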
The work is now public.