I wonder if we can rescue Eliezer’s argument. Informally (as far as I understand it) Eliezer’s argument is that if an agent is the result of some optimization process, that optimization process will tend to notice and fix any incoherent behavior in the agent because that behavior will likely cause the agent to do something that counts as a clear loss from the optimization process’s perspective.
So instead of letting O be either world states or world trajectories, make it the set of all possible combinations of properties of world trajectories that optimization processes in our world might care about. Formally we can define this as a partition of all possible world trajectories into mutually exclusive subsets where two trajectories are in the same subset iff no optimization process in our light-cone is likely to distinguish between them in any way. (BTW I believe it’s standard or at least not unusual in decision theory to think of O as coarse-grained outcomes that people might care about, rather than micro states or micro trajectories.)
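The coarse-graining described above can be sketched concretely. This is a toy model of my own (the micro-trajectories and the "cared-about" features are invented for illustration, not taken from the thread): partition a small set of trajectories into the outcome set O by the feature values that hypothetical optimization processes track.

```python
# Sketch (my own formalization, not from the thread): coarse-grain world
# trajectories into the outcome set O by the properties that optimization
# processes might care about. Two trajectories land in the same cell of
# the partition iff every "cared-about" feature agrees on them.

from itertools import product

# Toy world: a trajectory is a tuple of per-step events (hypothetical).
trajectories = list(product(["twitch-left", "twitch-right"], repeat=3))

# Hypothetical features that some optimization process in the world cares
# about. Note that the *exact* twitch sequence is not among them.
features = [
    lambda t: t.count("twitch-left") > 1,   # e.g. "mostly twitches left"
    lambda t: t[0] == t[-1],                # e.g. "ends as it began"
]

def cell(trajectory):
    """The element of O containing this trajectory: the tuple of all
    cared-about feature values. Trajectories with identical feature
    values are indistinguishable to every optimization process."""
    return tuple(f(trajectory) for f in features)

# Build the partition O as a mapping from cell label to member trajectories.
O = {}
for t in trajectories:
    O.setdefault(cell(t), []).append(t)

# The 8 micro-trajectories collapse into at most 4 coarse outcomes.
print(len(trajectories), "trajectories,", len(O), "coarse outcomes")
```

With only two binary features, at most four outcomes are distinguishable, however many micro-trajectories there are; that collapse is the whole point of defining O this way.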
Now Rohin’s objection no longer applies because we can’t always find “a utility function which assigns maximal utility to all and only the world-trajectories in which those choices were made”. Consider an agent that twitches according to some random sequence R. Since no optimization process in our world is likely to care that an agent twitches exactly according to R, any element of O that contains a trajectory where the agent twitches according to R would also contain a trajectory where the agent twitches according to some other sequence R’, so there is no utility function which assigns maximal utility to all and only the world-trajectories in which the agent twitches according to R.
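The blocking step can also be made concrete. Again a toy construction of my own (the sequences and features are hypothetical): a utility function defined over elements of O only sees a trajectory through its cell, so it cannot reward the exact sequence R and nothing else.

```python
# Toy illustration (my own construction, not from the thread) of why the
# coarse-grained outcome set O blocks the objection: a utility function
# over elements of O cannot assign maximal utility to exactly the
# trajectories where the agent twitches according to R, because some
# other sequence R' lands in the same cell of O.

# A trajectory here is just the agent's twitch sequence (hypothetical).
R  = ("twitch-left", "twitch-right", "twitch-left")  # the agent's sequence
Rp = ("twitch-left", "twitch-left", "twitch-left")   # a different sequence

# Hypothetical cared-about features; "matches R exactly" is not among them.
features = [
    lambda t: t.count("twitch-left") > 1,  # "mostly twitches left"
    lambda t: t[0] == t[-1],               # "ends as it began"
]

def cell(trajectory):
    """The element of O containing this trajectory."""
    return tuple(f(trajectory) for f in features)

# R and R' agree on every cared-about feature, so they occupy the same
# element of O.
assert cell(R) == cell(Rp)

# A utility function over O only sees a trajectory through its cell:
def induced_utility(U, trajectory):
    return U.get(cell(trajectory), 0.0)

U = {cell(R): 1.0}  # try to assign maximal utility to "twitching per R"
# ...but then R' is maximal too, so U cannot single out R:
assert induced_utility(U, R) == induced_utility(U, Rp) == 1.0
```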
Having (hopefully) formalized the argument in a way that is no longer vacuous, I have to say I’m not entirely sure what the larger point of it is. Rohin seems to think the point is “Simply knowing that an agent is intelligent lets us infer that it is goal-directed” but Eliezer doesn’t seem to think that corrigible (hence not goal-directed) agents are impossible to build. (That’s actually one of MIRI’s research objectives even though they take a different approach from Paul’s.) Can anyone link to places where Eliezer uses this argument as part of some larger argument?
My interpretation of Eliezer’s position is something like: “If you see an intelligent agent that wasn’t specifically optimized away from goal-directedness, then it will be goal-directed”. I think I could restate arguments for that, though I remember reading about a bunch of stuff related to this on Arbital, so maybe one of the writeups there gives more background.
Here’s an example of Eliezer using the argument: AI Alignment: Why It’s Hard, and Where to Start.

From Rohin’s post, a quote which I also endorse:

You could argue that while [building AIs with really weird utility functions] is possible in principle, no one would ever build such an agent. I wholeheartedly agree, but note that this is now an argument based on particular empirical facts about humans (or perhaps agent-building processes more generally).

And if you’re going to argue based on particular empirical facts about what goals we expect, then I don’t think that doing so via coherence arguments helps very much.
I note that the first sentence of your post is “Rohin Shah has recently criticised Eliezer’s argument that “sufficiently optimised agents appear coherent”, on the grounds that any behaviour can be rationalised as maximisation of the expectation of some utility function.” so it seems worth pointing out that there’s a reasonable way to interpret “sufficiently optimised agents appear coherent” which isn’t subject to that criticism.
Beyond that, as I mentioned, it’s not clear to me what Eliezer was arguing for. (It seems plausible that he considered “sufficiently optimised agents appear coherent”, or the immediate corollary that such agents can be viewed as approximate EU maximizers with utility functions over the O that I defined, interesting in itself as a possibly surprising prediction that we can make about such agents.) What larger conclusion do you think he was arguing for, and why (preferably with citations)? Once we settle that, maybe then we can discuss whether his argumentative strategy was a good one?
Rohin seems to think the point is “Simply knowing that an agent is intelligent lets us infer that it is goal-directed” but Eliezer doesn’t seem to think that corrigible (hence not goal-directed) agents are impossible to build. (That’s actually one of MIRI’s research objectives even though they take a different approach from Paul’s.)
I think the point (from Eliezer’s perspective) is “Simply knowing that an agent is intelligent lets us infer that it is an expected utility maximizer”. The main implication is that there is no way to affect the details of a superintelligent AI except by affecting its utility function, since everything else is fixed by math (specifically the VNM theorem). Note that this is (or rather, appears to be) a very strong condition on what alignment approaches could possibly work—you can throw out any approach that isn’t going to affect the AI’s utility function. I think this is the primary reason for Eliezer making this argument. Let’s call this the “intelligence implies EU maximization” claim.
Separately, there is another claim that says “EU maximization by default implies goal-directedness” (or the presence of convergent instrumental subgoals, if you prefer that instead of goal-directedness). However, this is not required by math, so it is possible to avoid this implication, by designing your utility function in just the right way.
Corrigibility is possible under this framework by working against the second claim, i.e. designing the utility function in just the right way that you get corrigible behavior out. And in fact this is the approach to corrigibility that MIRI looked into.
I am primarily taking issue with the “intelligence implies EU maximization” argument. The problem is that “intelligence implies EU maximization” is true; it just happens to be vacuous. So I can’t say that that’s what I’m arguing against. This is why I rounded it off to arguing against “intelligence implies goal-directedness”, though this is clearly a bad enough summary that I shouldn’t be saying it any more.
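The vacuousness claim can be demonstrated directly. This is a toy construction of my own (the policy and trajectory space are invented for illustration): for any behavior whatsoever, there is a utility function over fine-grained world-trajectories that the behavior maximizes.

```python
# Toy version of the vacuousness point (my own construction): for *any*
# observed behavior, define a utility function over fine-grained
# world-trajectories that the behavior maximizes — assign utility 1 to
# exactly the realized trajectory and 0 to everything else.

import random
from itertools import product

def arbitrary_policy(step):
    # Any behavior at all — here, deterministic pseudo-random twitching.
    return random.Random(step).choice(["twitch-left", "twitch-right"])

# The trajectory this policy actually produces over 5 steps:
realized = tuple(arbitrary_policy(s) for s in range(5))

def U(trajectory):
    """Utility function rationalizing the policy: maximal on exactly
    the realized trajectory."""
    return 1.0 if trajectory == realized else 0.0

# The twitching policy is trivially an expected-utility maximizer for U:
all_trajectories = product(["twitch-left", "twitch-right"], repeat=5)
assert U(realized) == max(U(t) for t in all_trajectories) == 1.0
```

Note how the coarse-grained O from Wei Dai's formalization rules this construction out: U here distinguishes trajectories that no optimization process would care to distinguish.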
I think the point (from Eliezer’s perspective) is “Simply knowing that an agent is intelligent lets us infer that it is an expected utility maximizer”.

Eliezer explicitly disclaimed this:
A cognitively powerful agent might not be sufficiently optimized
Scenarios that negate “Relevant powerful agents will be highly optimized”, such as brute forcing non-recursive intelligence, can potentially evade the ‘sufficiently optimized’ condition required to yield predicted coherence. E.g., it might be possible to create a cognitively powerful system by overdriving some fixed set of algorithms, and then to prevent this system from optimizing itself or creating offspring agents in the environment. This could allow the creation of a cognitively powerful system that does not appear to us as a bounded Bayesian. (If, for some reason, that was a good idea.)
In Relevant powerful agents will be highly optimized he went into even more detail about how one might create an intelligent agent that is not “highly optimized” and hence not an expected utility maximizer.
In summary it seems like you misunderstood Eliezer due to not noticing a distinction that he draws between “intelligent” (or “cognitively powerful”) and “highly optimized”.
That’s true; I’m not sure what this distinction is meant to capture. I’m updating that the thing I said is less likely to be true, but I’m still somewhat confident that it captures the general gist of what Eliezer meant. I would bet on this at even odds if there were some way to evaluate it.
In Relevant powerful agents will be highly optimized he went into even more detail about how one might create an intelligent agent that is not “highly optimized” and hence not an expected utility maximizer.
This is a tiny bit of his writing, and his tone makes it clear that this is unlikely. This is different from what I expected (when something has the force of a theorem you don’t usually call its negation just “unlikely” and have a story for how it could be true), but it still seems consistent with the general story I said above.
In any case, I don’t want to spend any more time figuring out what Eliezer believes; he can say something himself if he wants. I mostly replied to this comment to clarify the particular argument I’m arguing against, which I thought Eliezer believed, but even if he doesn’t it seems like a common implicit belief in the rationalist AI safety crowd and should be debunked anyway.
It seems fine to debunk what you think is a common implicit belief in the rationalist AI safety crowd, but I think it’s important to be fair to other researchers and not attribute errors to them when you don’t know or aren’t sure that they actually committed such errors. For people who aren’t domain experts (which is most people), reputation is an important signal for evaluating claims in a technical field like AI safety, so we should take care not to misinform them about, for example, how often someone makes technical errors.
I’m pretty sure I have never mentioned Eliezer in the Value Learning sequence. I linked to his writings because they’re the best explanation of the perspective I’m arguing against. (Note that this is different from claiming that Eliezer believes that perspective.) This post and comment thread attributed the argument and belief to Eliezer, not me. I responded because it was specifically about what I was arguing against in my post, and I didn’t say “I am clarifying the particular argument I am arguing against and am unsure what Eliezer’s actual position is” because a) I did think that it was Eliezer’s actual position, b) this is a ridiculous amount of boilerplate and c) I try not to spend too much time on comments.
I’m not feeling particularly open to feedback currently, because honestly I think I take far more care about this sort of issue than the typical researcher, but if you want to list a specific thing I could have done differently, I might try to consider how to do that sort of thing in the future.
even if he doesn’t it seems like a common implicit belief in the rationalist AI safety crowd and should be debunked anyway.
Agreed. Just a note that in the link that Wei Dai provides for “Relevant powerful agents will be highly optimized”, Eliezer explicitly assigns 75% to ‘The probability that an agent that is cognitively powerful enough to be relevant to existential outcomes, will have been subject to strong, general optimization pressures.’
Yeah, it’s worth noting that I don’t understand what this means. By my intuitive read of the statement, I’d have given it a 95+% chance of being true, in the sense that you aren’t going to randomly stumble upon a powerful agent. But also by my intuitive read, the negative example given on that page would be a positive example:
An example of a scenario that negates RelevantPowerfulAgentsHighlyOptimized is KnownAlgorithmNonrecursiveIntelligence, where a cognitively powerful intelligence is produced by pouring lots of computing power into known algorithms, and this intelligence is then somehow prohibited from self-modification and the creation of environmental subagents.
On my view, known algorithms are already very optimized? E.g. Dijkstra’s algorithm is highly optimized for efficient computation of shortest paths.
So, TL;DR: I don’t know what “optimized” is supposed to mean here.