Your distinction between “outer alignment” and “inner alignment” is both ahistorical and unYudkowskian. It was invented years after this post was written, by someone who wasn’t me; and though I’ve sometimes used the terms on occasions where they seem to fit unambiguously, it’s not something I see as a clear ontological division, especially if you’re talking about questions like “If we own the following kind of black box, would alignment get any easier?” which on my view breaks that ontology. So I strongly reject your frame that this post was “clearly portraying an outer alignment problem” and can be criticized on those grounds by you; that is anachronistic.
You are now dragging in a very large number of further inferences about “what I meant”, and other implications that you think this post has, which are about Christiano-style proposals that were developed years after this post. I have disagreements with those, many disagreements. But it is definitely not what this post is about, one way or another, because this post predates Christiano being on the scene.
What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, that won’t work to say what you want. This point is true! If you then want to take in a bunch of anachronistic ideas developed later, and claim (wrongly imo) that this renders irrelevant the simple truth of what this post actually literally says, that would be a separate conversation. But if you’re doing that, please distinguish the truth of what this post actually says versus how you think these other later clever ideas evade or bypass that truth.
What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, that won’t work to say what you want. This point is true!
Matthew is not disputing this point, as far as I can tell.
Instead, he is trying to critique some version of[1] the “larger argument” (mentioned in the May 2024 update to this post) in which this point plays a role.
You have exhorted him several times to distinguish between that larger argument and the narrow point made by this post:
[...] and if you think that some larger thing is not correct, you should not confuse that with saying that the narrow point this post is about, is incorrect [...]
But if you’re doing that, please distinguish the truth of what this post actually says versus how you think these other later clever ideas evade or bypass that truth.
But it seems to me that he’s already doing this. He’s not alleging that this post is incorrect in isolation.
The only reason this discussion is happening in the comments of this post at all is the May 2024 update at the start of it, which Matthew used as a jumping-off point for saying “my critique of the ‘larger argument’ does not make the mistake referred to in the May 2024 update[2], but people keep saying it does[3], so I’ll try restating that critique again in the hopes it will be clearer this time.”
I say “some version of” to allow for a distinction between (a) the “larger argument” of Eliezer_2007’s which this post was meant to support in 2007, and (b) whatever version of the same “larger argument” was a standard MIRI position as of roughly 2016-2017.
As far as I can tell, Matthew is only interested in evaluating the 2016-2017 MIRI position, not the 2007 EY position (insofar as the latter is different, if in fact it is). When he cites older EY material, he does so as a means to an end – either as indirect evidence of later MIRI positions, or because it was itself cited in the later MIRI material which is his main topic.
Note that the current version of Matthew’s 2023 post includes multiple caveats that he’s not making the mistake referred to in the May 2024 update.
Note also that Matthew’s post only mentions this post in two relatively minor ways: first, to clarify that he doesn’t make the mistake referred to in the update (unlike some “Non-MIRI people” who do), and second, to support an argument about whether “Yudkowsky and other MIRI people” believe that it could be sufficient to get a single human’s values into the AI, or whether something like CEV would be required instead.
I bring up the mentions of this post in Matthew’s post in order to clarify what role “is ‘The Hidden Complexity of Wishes’ correct in isolation, considered apart from anything outside it?” plays in Matthew’s critique – namely, none at all, IIUC.
(I realize that Matthew’s post has been edited over time, so I can only speak to the current version.)
To be fully explicit: I’m not claiming anything about whether the May 2024 update was about Matthew’s 2023 post (alone or in combination with anything else). I’m just rephrasing what Matthew said in the first comment of this thread (which was also agnostic on the topic of whether the update referred to him).
Matthew is not disputing this point, as far as I can tell.
Instead, he is trying to critique some version of[1] the “larger argument” (mentioned in the May 2024 update to this post) in which this point plays a role.
I’ll confirm that I’m not saying this post’s exact thesis is false. This post seems to be largely a parable about a fictional device, rather than an explicit argument with premises and clear conclusions. I’m not saying the parable is wrong. Parables are rarely “wrong” in a strict sense, and I am not disputing this parable’s conclusion.
However, I am saying: this parable presumably played some role in the “larger” argument that MIRI has made in the past. What role did it play? Well, I think a good guess is that it portrayed the difficulty of precisely specifying what you want or intend, for example when explicitly designing a utility function. This problem was often alleged to be difficult because, when you want something complex, it’s difficult to perfectly delineate potential “good” scenarios and distinguish them from all potential “bad” scenarios. This is the problem I was analyzing in my original comment.
While the term “outer alignment” was not invented to describe this exact problem until much later, I was using that term purely as descriptive terminology for the problem this post clearly describes, rather than claiming that Eliezer in 2007 was deliberately describing something that he called “outer alignment” at the time. Because my usage of “outer alignment” was merely descriptive in this sense, I reject the idea that my comment was anachronistic.
And again: I am not claiming that this post is inaccurate in isolation. In both my above comment, and in my 2023 post, I merely cited this post as portraying an aspect of the problem that I was talking about, rather than saying something like “this particular post’s conclusion is wrong”. I think the fact that the post doesn’t really have a clear thesis in the first place means that it can’t be wrong in a strong sense at all. However, the post was definitely interpreted as explaining some part of why alignment is hard — for a long time by many people — and I was critiquing the particular application of the post to this argument, rather than the post itself in isolation.
Here’s an argument that alignment is difficult which uses complexity of value as a subpoint:
A1. If you try to manually specify what you want, you fail.
A2. Therefore, you want something algorithmically complex.
B1. When humanity makes an AGI, the AGI will have gotten values via some process; that process induces some probability distribution over what values the AGI ends up with.
B2. We want to affect the values-distribution, somehow, so that it ends up with our values.
B3. We don’t understand how to affect the values-distribution toward something specific.
B4. If we don’t affect the values-distribution toward something specific, then the values-distribution probably puts a large penalty on absolute algorithmic complexity; any specific utility function with higher absolute algorithmic complexity will be less likely to be the one that the AGI ends up with.
C1. Because of A2 (our values are algorithmically complex) and B4 (a complex utility function is unlikely to show up in an AGI without us skillfully intervening), an AGI is unlikely to have our values without us skillfully intervening.
C2. Because of B3 (we don’t know how to skillfully intervene on an AGI’s values) and C1, an AGI is unlikely to have our values.
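To make the quantitative core of this first argument explicit, here is a minimal sketch of one natural formalization (my gloss, assuming the default values-distribution behaves roughly like a simplicity prior, with K(V) denoting the algorithmic complexity of a candidate utility function V):

\[ P(\text{AGI ends up with values } V) \;\approx\; 2^{-K(V)} \]

On this reading, A2 says that K(V) is large for our values, B4 says the default distribution looks like the right-hand side above, and C1 follows because 2^{-K(V)} is then astronomically small.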
I think that you think that the argument under discussion is something like:
(same) A1. If you try to manually specify what you want, you fail.
(same) A2. Therefore, you want something algorithmically complex.
(same) B1. When humanity makes an AGI, the AGI will have gotten values via some process; that process induces some probability distribution over what values the AGI ends up with.
(same) B2. We want to affect the values-distribution, somehow, so that it ends up with our values.
B′3. The greater the complexity of our values, the harder it is to point at our values.
B′4. The harder it is to point at our values, the more work or difficulty is involved in B2.
C′1. By B′3 and B′4: the greater the complexity of our values, the more work or difficulty is involved in B2 (determining the AGI’s values).
C′2. Because of A2 (our values are algorithmically complex) and C′1, it would take a lot of work to make an AGI pursue our values.
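The quantitative core of this second argument, sketched in the same notation (again my gloss, with “work” left as an informal cost function rather than anything precise):

\[ \mathrm{Work}(\text{pointing the AGI at } V)\ \text{is increasing in}\ K(V) \]

Here A2 implies that step B2 is expensive, rather than implying anything directly about the default probability of landing on our values.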
These are different arguments, which make use of the complexity of values in different ways. You dispute B′3 on the grounds that it can be easy to point at complex values. B′3 isn’t used in the first argument though.
In the situation assumed by your first argument, AGI would be very unlikely to share our values even if our values were much simpler than they are.
Complexity makes things worse, yes, but the conclusion “AGI is unlikely to have our values” is already entailed by the other premises even if we drop the stuff about complexity.
Why: if we’re just sampling some function from a simplicity prior, we’re very unlikely to get any particular nontrivial function that we’ve decided to care about in advance of the sampling event. There are just too many possible functions, and probability mass has to get divided among them all.
In other words, if it takes N bits to specify human values, there are 2^N ways that a bitstring of the same length could be set, and we’re hoping to land on just one of those through luck alone. (And to land on a bitstring of this specific length in the first place, of course.) Unless N is very small, such a coincidence is extremely unlikely.
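To put an illustrative number on this (my arithmetic, with N chosen arbitrarily): a uniform draw over N-bit strings hits any one pre-specified string with probability

\[ 2^{-N}, \qquad \text{e.g. } N = 1000 \ \Rightarrow\ 2^{-1000} \approx 10^{-301} \]

and a simplicity prior assigns a string whose shortest description is N bits a probability of the same rough order.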
And N is not going to be that small; even in the sort of naive and overly simple “hand-crafted” value specifications which EY has critiqued in this post and elsewhere, a lot of details have to be specified. (E.g. some proposals refer to “humans” and so a full algorithmic description of them would require an account of what is and isn’t a human.)
One could devise a variant of this argument that doesn’t have this issue, by “relaxing the problem” so that we have some control, just not enough to pin down the sampled function exactly. And then the remaining freedom is filled randomly with a simplicity bias. This partial control might be enough to make a simple function likely, while not being able to make a more complex function likely. (Hmm, perhaps this is just your second argument, or a version of it.)
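One way to make this relaxed variant concrete (a sketch under the stated assumptions, where k is the number of bits we can reliably pin down out of the N bits that specify the target, and the rest are filled in uniformly at random):

\[ P(\text{target}) \;=\; 2^{-(N-k)} \]

This is non-negligible only when k comes close to N, so a modest amount of control suffices for a simple target (small N) but not for a complex one (large N).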
This kind of reasoning might be applicable in a world where its premises are true, but I don’t think its premises are true in our world.
In practice, we apparently have no trouble getting machines to compute very complex functions, including (as Matthew points out) specifications of human value whose robustness would have seemed like impossible magic back in 2007. The main difficulty, if there is one, is in “getting the function to play the role of the AGI values,” not in getting the AGI to compute the particular function we want in the first place.
The main difficulty, if there is one, is in “getting the function to play the role of the AGI values,” not in getting the AGI to compute the particular function we want in the first place.
Right, that is the problem (and IDK of anyone discussing this who says otherwise).
Another position would be that it’s probably easy to influence a few bits of the AI’s utility function, but not others. For example, it’s conceivable that, by doing capabilities research in different ways, you could increase the probability that the AGI is highly ambitious—e.g. tries to take over the whole lightcone, tries to acausally bargain, etc., rather than being more satisficy. (IDK how to do that, but plausibly it’s qualitatively easier than alignment.) Then you could claim that it’s half a bit more likely that you’ve made an FAI, given that an FAI would probably be ambitious. In this case, it does matter that the utility function is complex.
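A quick gloss on the arithmetic here (my reading, not something stated above): a shift of half a bit corresponds to multiplying the probability by

\[ 2^{1/2} \approx 1.4 \]

which is a tiny boost relative to a default probability that scales like 2^{-K(V)} for a complex utility function V; that gap is one way to cash out why the complexity of the utility function still matters in this scenario.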
While the term “outer alignment” wasn’t coined until later to describe the exact issue that I’m talking about, I was using that term purely as a descriptive label for the problem this post clearly highlights, rather than implying that you were using or aware of the term in 2007.
Because I was simply using “outer alignment” in this descriptive sense, I reject the notion that my comment was anachronistic. I used that term as shorthand for the thing I was talking about, which is clearly and obviously portrayed by your post, that’s all.
To be very clear: the exact problem I am talking about is the inherent challenge of precisely defining what you want or intend, especially (though not exclusively) in the context of designing a utility function. This difficulty arises because, when the desired outcome is complex, it becomes nearly impossible to perfectly delineate between all potential ‘good’ scenarios and all possible ‘bad’ scenarios. This challenge has been a recurring theme in discussions of alignment, as it’s considered hard to capture every nuance of what you want in your specification without missing an edge case.
This problem is manifestly portrayed by your post, using the example of an outcome pump to illustrate. I was responding to this portrayal of the problem, and specifically saying that this specific narrow problem seems easier in light of LLMs, for particular reasons.
It is frankly frustrating to me that, from my perspective, you seem to have reliably missed the point of what I am trying to convey here.
I only brought up Christiano-style proposals because I thought you were changing the topic to a broader discussion, specifically to ask me what methodologies I had in mind when I made particular points. If you had not asked me “So would you care to spell out what clever methodology you think invalidates what you take to be the larger point of this post—though of course it has no bearing on the actual point that this post makes?” then I would not have mentioned those things. In any case, none of the things I said about Christiano-style proposals were intended to critique this post’s narrow point. I was responding to that particular part of your comment instead.
As far as the actual content of this post, I do not dispute its exact thesis. The post seems to be a parable, not a detailed argument with a clear conclusion. The parable seems interesting to me. It also doesn’t seem wrong, in any strict sense. However, I do think that some of the broader conclusions that many people have drawn from the parable seem false, in context. I was responding to the specific way that this post had been applied and interpreted in broader arguments about AI alignment.
My central thesis in regards to this post is simply: the post clearly portrays a specific problem that was later called the “outer alignment” problem by other people. This post portrays this problem as being difficult in a particular way. And I think this portrayal is misleading, even if the literal parable holds up in pure isolation.