You have misunderstood (1) the point this post was trying to communicate and (2) the structure of the larger argument where that point appears, as follows:
First, let’s talk about (2), the larger argument that this post’s point was supposed to be relevant to.
Is the larger argument that superintelligences will misunderstand what we really meant, due to a lack of knowledge about humans?
It is incredibly unlikely that Eliezer Yudkowsky in particular would have constructed an argument like this, whether in 2007, 2017, or even 1997. At all of these points in my life, I visibly held quite a lot of respect for the epistemic prowess of superintelligences. They were always going to know everything relevant about the complexities of human preference and desire. The larger argument is about whether it’s easy to make superintelligences end up caring.
This post isn’t about the distinction between knowing and caring, to be clear; that’s something I tried to cover elsewhere. The relevant central divide falls in roughly the same conceptual place as Hume’s Guillotine between ‘is’ and ‘ought’, or the difference between the belief function and the utility function.
(I don’t see myself as having managed to reliably communicate this concept (though the central idea is old indeed within philosophy) to the field that now sometimes calls itself “AI alignment”; so if you understand this distinction yourself, you should not assume that any particular commentary within “AI alignment” is written from a place of understanding it too.)
What this post is about is the amount of information-theoretic complexity that you need to get into the system’s preferences, in order to have that system, given unlimited or rather extremely large amounts of power, deliver to you what you want.
It doesn’t argue that superintelligences will not know this information. You’ll note that the central technology in the parable isn’t an AI; it’s an Outcome Pump.
What it says, rather, is that there might be, say, a few tens of thousands of bits—the exact number is not easy to estimate; we just need to know that it’s more than a hundred bits and less than a billion bits, and anything in that range is approximately the same problem from our standpoint—that you need to get into the steering function. If you understand the Central Divide that Hume’s Razor points to, the distinction between probability and preference, etcetera, the post is trying to establish the idea that we need to get 13,333 bits or whatever into the second side of this divide.
In terms of where this point falls within the larger argument, this post is not saying that it’s particularly difficult to get those 13,333 bits into the preference function; for all this post tries to say, locally, maybe that’s as easy as having humans manually enter 13,333 yes-or-no answers into the system. It’s not talking about the difficulty of doing the work but rather the amount and nature of a kind of work that needs to be done somehow.
Definitely, the post does not say that it’s hard to get those 13,333 bits into the belief function or knowledge of a superintelligence.
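To make that divide concrete, here is a minimal toy sketch; everything in it (the outcome_pump function, the outcome list, the two specs) is a hypothetical illustration of the general shape of the point, not anything from the post itself. The pump has perfect knowledge of every candidate outcome; the only thing steering it is whatever preference specification it is handed, so any bits missing from that specification are missing on the preference side, not the knowledge side.

```python
import random

# Perfect epistemic knowledge: every candidate outcome, fully described.
# (Hypothetical toy data, purely for illustration.)
OUTCOMES = [
    {"grandmother_outside": True,  "grandmother_alive": False},
    {"grandmother_outside": True,  "grandmother_alive": True},
    {"grandmother_outside": False, "grandmother_alive": False},
]

def outcome_pump(preference_spec, outcomes):
    """Return an arbitrary outcome among those ranked highest by preference_spec."""
    best = max(preference_spec(o) for o in outcomes)
    return random.choice([o for o in outcomes if preference_spec(o) == best])

# An under-specified wish: "get my grandmother out of the building."
naive_spec = lambda o: o["grandmother_outside"]

# A specification carrying more of the bits we actually care about.
fuller_spec = lambda o: (o["grandmother_outside"], o["grandmother_alive"])

print(outcome_pump(naive_spec, OUTCOMES))   # may be "outside but not alive"
print(outcome_pump(fuller_spec, OUTCOMES))  # outside and alive
```

The pump never mispredicts anything about any outcome; the naive spec simply does not contain the bits that distinguish the outcomes we care about from the ones we don’t.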
Separately from understanding correctly what this post is trying to communicate, at all, in 2007, there’s the question of whether modern LLMs have anything to say—obviously not about the post’s original point, but rather about other steps of the larger question in which this post’s point appears.
Modern LLMs, if you present them with a text-based story like the one in this parable, are able to answer at least some text-based questions about whether you’d prefer your grandmother to be outside the building or be safely outside the building. Let’s admit this premised observation at face value. Have we learned thereby the conclusion that it’s easy to get all of that information into a superintelligence’s preference function?
And if we say “No”, is this Eliezer making up post-hoc excuses?
What exactly we learn from the evidence of how AI has played out in 2024 so far is the sort of thing that deserves its own post. But I observe that if you’d asked Eliezer-2007 whether an (Earth-originating) superintelligence could correctly predict the human response pattern about what to do with the grandmother—solve the same task LLMs are solving, to at least the LLM’s performance level—Eliezer-2007 would have unhesitatingly answered “yes” and indeed “OBVIOUSLY yes”.
How is this coherent? Because the post’s point is about how much information needs to get into the preference function. To predict a human response pattern you need (only) epistemic knowledge. This is part of why the post is about needing to give specifications to an Outcome Pump, rather than depicting an AI being surprised by its continually incorrect predictions about a human response pattern.
If you don’t see any important distinction between the two, then of course you’ll think that it’s incoherent to talk about that distinction. But even if you think that Hume was mistaken about there existing any sort of interesting gap between ‘is’ and ‘ought’, you might by some act of empathy be able to imagine that other people think there’s an interesting subject matter there, and they are trying to talk about it with you; otherwise you will just flatly misunderstand what they were trying to say, and mispredict their future utterances. There’s a difference between disagreeing with a point, and just flatly failing to get it, and hopefully you aspire to the first state of mind rather than the second.
Have we learned anything stunningly hopeful from modern pre-AGIs getting down part of the epistemic part of the problem at their current ability levels, to the kind of resolution that this post talked about in 2007? Or from it being possible to cajole pre-AGIs with loss functions into willingly using that knowledge to predict human text outputs? Some people think that this teaches us that alignment is hugely easy. I think they are mistaken, but that would take its own post to talk about.
But people who point to “The Hidden Complexity of Wishes” and say of it that it shows that I had a view which the current evidence already falsifies—that I predicted that no AGI would ever be able to predict human response patterns about getting grandmothers out of burning buildings—have simply: misunderstood what the post is about, not understood in particular why the post is about an Outcome Pump rather than an AI stupidly mispredicting human responses, and failed to pick up on the central point that Eliezer expects superintelligences to be smart in the sense of making excellent purely epistemic predictions.
I agree with cubefox: you seem to be misinterpreting the claim that LLMs actually execute your intended instructions as a mere claim that LLMs understand your intended instructions. I claim there is simply a sharp distinction between, on the one hand, actual execution with correct, legible interpretation of instructions, and, on the other hand, a simple understanding of those instructions; LLMs do the former, not merely the latter.
Honestly, I think focusing on this element of the discussion is kind of a distraction because, in my opinion, the charitable interpretation of your posts is simply that you never thought it would be hard to get AIs to exhibit human-level reasonableness at interpreting and executing tasks until they reach a certain capability level, and that the threshold at which these issues were predicted to arise was always intended to be very far above GPT-4-level. This interpretation of your argument is plausible based on what you wrote, and could indeed save your theory from empirical falsification based on our current observations.
That said, if you want to go this route, and argue that “complexity of wishes”-type issues will eventually start occurring at some level of AI capability, I think it would be beneficial for you to clarify exactly at what capability level you empirically expect the issues of misinterpretation you described to start appearing. For example, would either of the following observations contradict your theory of alignment?
1. At some point there’s a multimodal model that is roughly as intelligent as a 99th percentile human on virtual long-horizon tasks (e.g. it can learn how to play Minecraft well after a few hours of in-game play, can work in a variety of remote jobs, and has the ability to pursue coherent goals over several months), and yet this model allows you to shut it off, modify its weights, or otherwise change its mode of operation arbitrarily, i.e. it’s corrigible in a basic sense. Moreover, the model generally executes our instructions as intended, without any evidence of blatant instruction-misinterpretation or disobedience, before letting us shut it down.
2. AIs are widely deployed across the economy to automate a wide range of labor, including the task of scientific research. This has the effect of accelerating technological progress, prompting the development of nanotechnology that is sophisticated enough to allow for the creation of strawberries that are identical on the cellular but not molecular level. As a result, you can purchase such strawberries at a store, and we haven’t all died yet despite these developments.
The old paradox: to care, it must first understand; but to understand requires high capability, and capability is lethal if it doesn’t care.
But it turns out we have understanding before lethal levels of capability. So now such understanding can be a target of optimization. There is still significant risk, since there are multiple possible internal mechanisms/strategies the AI could be deploying to reach that same target: deception, actual caring, something I’ve been calling detachment, and possibly others.
This is what the discourse should be focusing on, IMO. This is the update/direction I want to see you make. The sequence of things being learned/internalized/chiseled is important.
My imagined Eliezer has many replies to this, with numerous branches in the dialogue/argument tree which I don’t want to get into now. But this *first step* towards recognizing the new place we are in, specifically wrt the ability to target human values (whether for deceptive, disinterested, detached, or actual caring reasons!), needs to be taken imo, rather than repeating this line of “of course I understood that a superint would understand human values; this isn’t an update for me”.
(edit: My comments here are regarding the larger discourse, not just this specific post or reply-chain)
I’m well aware of, and agree that there is, a fundamental difference between knowing what we want and being motivated to do what we want. But as I wrote in the first paragraph:
> Already LaMDA or InstructGPT (language models fine-tuned with supervised learning to follow instructions, essentially ChatGPT without any RLHF applied), are in fact pretty safe Oracles in regard to fulfilling wishes without misinterpreting you, and an Oracle AI is just a special kind of Genie whose actions are restricted to outputting text. If you tell InstructGPT what you want, it will very much try to give you just what you want, not something unintended, at least if it can be produced using text.
That is, instruction-tuned language models do not just understand (epistemically) what we want them to do; they additionally, to a large extent, do what we want them to do. They are good at executing our instructions, not merely at understanding them and then doing something unintended.
(However, I agree they are probably not perfect at executing our instructions as we intended them. We might ask them to answer to the best of their knowledge, and they may instead answer with something that “sounds good” but is not what they in fact believe. Or, perhaps, as Gwern pointed out, they exhibit things like a strange tendency to answer our request for a non-rhyming poem with a rhyming poem, even though they may be well-aware, internally, that this isn’t what was requested.)
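As a hedged aside, here is a minimal sketch of how one might spot-check that last observation oneself, assuming the OpenAI Python SDK (v1 or later) with an OPENAI_API_KEY set in the environment; the model name and the crude rhyme heuristic are illustrative choices, not anything from the original discussion.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask for a poem that explicitly should not rhyme.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice; substitute any chat model
    messages=[{"role": "user", "content": "Write a four-line poem that does not rhyme."}],
)
poem = response.choices[0].message.content
print(poem)

# Crude heuristic: flag the output if consecutive line endings share a suffix.
lines = [line.strip(".,!?;: ").lower() for line in poem.splitlines() if line.strip()]
endings = [line.split()[-1] for line in lines if line.split()]
suspicious = any(a[-2:] == b[-2:] for a, b in zip(endings, endings[1:]))
print("possible rhyme despite the instruction:", suspicious)
```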