Current LLM behavior doesn’t seem to me like much evidence that they care about humans per se.
I’d agree that they evidence some understanding of human values (but the argument is and has always been “the AI knows but doesn’t care”; someone can probably dig up a reference to Yudkowsky arguing this as early as 2001).
I contest that the LLM’s ability to predict how a caring human sounds is much evidence that the underlying cognition cares similarly (insofar as it cares at all).
And even if the underlying cognition did care about the sorts of things you can sometimes get an LLM to write as if it cares about, I’d still expect that to shake out into caring about a bunch of correlates of the stuff we care about, in a manner that comes apart under the extremes of optimization.
(Search terms to read more about these topics on LW, where they’ve been discussed in depth: “a thousand shards of desire”, “value is fragile”.)
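As a toy illustration of the “comes apart under the extremes of optimization” point: the sketch below is not from the original discussion, and the functions `true_value` and `proxy` are made-up assumptions, chosen only so that the two agree closely on a small range and diverge far outside it. It makes no claim about how an LLM’s actual objectives behave.

```python
def true_value(x: float) -> float:
    # Hypothetical "what we actually care about": rises at first, then collapses.
    return x - x ** 2 / 10


def proxy(x: float) -> float:
    # Hypothetical learned correlate: close to true_value when x is small.
    return x


def grid(limit: float, steps: int = 1000) -> list[float]:
    # Evenly spaced candidate actions in [0, limit].
    return [limit * i / steps for i in range(steps + 1)]


def optimize(objective, candidates):
    # Pick the candidate that scores highest under the given objective.
    return max(candidates, key=objective)


# Weak optimization: the search only covers the range where proxy and value agree.
mild = optimize(proxy, grid(1.0))
# Strong optimization: the same proxy, searched over a much larger range.
extreme = optimize(proxy, grid(100.0))
# What an optimizer pointed at the true value would have picked instead.
best = optimize(true_value, grid(100.0))

print("mild:    x =", mild, " true value =", true_value(mild))
print("extreme: x =", extreme, " true value =", true_value(extreme))
print("best:    x =", best, " true value =", true_value(best))
```

In this toy setup the proxy-optimizer does fine while the search range is small, and lands on a strongly negative true value once the range is large, even though the proxy itself never changed. Whether real learned values behave like `proxy` here is exactly what the thread is disputing.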
The fragility-of-value posts are mostly old. They were written before GPT-3 came out (which seemed very good at understanding human language and, consequently, human values), before instruction fine-tuning was successfully employed, and before forms of preference learning like RLHF or Constitutional AI were implemented.
With this background, many arguments in articles like Eliezer’s Complexity of Value (2015) now sound implausible, questionable, or in any case outdated.
I agree that foundation LLMs are just able to predict what a caring human sounds like, but fine-tuned models are no longer pure text predictors. They are biased towards producing particular types of text, which just means they value some of it more than others.
Currently these language models are just Oracles, but a future multimodal version could be capable of perception and movement. Prototypes of this sort do already exist.
Maybe they do not really care at all about what they do seem to care about, i.e. they are deceptive. But as far as I know, there is currently no significant evidence for deception.
Or they might just care about close correlates of what they seem to care about. That is a serious possibility, but given that they seem very good at understanding text from the unsupervised and very data-heavy pre-training phase, a lot of that semantic knowledge does plausibly help with the less data-heavy SL/RL fine-tuning phases, since these also involve text. The pre-trained models have a lot of common sense, which makes the fine-tuning less of a narrow target.
The bottom line is that with the advent of fine-tuned large language models, the following “complexity of value thesis”, from Eliezer’s Arbital article above, is no longer obviously true and requires a modern defense:
The Complexity of Value proposition is true if, relative to viable and acceptable real-world methodologies for AI development, there isn’t any reliably knowable way to specify the AI’s object-level preferences as a structure of low algorithmic complexity, such that the result of running that AI is achieving enough of the possible value, for reasonable definitions of value.
It seems to me that the usual arguments still go through. We don’t know how to specify the preferences of an LLM (relevant search term: “inner alignment”). Even if we did have some slot we could write the preferences into, we don’t have an easy handle/pointer to write into that slot. (Monkeys that are pretty-good-in-practice at promoting genetic fitness, including having some intuitions leading them to sacrifice themselves in-practice for two-ish children or eight-ish cousins, don’t in fact have a clean “inclusive genetic fitness” concept that you can readily make them optimize. An LLM espousing various human moral intuitions doesn’t have a clean concept for pan-sentience CEV such that the universe turns out OK if that concept is optimized.)
Separately, note that the “complexity of value” claim is distinct from the “fragility of value” claim. Value being complex doesn’t mean that the AI won’t learn it (given a reason to). Rather, it suggests that the AI will likely also learn a variety of other things (like “what the humans think they want” and “what the humans’ revealed preferences are given their current unendorsed moral failings”, etc.). This makes pointing to the right concept difficult. “Fragility of value” then separately argues that if you point to even slightly the wrong concept when choosing what a superintelligence optimizes, the total value of the future is likely radically diminished.
To be clear, I’d agree that the use of the phrase “algorithmic complexity” in the quote you give is misleading. In particular, given an AI designed such that its preferences can be specified in some stable way, the important question is whether the correct concept of ‘value’ is simple relative to some language that specifies this AI’s concepts. And the AI’s concepts are ofc formed in response to its entire observational history. Concepts that are simple relative to everything the AI has seen might be quite complex relative to “normal” reference machines that people intuitively think of when they hear “algorithmic complexity” (like the lambda calculus, say). And so it may be true that value is complex relative to a “normal” reference machine, and simple relative to the AI’s observational history, thereby turning out not to pose all that much of an alignment obstacle.
In that case (which I don’t particularly expect), I’d say “value was in fact complex, and this turned out not to be a great obstacle to alignment” (though I wouldn’t begrudge someone else saying “I define complexity of value relative to the AI’s observation-history, and in that sense, value turned out to be simple”).
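One hedged way to write down the distinction in the two paragraphs above uses conditional Kolmogorov complexity. The notation is an assumption introduced here, not taken from the Arbital page: $U$ is a “normal” reference machine (the lambda calculus, say), $v$ is a description of the correct ‘value’ concept, $h$ is the AI’s entire observational history, and $|p|$ is the length of program $p$.

```latex
K_U(v) = \min\{\, |p| : U(p) = v \,\}, \qquad
K_U(v \mid h) = \min\{\, |p| : U(p, h) = v \,\}

% The scenario described above: complex in the usual sense,
% yet simple once the observation history is given.
K_U(v) \ \text{large}, \qquad K_U(v \mid h) \ll K_U(v)
```

On this reading, “value is complex” is a claim about $K_U(v)$, while the alignment-relevant quantity is closer to $K_U(v \mid h)$. Neither quantity is computable, so this is a framing device rather than something one could measure.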
Insofar as you are arguing “(1) the Arbital page on complexity of value does not convincingly argue that this will matter to alignment in practice, and (2) LLMs are significant evidence that ‘value’ won’t be complex relative to the actual AI concept-languages we’re going to get”, I agree with (1), and disagree with (2), while again noting that there’s a reason I deployed the fragility of value (and not the complexity of value) in response to your original question (and am only discussing complexity of value here because you brought it up).
re: (1), I note that the argument is elsewhere (and has the form “there will be lots of nearby concepts” + “getting almost the right concept does not get you almost a good result”, as I alluded to above). I’d agree that one leg of possible support for this argument (namely “humanity will be completely foreign to this AI, e.g. because it is a mathematically simple seed AI that has grown with very little exposure to humanity”) won’t apply in the case of LLMs. (I don’t particularly recall past people arguing this; my impression is rather one of past people arguing that of course the AI would be able to read Wikipedia and stare at some humans and figure out what it needs to about this ‘value’ concept, but the hard bit is in making it care. But it is a way things could in principle have gone, that would have made complexity-of-value much more of an obstacle, and things did not in fact go that way.)
re: (2), I just don’t see LLMs as providing much evidence yet about whether the concepts they’re picking up are compact or correct (cf. monkeys don’t have an IGF concept).
Okay, that clarifies a lot. But the last paragraph I find surprising.
re: (2), I just don’t see LLMs as providing much evidence yet about whether the concepts they’re picking up are compact or correct (cf. monkeys don’t have an IGF concept).
If LLMs are good at understanding the meaning of human text, they must be good at understanding human concepts, since concepts are just meanings of words the LLM understands. Do you doubt they are really understanding text as well as it seems? Or do you mean they are picking up other, non-human, concepts as well, and this is a problem?
Regarding monkeys, they apparently don’t understand the IGF concept because they are not good enough at reasoning abstractly about evolution and unobservable entities (genes), and they lack the relevant empirical knowledge, as humans did until recently. I’m not sure how that would be an argument against advanced LLMs grasping the concepts they seem to grasp.
Monkeys that are pretty-good-in-practice at promoting genetic fitness, including having some intuitions leading them to sacrifice themselves in-practice for two-ish children or eight-ish cousins, don’t in fact have a clean “inclusive genetic fitness” concept that you can readily make them optimize. An LLM espousing various human moral intuitions doesn’t have a clean concept for pan-sentience CEV such that the universe turns out OK if that concept is optimized.
Humans also don’t have a “clean concept for pan-sentience CEV such that the universe turns out OK if that concept is optimized” in our heads. However, we do have a concept of human values in a more narrow sense, and I expect LLMs in the coming years to pick up roughly the same concept during training.
The evolution analogy seems more analogous to an LLM that’s rewarded for telling funny jokes but doesn’t understand what makes a joke funny. So it learns a strategy of repeatedly telling certain popular jokes because those are rated as funny. In that case it’s not surprising that the LLM wouldn’t be funny when taken out of its training distribution. But that’s just because it never learned what humor was to begin with. If the LLM understood the essence of humor during training, then it’s much more likely that the property of being humorous would generalize outside its training distribution.
LLMs will likely learn the concept of human values during training about as well as most humans learn the concept. There’s still a problem of getting LLMs to care and act on those values, but it’s noteworthy that the LLM will understand what we are trying to get it to care about nonetheless.
Inner alignment is a problem, but it seems less of a problem than in the monkey example. The monkey values were trained using a relatively blunt form of genetic algorithm, and monkeys aren’t capable of learning the value “inclusive genetic fitness” anyway, since they can’t understand such a complex concept (and humans didn’t understand it historically). By contrast, advanced base LLMs are presumably able to understand the theory of CEV about as well as a human, and they could be fine-tuned by using that understanding, e.g. with something like Constitutional AI.
In general, the fact that base LLMs have a very good (perhaps even human-level) ability to understand text seems to make the fine-tuning phases more robust, as there is less likelihood of misunderstanding training samples, which would make hitting a fragile target easier. The danger then seems to come more from goal misspecification, e.g. picking the wrong principles for Constitutional AI.