It seems to me that the usual arguments still go through. We don’t know how to specify the preferences of an LLM (relevant search term: “inner alignment”). Even if we did have some slot we could write the preferences into, we don’t have an easy handle/pointer to write into that slot. (Monkeys that are pretty-good-in-practice at promoting genetic fitness, including having some intuitions leading them to sacrifice themselves in-practice for two-ish children or eight-ish cousins, don’t in fact have a clean “inclusive genetic fitness” concept that you can readily make them optimize. An LLM espousing various human moral intuitions doesn’t have a clean concept for pan-sentience CEV such that the universe turns out OK if that concept is optimized.)
Separately, note that the “complexity of value” claim is distinct from the “fragility of value” claim. Value being complex doesn’t mean that the AI won’t learn it (given a reason to). Rather, it suggests that the AI will likely also learn a variety of other things (like “what the humans think they want” and “what the humans’ revealed preferences are given their current unendorsed moral failings”, and so on). This makes pointing to the right concept difficult. “Fragility of value” then separately argues that if you point to even slightly the wrong concept when choosing what a superintelligence optimizes, the total value of the future is likely radically diminished.
To be clear, I’d agree that the use of the phrase “algorithmic complexity” in the quote you give is misleading. In particular, given an AI designed such that its preferences can be specified in some stable way, the important question is whether the correct concept of ‘value’ is simple relative to some language that specifies this AI’s concepts. And the AI’s concepts are of course formed in response to its entire observational history. Concepts that are simple relative to everything the AI has seen might be quite complex relative to “normal” reference machines that people intuitively think of when they hear “algorithmic complexity” (like the lambda calculus, say). And so it may be true that value is complex relative to a “normal” reference machine, and simple relative to the AI’s observational history, thereby turning out not to pose all that much of an alignment obstacle.
In that case (which I don’t particularly expect), I’d say “value was in fact complex, and this turned out not to be a great obstacle to alignment” (though I wouldn’t begrudge someone else saying “I define complexity of value relative to the AI’s observation-history, and in that sense, value turned out to be simple”).
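A rough way to notate that distinction (my own gloss, with the symbols $K_U$ and $h$ introduced just for this sketch, not notation anyone above is using): write $K_U(\cdot)$ for description length relative to a “normal” reference machine $U$, and $h$ for the AI’s observational history. The scenario in which complexity of value turns out not to be much of an obstacle is roughly

$$K_U(\mathrm{value}) \;\text{large}, \qquad K_U(\mathrm{value} \mid h) \;\text{small},$$

i.e. value has a long description in absolute terms, but a short one once everything the AI has already seen is available to condition on.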
Insofar as you are arguing “(1) the arbital page on complexity of value does not convincingly argue that this will matter to alignment in practice, and (2) LLMs are significant evidence that ‘value’ won’t be complex relative to the actual AI concept-languages we’re going to get”, I agree with (1), and disagree with (2), while again noting that there’s a reason I deployed the fragility of value (and not the complexity of value) in response to your original question (and am only discussing complexity of value here because you brought it up).
re: (1), I note that the argument is elsewhere (and has the form “there will be lots of nearby concepts” + “getting almost the right concept does not get you almost a good result”, as I alluded to above). I’d agree that one leg of possible support for this argument (namely “humanity will be completely foreign to this AI, e.g. because it is a mathematically simple seed AI that has grown with very little exposure to humanity”) won’t apply in the case of LLMs. (I don’t particularly recall past people arguing this; my impression is rather one of past people arguing that of course the AI would be able to read wikipedia and stare at some humans and figure out what it needs to about this ‘value’ concept, but the hard bit is in making it care. But it is a way things could in principle have gone, that would have made complexity-of-value much more of an obstacle, and things did not in fact go that way.)
re: (2), I just don’t see LLMs as providing much evidence yet about whether the concepts they’re picking up are compact or correct (cf. monkeys don’t have an IGF concept).
Okay, that clarifies a lot. But the last paragraph I find surprising.
re: (2), I just don’t see LLMs as providing much evidence yet about whether the concepts they’re picking up are compact or correct (cf. monkeys don’t have an IGF concept).
If LLMs are good at understanding the meaning of human text, they must be good at understanding human concepts, since concepts are just the meanings of the words the LLM understands. Do you doubt they are really understanding text as well as it seems? Or do you mean they are picking up other, non-human, concepts as well, and this is a problem?
Regarding monkeys, they apparently don’t understand the IGF concept because they are not good enough at reasoning abstractly about evolution and unobservable entities (genes), and they lack the empirical knowledge that humans themselves lacked until recently. I’m not sure how that would be an argument against advanced LLMs grasping the concepts they seem to grasp.
Monkeys that are pretty-good-in-practice at promoting genetic fitness, including having some intuitions leading them to sacrifice themselves in-practice for two-ish children or eight-ish cousins, don’t in fact have a clean “inclusive genetic fitness” concept that you can readily make them optimize. An LLM espousing various human moral intuitions doesn’t have a clean concept for pan-sentience CEV such that the universe turns out OK if that concept is optimized.
Humans also don’t have a “clean concept for pan-sentience CEV such that the universe turns out OK if that concept is optimized” in our heads. However, we do have a concept of human values in a more narrow sense, and I expect LLMs in the coming years to pick up roughly the same concept during training.
The evolution analogy seems closer to an LLM that’s rewarded for telling funny jokes without understanding what makes a joke funny. So it learns a strategy of repeatedly telling certain popular jokes because those are rated as funny. In that case it’s not surprising that the LLM wouldn’t be funny when taken out of its training distribution. But that’s just because it never learned what humor was to begin with. If the LLM understood the essence of humor during training, then it’s much more likely that the property of being humorous would generalize outside its training distribution.
LLMs will likely learn the concept of human values during training about as well as most humans learn the concept. There’s still a problem of getting LLMs to care and act on those values, but it’s noteworthy that the LLM will understand what we are trying to get it to care about nonetheless.
Inner alignment is a problem, but it seems less of a problem than in the monkey example. The monkey values were trained using a relatively blunt form of genetic algorithm, and monkeys aren’t capable of learning the value “inclusive genetic fitness” anyway, since they can’t understand such a complex concept (and humans didn’t understand it historically). By contrast, advanced base LLMs are presumably able to understand the theory of CEV about as well as a human, and they could be fine-tuned using that understanding, e.g. with something like Constitutional AI.
In general, the fact that base LLMs have a very good (perhaps even human-level) ability to understand text seems to make the fine-tuning phases more robust, since there is less likelihood of misunderstanding training samples, which would make hitting a fragile target easier. The danger then seems to come more from goal misspecification, e.g. picking the wrong principles for Constitutional AI.