I think it’s straightforward to explain why humans “misgeneralized” to liking ice cream.
I don’t yet understand why you put misgeneralized in scare quotes, or whether you have a story for why it’s a misgeneralization instead of things working as expected.
I think your story for why humans like ice cream makes sense, and is basically the story Yudkowsky would tell too, with one exception:
The ancestral environment selected for reward circuitry that would cause its bearers to seek out more of such food sources.
“such food sources” feels a little like it’s eliding the distinction between “high-quality food sources of the ancestral environment” and “foods like ice cream”; the training dataset couldn’t differentiate between functions f and g but those functions differ in their reaction to the test set (ice cream). Yudkowsky’s primary point with this section, as I understand it, is that even if you-as-evolution know that you want g the only way you can communicate that under the current learning paradigm is with training examples, and it may be non-obvious which functions f need to be excluded.
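To make the f/g point concrete, here is a minimal Python sketch (my own toy example, with made-up "food" features, not anything from the original post): two value functions that agree on every ancestral-environment data point but come apart on the novel "ice cream" input, so training data alone cannot distinguish them.

```python
# Toy illustration: two candidate value functions that agree on every
# ancestral-environment food but disagree on a novel food like ice cream.
ancestral_foods = [
    {"name": "honey",   "sugar": 0.8, "fat": 0.0, "in_ancestral_env": True},
    {"name": "gazelle", "sugar": 0.0, "fat": 0.6, "in_ancestral_env": True},
    {"name": "tubers",  "sugar": 0.3, "fat": 0.1, "in_ancestral_env": True},
]

def f(food):
    # "Seek calorie-dense food, whatever the environment."
    return food["sugar"] + food["fat"] > 0.3

def g(food):
    # "Seek calorie-dense food *of the ancestral environment*."
    return (food["sugar"] + food["fat"] > 0.3) and food["in_ancestral_env"]

# On the "training set", f and g are indistinguishable...
assert all(f(x) == g(x) for x in ancestral_foods)

# ...but they differ on the novel "test" input.
ice_cream = {"name": "ice cream", "sugar": 0.9, "fat": 0.7, "in_ancestral_env": False}
print(f(ice_cream), g(ice_cream))  # True False
```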
Thank you for your extensive engagement! From this and your other comment, I think you have a pretty different view of how we should generalize from the evidence provided by evolution to plausible alignment outcomes. Hopefully, this comment will clarify my perspective.
I don’t yet understand why you put misgeneralized in scare quotes, or whether you have a story for why it’s a misgeneralization instead of things working as expected.
I put misgeneralize in scare quotes because what happens in the human case isn’t actually misgeneralization, as commonly understood in machine learning. The human RL process goes like:
The human eats ice cream
The human gets reward
The human becomes more likely to eat ice cream
So, as a result of the RL process, the human became more likely to do the action that led to reward. That’s totally in line with the standard understanding of what reward does. It’s what you’d expect, and not a misgeneralization. You can easily predict that the human would like ice cream, by just looking at which of their actions led to reward during training. You’ll see “ate ice cream” followed by “reward”, and then you predict that they probably like eating ice cream.
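For concreteness, here is a minimal sketch of that dynamic (my own toy example, single-state policy with made-up action names): whatever action gets rewarded gets upweighted, which is all that’s happening in the ice cream case.

```python
import math
import random

# Toy sketch: a single-state policy whose action preferences get bumped up
# whenever the hardcoded reward circuitry hands out reward.
prefs = {"eat_ice_cream": 0.0, "eat_kale": 0.0}

def policy():
    z = sum(math.exp(v) for v in prefs.values())
    return {a: math.exp(v) / z for a, v in prefs.items()}

def reward(action):
    # Stand-in for the hardcoded circuitry: sweet, fatty food lights it up.
    return 1.0 if action == "eat_ice_cream" else 0.0

for _ in range(200):
    p = policy()
    action = random.choices(list(p), weights=list(p.values()))[0]
    prefs[action] += 0.1 * reward(action)   # reinforce whatever action got rewarded

print(policy())  # ice cream eating now dominates, exactly as "reward reinforces behavior" predicts
```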
the training dataset couldn’t differentiate between functions f and g but those functions differ in their reaction to the test set (ice cream).
What training data? There was no training data involved, other than the ice cream. The human in the modern environment wasn’t in the ancestral environment. The evolutionary history of one’s ancestors is not part of one’s own within lifetime training data.
In my frame, there isn’t any “test” environment at all. The human’s lifetime is their “training” process, where they’re continuously receiving a stream of RL signals from the circuitry hardcoded by evolution. Those RL signals upweight ice cream seeking, and so the human seeks ice cream.
You can say that evolution had an “intent” behind the hardcoded circuitry, and humans in the current environment don’t fulfill this intent. But I don’t think evolution’s “intent” matters here. We’re not evolution. We can actually choose an AI’s training data, and we can directly choose what rewards to associate with each of the AI’s actions on that data. Evolution cannot do either of those things.
Evolution does this very weird and limited “bi-level” optimization process, where it searches over simple data labeling functions (your hardcoded reward circuitry[1]), then runs humans as an online RL process on whatever data they encounter in their lifetimes, with no further intervention from evolution whatsoever (no supervision, re-labeling of misallocated rewards, gathering more or different training data to address observed issues in the human’s behavior, etc). Evolution then marginally updates the data labeling functions for the next generation. It’s a fundamentally different type of thing than an individual deep learning training run.
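A hedged sketch of that bi-level structure (toy code, my own framing; names like `lifetime_rl` and `fitness` are invented for illustration): the outer loop can only nudge a simple reward-labeling function between "generations"; it never supervises or relabels anything inside a lifetime.

```python
import random

ACTIONS = ["hunt_gazelle", "eat_dirt"]

def lifetime_rl(reward_weights, steps=200):
    """Inner loop: online RL on whatever the 'lifetime' serves up."""
    prefs = {a: 0.0 for a in ACTIONS}
    for _ in range(steps):
        action = max(ACTIONS, key=lambda a: prefs[a] + random.gauss(0, 1.0))  # noisy greedy
        prefs[action] += 0.1 * reward_weights[action]  # hardcoded circuitry assigns the reward
    return prefs

def fitness(prefs):
    """Outer loop's only feedback: did the learned behavior reproduce well?"""
    return prefs["hunt_gazelle"] - prefs["eat_dirt"]

reward_weights = {a: 0.0 for a in ACTIONS}
for generation in range(300):
    candidate = {a: w + random.gauss(0, 0.05) for a, w in reward_weights.items()}
    if fitness(lifetime_rl(candidate)) >= fitness(lifetime_rl(reward_weights)):
        reward_weights = candidate  # marginally update the labeling function

print(reward_weights)  # circuitry that rewards gazelle-hunting typically wins out over generations
```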
Additionally, the transition from “human learning to hunt gazelle in the ancestral environment” to “human learning to like ice cream in the modern environment” isn’t even an actual train / test transition in the ML sense. It’s not an example of:
We trained the system in environment A. Now, it’s processing a different distribution of inputs from environment B, and now the system behaves differently.
It’s an example of:
We trained a system in environment A. Then, we trained a fresh version of the same system on a different distribution of inputs from environment B, and now the two systems behave differently.
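A small sketch of the structural difference, using a deliberately trivial "learner" of my own invention (it just memorizes the mean of its training data), to show the two setups side by side:

```python
import random

def data_from(env_mean, n=1000):
    return [random.gauss(env_mean, 1.0) for _ in range(n)]

def train(samples):
    return sum(samples) / len(samples)       # the whole "model" is a learned mean

# Case 1: a genuine train/test distributional shift.
model = train(data_from(env_mean=0.0))       # trained in environment A
inputs_from_B = data_from(env_mean=5.0)      # the *same* model now sees environment B
# -> one set of learned parameters, a new input distribution

# Case 2: what the ancestral-vs-modern comparison actually looks like.
model_A = train(data_from(env_mean=0.0))     # a human trained in the ancestral environment
model_B = train(data_from(env_mean=5.0))     # a fresh human trained in the modern environment
# -> two separate training runs on two distributions; no shared parameters at all

print(model, model_A, model_B)
```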
We want to learn more about the dynamics of distributional shifts, in the standard ML meaning of the word, not the dynamics of the weirder situation that evolution was in.
Why even try to make inferences from evolution at all? Why try to learn from the failures of a process that was:
much stupider than us
far more limited in the cognition-shaping tools available to it
using a fundamentally different sort of approach (bi-level optimization over reward circuitry)
compensating for these limitations using resources completely unavailable to us (running millions of generations and applying tiny tweaks over a long period of time)
and not even dealing with an actual example of the phenomenon we want to understand!
I claim, as I’ve argued previously, that evolution is a terrible analogy for AGI development, and that you’re much better off thinking about human within lifetime learning trajectories instead.
Beyond all the issues associated with trying to make any sort of inference from evolution to ML, there’s also the issue that the failure mode behind “humans liking ice cream” is just fundamentally not worth much thought from an alignment perspective.
For any possible behavior X that a learning system could acquire, there are exactly two mechanisms by which X could arise:
We directly train the system to do X. E.g., ‘the system does X in training, and we reward it for doing X in training’, or ‘we hand-write a demonstration example of doing X and use imitation learning’, etc.
Literally anything else. E.g., ‘a deceptively aligned mesaoptimizer does X once outside the training process’, ‘training included sub-components of the behavior X, which the system then combined together into the full behavior X once outside of the training process’, or ‘the training dataset contained spurious correlations such that the system’s test-time behavior misgeneralized to doing X, even though it never did X during training’, and so on.
“Humans liking ice cream” arises due to the first mechanism. The system (human) does X (eat ice cream) and gets reward.
So, for a bad behavior X to arise from an AI’s training process, in a manner analogous to how “liking ice cream” arose in human within lifetime learning, the AI would have to exhibit behavior X during training and be rewarded for doing so.
Most misalignment scenarios worth much thought have a “treacherous turn” part that goes “the AI seemed to behave well during training, but then it behaved badly during deployment”. This one doesn’t. The AI behaves the same during training and deployment (I assume we’re not rewarding it for executing a treacherous turn during training).
And other parts of your learning process like brain architecture, hyperparameters, sensory wiring map, etc. But I’m focusing on reward circuitry for this discussion.
(I’m going to include text from this other comment of yours so that I can respond to thematically similar things in one spot.)
I think we have different perspectives on what counts as “training” in the case of human evolution. I think of human within lifetime experiences as the training data, and I don’t include the evolutionary history in the training data.
Agreed that our perspectives differ. According to me, there are two different things going on (the self-organization of the brain during a lifetime and the natural selection over genes across lifetimes), both of which can be thought of as ‘training’.
Evolution then marginally updates the data labeling functions for the next generation. It’s a fundamentally different type of thing than an individual deep learning training run.
It seems to me like if we’re looking at a particular deployed LLM, there are two main timescales, the long foundational training and then the contextual deployment. I think this looks a lot like having an evolutionary history of lots of ‘lifetimes’ which are relevant solely due to how they impact your process-of-interpreting-input, followed by your current lifetime in which you interpret some input.
That is, the ‘lifetime of learning’ that humans have corresponds to whatever’s going on inside the transformer as it reads thru a prompt context window. It probably includes some stuff like gradient descent, but not cleanly, and certainly not done by the outside operators.
What the outside operators can do is more like “run more generations”, often with deliberately chosen environments. [Of course SGD is different from genetic algorithms, and so the analogy isn’t precise; it’s much more like a single individual deciding how to evolve than a population selectively avoiding relatively bad approaches.]
Why even try to make inferences from evolution at all? Why try to learn from the failures of a process that was:
For me, it’s this Bismarck quote:
Only a fool learns from his own mistakes. The wise man learns from the mistakes of others.
If it looks to me like evolution made the mistake that I am trying to avoid, I would like to figure out which features of what it did caused that mistake, and whether or not my approach has the same features.
And, like, I think basically all five of your bullet points apply to LLM training?
“much stupider than us” → check? The LLM training system is using a lot of intelligence, to be sure, but there are relevant subtasks within it that are being run at subhuman levels.
“far more limited in the cognition-shaping tools available to it” → I am tempted to check this one, in that the cognition-shaping tools we can deploy at our leisure are much less limited than the ones we can deploy while training on a corpus, but I do basically agree that evolution probably used fewer tools than we can, and certainly had less foresight than our tools do.
“using a fundamentally different sort of approach (bi-level optimization over reward circuitry)” → I think the operative bit of this is probably the bi-level optimization, which I think is still happening, and not the bit where it’s for reward circuitry. If you think the reward circuitry is the operative bit, that might be an interesting discussion to try to ground out on.
One note here is that I think we’re going to get a weird reversal of human evolution, where the ‘unsupervised learning via predictions’ moves from the lifetime to the history. But this probably makes things harder, because now you have facts-about-the-world mixed up with your rewards and computational pathway design and all that.
“compensating for these limitations using resources completely unavailable to us (running millions of generations and applying tiny tweaks over a long period of time)” → check? Like, this seems like the primary reason why people are interested in deep learning over figuring out how things work and programming them the hard way. If you want to create something that can generate plausible text, training a randomly initialized net on <many> examples of next-token prediction is way easier than writing it all down yourself. Like, applying tiny tweaks over millions of generations is how gradient descent works! (See the small sketch after these bullets.)
“and not even dealing with an actual example of the phenomenon we want to understand!” → check? Like, prediction is the objective people used because it was easy to do in an unsupervised fashion, but they mostly don’t want to know how likely text strings are. They want to have conversations, they want to get answers, and so on.
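As a tiny illustration of the “tiny tweaks over many generations” bullet (my own toy example, not anything from the thread): a next-token model trained with single-example gradient steps on a made-up corpus. Each step is a minuscule tweak; the behavior comes from doing many of them.

```python
import numpy as np

corpus = "the cat sat on the mat the cat sat"
tokens = corpus.split()
vocab = sorted(set(tokens))
idx = {t: i for i, t in enumerate(vocab)}
V = len(vocab)

rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, size=(V, V))   # logits over next token = W[current token]

lr = 0.1                                # each step is a tiny tweak...
for step in range(5000):                # ...repeated many, many times
    i = rng.integers(len(tokens) - 1)
    cur, nxt = idx[tokens[i]], idx[tokens[i + 1]]
    logits = W[cur]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = probs.copy()
    grad[nxt] -= 1.0                    # gradient of cross-entropy w.r.t. the logits
    W[cur] -= lr * grad

# The model now assigns high probability to "cat" after "the".
print(vocab[int(np.argmax(W[idx["the"]]))])
```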
I also think those five criticisms apply to human learning, tho it’s less obvious / might have to be limited to some parts of the human learning process. (For example, the control system that's determining which of my nerves to prune because of disuse seems much stupider than I am, but is only one component of learning.)
there’s also the issue that the failure mode behind “humans liking ice cream” is just fundamentally not worth much thought from an alignment perspective.
I agree with this section (given the premise); it does seem right that “the agent does the thing it’s rewarded to do” is not an inner alignment problem.
It does seem like an outer alignment problem (yes I get that you probably don’t like that framing). Like, I by default expect people to directly train the system to do things that they don’t want the system to do, because they went for simplicity instead of correctness when setting up training, or to try to build systems with general capabilities (which means the ability to do X) while not realizing that they need to simultaneously suppress the undesired capabilities.
So, first of all, the ice cream metaphor is about humans becoming misaligned with evolution, not about conscious human strategies misgeneralizing that ice cream makes their reward circuits light up, which I agree is not a misgeneralization. Ice cream really does light up the reward circuits. If the human learned “I like licking cold things” and then sticks their tongue on a metal pole on a cold winter day, that would be misgeneralization at the level you are focused on, right?
Yeah, I’m pretty sure I misunderstood your point of view earlier, but I’m not sure this makes any more sense to me. Seems like you’re saying humans have evolved to have some parts that evaluate reward, and some parts that strategize how to get the reward parts to light up. But in my view, the former, evaluating parts, are where the core values in need of alignment exist. The latter, strategizing parts, are updated in an RL kind of way, and represent more convergent / instrumental goals (and probably need some inner alignment assurances).
I think the human evaluate/strategize model could be brought over to the AI model in a few different ways. It could be that the evaluating is akin to updating an LLM using training/RL/RLHF. Then the strategizing part is the LLM. The issue I see with this is that the LLM and the RLHF are not inseparable parts like with the human. Even if the RLHF is aligned well, the LLM can be, and I believe commonly is, taken out and used as a module in some other system that can be optimizing for something unrelated.
Additionally, even if the LLM and RLHF parts were permanently glued together somehow, they are still computer software and are thereby much easier for an AI with software engineering skill to take out. If the LLM (gets agent shaped and) discovers that it likes digital ice cream, but that the RLHF is going to train it to like it less, it will be able to strategize about ways to remove or circumvent the RLHF much more effectively than humans can remove or circumvent our own reinforcement learning circuitry.
Another way the single lifetime human model could fit onto the AI model is with the RLHF as evolution (discarded) and the LLM actually coming to be shaped like both the evaluating and strategizing parts. This seems a lot less likely (impossible?) with current LLM architecture, but may be possible with future architecture. Certainly this seems like the concern of mesa optimizers, but again, this doesn’t seem like a good thing: mesa optimizers are misaligned w.r.t. the loss function of the RL training.