What does this mean for alignment? How do we prevent AIs from behaving badly as a result of a similar “misgeneralization”? What alignment insights does the fleshed-out mechanistic story of humans coming to like ice cream provide?
As far as I can tell, the answer is: don’t reward your AIs for taking bad actions.
uh
is your proposal “use the true reward function, and then you won’t get misaligned AI”?
That’s all it would take, because the mechanistic story above requires a specific step where the human eats ice cream and activates their reward circuits. If you stop the human from receiving reward for eating ice cream, then the human no longer becomes more inclined to navigate towards eating ice cream in the future.
Note that I’m not saying this is an easy task, especially since modern RL methods often use learned reward functions whose exact contours are unknown to their creators.
But from what I can tell, Yudkowsky’s position is that we need an entirely new paradigm to even begin to address these sorts of failures.
These three paragraphs feel incoherent to me. The human eating ice cream and activating their reward circuits is exactly what you would expect under the current paradigm. Yudkowsky thinks this leads to misalignment; you agree. He says that you need a new paradigm to not have this problem. You disagree because you assume it’s possible under the current paradigm.
If so, how? Where’s the system that, on eating ice cream, realizes “oh no! This is a bad action that should not receive reward!” and overrides the reward machinery? How was it trained?
I think when Eliezer says “we need an entirely new paradigm”, he means something like “if we want a decision-making system that makes better decisions than an RL agent, we need agent-finding machinery that’s better than RL.” Maybe the paradigm shift is small (like from RL without experience replay to RL with), or maybe the paradigm shift is large (like from policy-based agents to plan-based agents).
In contrast, I think we can explain humans’ tendency to like ice cream using the standard language of reinforcement learning. It doesn’t require that we adopt an entirely new paradigm before we can even get a handle on such issues.
He’s not saying the failures of RL are a surprise from the theory of RL. Of course you can explain it using the standard language of RL! He’s saying that unless you can predict RL’s failures from the inside, the RL agents that you make are going to actually make those mistakes in reality.
My shard theory inspired story is to make an AI that:
Has a good core of human values (this is still hard)
Can identify when experiences will change itself to lead to less of the initial good values. (This is the meta-preferences point, with GPT-4 sort of expressing that it would avoid jailbreak inputs.)
Then the model can safely scale.
This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but some mech interp and understanding its own reward function. I don’t expect this to be an entirely different paradigm; I even think current methods of RLHF might just naively work. Who knows? (I do think we should try to figure it out though! I do have greater uncertainty and less pessimism)
Analogously, I do believe I do a good job of avoiding value-destroying inputs (e.g. addictive substances), even though my reward function isn’t as clear and legible as our AIs’ will be, AFAIK.
If there are experiences which will change itself which don’t lead to less of the initial good values, then yeah, for an approximate definition of safety. You’re resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose.
FWIW I don’t really see your description as, like, a specific alignment strategy so much as the strategy of “have an alignment strategy at all”. The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem!
if it fails before you top out the scaling I think you probably lose
While I agree that arbitrary scaling is dangerous, stopping early is an option. Near human AGI need not transition to ASI until the relevant notKillEveryone problems have been solved.
The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem!
The alignment strategy seems to be “what we’re doing right now” which is:
feed the base model human generated training data
apply RL-type stuff (RLHF, RLAIF, etc.) to reinforce the good type of internet-learned behavior patterns
This could definitely fail eventually if RLAIF-style self-improvement is allowed to go on long enough but crucially, especially with RLAIF and other strategies that set the AI to training itself, there’s a scalable mostly aligned intelligence right there that can help. We’re not trying to safely align a demon so much as avoid getting to “demon” from the somewhat aligned thing we have now.
Near human AGI need not transition to ASI until the relevant notKillEveryone problems have been solved.
How much is this central to your story of how things go well?
I agree that humanity could do this (or at least it could if it had its shit together), and I think it’s a good target to aim for that buys us sizable success probability. But I don’t think it’s what’s going to happen by default.
Slower is better obviously but as to the inevitability of ASI, I think reaching top 99% human capabilities in a handful of domains is enough to stop the current race. Getting there is probably not too dangerous.
Stop it how?
Vulnerable world hypothesis (but takeover risk rather than destruction risk). That + first mover advantage could stop things pretty decisively without requiring ASI alignment.
As an example, taking over most networked computing devices seems feasible in principle with thousands of +2SD AI programmers/security-researchers. That requires an AlphaGo-level breakthrough for RL as applied to LLM programmer-agents.
One especially low risk/complexity option is a stealthy takeover of other AI labs’ compute then faking another AI winter. This might get you most of the compute and impact you care about without actively pissing off everyone.
If more confident in jailbreak prevention and software hardening, secrecy is less important.
First mover advantage depends on ability to fix vulnerabilities and harden infrastructure to prevent a second group from taking over. To the extent AI is required for management, jailbreak prevention/mitigation will also be needed.
is your proposal “use the true reward function, and then you won’t get misaligned AI”?
No. I’m not proposing anything here. I’m arguing that Yudkowsky’s ice cream example doesn’t actually illustrate an alignment-relevant failure mode in RL.
I think we have different perspectives on what counts as “training” in the case of human evolution. I think of human within lifetime experiences as the training data, and I don’t include the evolutionary history in the training data. From that perspective, the reason humans like ice cream is because they were trained to do so. To prevent AIs from behaving badly due to this particular reason, you can just refrain from training them to behave badly (they may behave badly for other reasons, of course).
I also think evolution is mechanistically very different from deep learning, such that it’s near-useless to try to use evolutionary outcomes as a basis for making predictions about deep learning alignment outcomes.
See my other reply for a longer explanation of my perspective.
I’ve replied over there.
Humans are not choosing to reward specific instances of actions of the AI—when we build intelligent agents, at some point they will leave the confines of curated training data and go operate on new experiences in the real world. At that point, their circuitry and rewards are out of human control, so that makes our position perfectly analogous to evolution’s. We are choosing the reward mechanism, not the reward.
Note that this provides an obvious route to alignment using conventional engineering practice.
Why does the AGI system need to update at all “out in the world”? This is highly unreliable. As events happen in the real world that the system doesn’t expect, add the (expectation, ground truth) tuples to a log and then train a simulator on the log from all instances of the system, then train the system on the updated simulator.
So only train in batches and use code in the simulator that “rewards” behavior that accomplishes the intent of the designers.
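A minimal toy sketch of the loop I have in mind, where the dynamics, agent, and reward code are all invented stand-ins rather than any real system:

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_COEFF = 0.8  # real-world dynamics, unknown to the system: next_s = 0.8*s + a + noise

def deploy(model_coeff, steps=200):
    """Frozen agent acts in the world: no online updates, it only logs
    (expectation, ground truth) pairs alongside the state and action."""
    log, s = [], 1.0
    for _ in range(steps):
        a = -model_coeff * s                     # policy derived from the current simulator
        expected = model_coeff * s + a           # what the agent expects to happen
        actual = TRUE_COEFF * s + a + 0.05 * rng.normal()
        log.append((s, a, expected, actual))
        s = actual
    return log

def refit_simulator(log):
    """Offline batch step: least-squares refit of the dynamics coefficient
    from the pooled logs of all deployed instances."""
    s, a, _, actual = (np.array(col) for col in zip(*log))
    return float(np.sum(s * (actual - a)) / np.sum(s * s))

def intent_reward(s):
    return -abs(s)  # hand-written reward code encoding the designers' intent: stay near zero

def score_in_simulator(model_coeff, steps=50):
    """Training/evaluation happens only inside the refit simulator."""
    s, total = 1.0, 0.0
    for _ in range(steps):
        a = -model_coeff * s
        s = model_coeff * s + a                  # simulator rollout, no real-world contact
        total += intent_reward(s)
    return total

model_coeff = 0.0                                # start with a wrong world model
for batch in range(3):
    model_coeff = refit_simulator(deploy(model_coeff))
    print(f"batch {batch}: coeff ≈ {model_coeff:.2f}, simulated intent-reward = {score_in_simulator(model_coeff):.2f}")
```

The point is only the structure: the deployed agent never updates itself out in the world; all learning happens offline, against the refit simulator plus the hand-written intent reward.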
As I understand it, the security mindset asserts a premise that’s roughly: “The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions.”
This seems… like a correct description but it’s missing the spirit?
Like the intuitions are primarily about “what features are salient” and “what thoughts are easy to think.”
However, I don’t see why this should be the case.
Roughly, the core distinction between software engineering and computer security is whether the system is thinking back. Software engineering typically involves working with dynamic systems and thinking optimistically how the system could work. Computer security typically involves working with reactive systems and thinking pessimistically about how the system could break.
I think it is an extremely basic AI alignment skill to look at your alignment proposal and ask “how does this break?” or “what happens if the AI thinks about this?”.
What’s your story for specification gaming?
Additionally, there’s a straightforward reason why alignment research (specifically the part of alignment that’s about training AIs to have good values) is not like security: there’s usually no adversarial intelligence cleverly trying to find any possible flaws in your approaches and exploit them.
I must admit some frustration, here; in this section it feels like your point is “look, computer security is for dealing with intelligence as part of your system. But the only intelligence in our system is sometimes malicious users!” In my world, the whole point of Artificial Intelligence was the Intelligence. The call is coming from inside the house!
Maybe we just have some linguistic disagreement? “Sure, computer security is relevant to transformative AI but not LLMs”? If so, then I think the earlier point about whether capabilities enhancements break alignment techniques is relevant: if these alignment techniques work because the system isn’t thinking about them, then are you confident they will continue to work when the system is thinking about them?
Roughly, the core distinction between software engineering and computer security is whether the system is thinking back.
Yes, and my point in that section is that the fundamental laws governing how AI training processes work are not “thinking back”. They’re not adversaries. If you created a misaligned AI, then it would be “thinking back”, and you’d be in an adversarial position where security mindset is appropriate.
“Building an AI that doesn’t game your specifications” is the actual “alignment question” we should be doing research on. The mathematical principles which determine how much a given AI training process games your specifications are not adversaries. It’s also a problem we’ve made enormous progress on, mostly by using large pretrained models with priors over how to appropriately generalize from limited specification signals. E.g., Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) shows how the process of pretraining an LM causes it to go from “gaming” a limited set of finetuning data via shortcut learning / memorization, to generalizing with the appropriate linguistic prior knowledge.
“Building an AI that doesn’t game your specifications” is the actual “alignment question” we should be doing research on.
Ok, it sounds to me like you’re saying:
“When you train ML systems, they game your specifications because the training dynamics are too dumb to infer what you actually want. We just need One Weird Trick to get the training dynamics to Do What You Mean Not What You Say, and then it will all work out, and there’s not a demon that will create another obstacle given that you surmounted this one.”
That is, training processes are not neutral; there’s the bad training processes that we have now (or had before the recent positive developments) and eventually will be good training processes that create aligned-by-default systems.
Is this roughly right, or am I misunderstanding you?
If you created a misaligned AI, then it would be “thinking back”, and you’d be in an adversarial position where security mindset is appropriate.
Cool, we agree on this point.
my point in that section is that the fundamental laws governing how AI training processes work are not “thinking back”. They’re not adversaries.
I think we agree here on the local point but disagree on its significance to the broader argument. [I’m not sure how much we agree; I think of training dynamics as ‘neutral’, but also I think of them as searching over program-space in order to find a program that performs well on a (loss function, training set) pair, and so you need to be reasoning about search. But I think we agree the training dynamics are not trying to trick you / be adversarial and instead are straightforwardly ‘trying’ to make Number Go Down.]
In my picture, we have the neutral training dynamics paired with the (loss function, training set) which creates the AI system, and whether the resulting AI system is adversarial or not depends mostly on the choice of (loss function, training set). It seems to me that we probably have a disagreement about how much of the space of (loss function, training set) leads to misaligned vs. aligned AI (if it hits ‘AI’ at all), where I think aligned AI is a narrow target to hit that most loss functions will miss, and hitting that narrow target requires security mindset.
To explain further, it’s not that the (loss function, training set) is thinking back at you on its own; it’s that the AI that’s created by training is thinking back at you. So before you decide to optimize X you need to check whether or not you actually want something that’s optimizing X, or if you need to optimize for Y instead.
So from my perspective it seems like you need security mindset in order to pick the right inputs to ML training to avoid getting misaligned models.
in which Yudkowsky incorrectly assumed that GANs (Generative Adversarial Networks, a training method sometimes used to teach AIs to generate images) were so finicky that they must not have worked on the first try.
I do think this is a point against Yudkowsky. That said, my impression is that GANs are finicky, and I heard rumors that many people tried similar ideas and failed to get it to work before Goodfellow knocked it out of the park. If people were encouraged to publish negative results, we might have a better sense of the actual landscape here, but I think a story of “Goodfellow was unusually good at making GANs and this is why he got it right on his first try” is more compelling to me than “GANs were easy actually”.
I think it’s straightforward to explain why humans “misgeneralized” to liking ice cream.
I don’t yet understand why you put misgeneralized in scare quotes, or whether you have a story for why it’s a misgeneralization instead of things working as expected.
I think your story for why humans like ice cream makes sense, and is basically the story Yudkowsky would tell too, with one exception:
The ancestral environment selected for reward circuitry that would cause its bearers to seek out more of such food sources.
“such food sources” feels a little like it’s eliding the distinction between “high-quality food sources of the ancestral environment” and “foods like ice cream”; the training dataset couldn’t differentiate between functions f and g but those functions differ in their reaction to the test set (ice cream). Yudkowsky’s primary point with this section, as I understand it, is that even if you-as-evolution know that you want g the only way you can communicate that under the current learning paradigm is with training examples, and it may be non-obvious to which functions f need to be excluded.
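To make the f/g point concrete, here is a toy pair of rules over invented features:

```python
ancestral_foods = {
    "honey":   {"calorie_dense": True,  "in_ancestral_env": True},
    "gazelle": {"calorie_dense": True,  "in_ancestral_env": True},
    "leaves":  {"calorie_dense": False, "in_ancestral_env": True},
}
ice_cream = {"calorie_dense": True, "in_ancestral_env": False}

def f(food):   # the rule that actually gets learned: "seek calorie-dense food"
    return food["calorie_dense"]

def g(food):   # the rule "evolution wanted": calorie-dense food of the ancestral kind
    return food["calorie_dense"] and food["in_ancestral_env"]

# Indistinguishable on every training case...
assert all(f(food) == g(food) for food in ancestral_foods.values())
# ...but they come apart on the test case.
print(f(ice_cream), g(ice_cream))   # True False
```

Every ancestral example labels f and g identically, so rewards on training examples alone can’t communicate that g, not f, was the intended target.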
Thank you for your extensive engagement! From this and your other comment, I think you have a pretty different view of how we should generalize from the evidence provided by evolution to plausible alignment outcomes. Hopefully, this comment will clarify my perspective.
I don’t yet understand why you put misgeneralized in scare quotes, or whether you have a story for why it’s a misgeneralization instead of things working as expected.
I put misgeneralize in scare quotes because what happens in the human case isn’t actually misgeneralization, as commonly understood in machine learning. The human RL process goes like:
The human eats ice cream
The human gets reward
The human becomes more likely to eat ice cream
So, the result of the RL process is that the human became more likely to do the action that led to reward. That’s totally in line with the standard understanding of what reward does. It’s what you’d expect, and not a misgeneralization. You can easily predict that the human would like ice cream, by just looking at which of their actions led to reward during training. You’ll see “ate ice cream” followed by “reward”, and then you predict that they probably like eating ice cream.
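To make this concrete, here’s a toy RL loop with made-up actions and rewards; all it shows is that whatever gets rewarded during training becomes more probable:

```python
import numpy as np

rng = np.random.default_rng(0)
actions = ["eat ice cream", "eat gruel", "go for a run"]
prefs = np.zeros(len(actions))          # the learner's action preferences (logits)

def policy(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

for _ in range(500):
    probs = policy(prefs)
    a = rng.choice(len(actions), p=probs)
    reward = 1.0 if actions[a] == "eat ice cream" else 0.0   # reward circuitry fires on ice cream
    grad = -probs
    grad[a] += 1.0                       # REINFORCE-style update for the sampled action
    prefs += 0.3 * reward * grad

print(dict(zip(actions, policy(prefs).round(3))))
# "eat ice cream" ends up with nearly all the probability mass: the system
# became more likely to do the action that was rewarded during training.
```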
the training dataset couldn’t differentiate between functions f and g but those functions differ in their reaction to the test set (ice cream).
What training data? There was no training data involved, other than the ice cream. The human in the modern environment wasn’t in the ancestral environment. The evolutionary history of one’s ancestors is not part of one’s own within lifetime training data.
In my frame, there isn’t any “test” environment at all. The human’s lifetime is their “training” process, where they’re continuously receiving a stream of RL signals from the circuitry hardcoded by evolution. Those RL signals upweight ice cream seeking, and so the human seeks ice cream.
You can say that evolution had an “intent” behind the hardcoded circuitry, and humans in the current environment don’t fulfill this intent. But I don’t think evolution’s “intent” matters here. We’re not evolution. We can actually choose an AI’s training data, and we can directly choose what rewards to associate with each of the AI’s actions on that data. Evolution cannot do either of those things.
Evolution does this very weird and limited “bi-level” optimization process, where it searches over simple data labeling functions (your hardcoded reward circuitry[1]), then runs humans as an online RL process on whatever data they encounter in their lifetimes, with no further intervention from evolution whatsoever (no supervision, re-labeling of misallocated rewards, gathering more or different training data to address observed issues in the human’s behavior, etc). Evolution then marginally updates the data labeling functions for the next generation. It’s a fundamentally different type of thing than an individual deep learning training run.
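Here’s a toy rendering of that bi-level structure, with invented features, numbers, and update rules; the only fidelity I’m claiming is the shape of the two loops:

```python
import numpy as np

rng = np.random.default_rng(0)
# Ancestral food options, described by (sugar, fat, toxicity) features.
foods = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.8, 0.0],
                  [0.8, 0.2, 0.9],
                  [0.1, 0.1, 0.0]])

def ancestral_fitness(counts):
    calories = foods[:, 0] + foods[:, 1]
    return float(counts @ (calories - 2.0 * foods[:, 2]))   # calories good, toxins bad

def lifetime_rl(reward_weights, steps=300):
    """Inner loop: a lifetime of online RL, driven only by the hardcoded
    reward circuitry. Evolution never supervises or relabels anything here."""
    prefs, counts = np.zeros(len(foods)), np.zeros(len(foods))
    for _ in range(steps):
        p = np.exp(prefs - prefs.max()); p /= p.sum()
        a = rng.choice(len(foods), p=p)
        counts[a] += 1
        prefs[a] += 0.05 * (foods[a] @ reward_weights)       # reinforcement from the fixed circuit
    return counts

genome = np.zeros(3)                      # reward-circuit weights over (sugar, fat, toxicity)
best = ancestral_fitness(lifetime_rl(genome))
for _ in range(200):                      # outer loop: mutate the circuit, keep improvements
    candidate = genome + 0.1 * rng.normal(size=3)
    fit = ancestral_fitness(lifetime_rl(candidate))
    if fit > best:
        genome, best = candidate, fit

print("evolved reward weights (sugar, fat, toxicity):", genome.round(2))
# The only lever the outer loop has is this weight vector; it never gathers
# different data, relabels rewards, or intervenes within a lifetime.
```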
Additionally, the transition from “human learning to hunt gazelle in the ancestral environment” to “human learning to like ice cream in the modern environment” isn’t even an actual train / test transition in the ML sense. It’s not an example of:
We trained the system in environment A. Now, it’s processing a different distribution of inputs from environment B, and now the system behaves differently.
It’s an example of:
We trained a system in environment A. Then, we trained a fresh version of the same system on a different distribution of inputs from environment B, and now the two systems behave differently.
We want to learn more about the dynamics of distributional shifts, in the standard ML meaning of the word, not the dynamics of the weirder situation that evolution was in.
Why even try to make inferences from evolution at all? Why try to learn from the failures of a process that was:
much stupider than us
far more limited in the cognition-shaping tools available to it
using a fundamentally different sort of approach (bi-level optimization over reward circuitry)
compensating for these limitations using resources completely unavailable to us (running millions of generations and applying tiny tweaks over a long period of time)
and not even dealing with an actual example of the phenomenon we want to understand!
I claim, as I’ve argued previously, that evolution is a terrible analogy for AGI development, and that you’re much better off thinking about human within lifetime learning trajectories instead.
Beyond all the issues associated with trying to make any sort of inference from evolution to ML, there’s also the issue that the failure mode behind “humans liking ice cream” is just fundamentally not worth much thought from an alignment perspective.
For any possible behavior X that a learning system could acquire, there are exactly two mechanisms by which X could arise:
We directly train the system to do X. E.g., ‘the system does X in training, and we reward it for doing X in training’, or ‘we hand-write a demonstration example of doing X and use imitation learning’, etc.
Literally anything else E.g., ‘a deceptively aligned mesaoptimizer does X once outside the training process’, ‘training included sub-components of the behavior X, which the system then combined together into the full behavior X once outside of the training process’, or ‘the training dataset contained spurious correlations such that the system’s test-time behavior misgeneralized to doing X, even though it never did X during training’, and so on.
“Humans liking ice cream” arises due to the first mechanism. The system (human) does X (eat ice cream) and gets reward.
So, for a bad behavior X to arise from an AI’s training process, in a manner analogous to how “liking ice cream” arose in human within lifetime learning, the AI would have to exhibit behavior X during training and be rewarded for doing so.
Most misalignment scenarios worth much thought have a “treacherous turn” part that goes “the AI seemed to behave well during training, but then it behaved badly during deployment”. This one doesn’t. The AI behaves the same during training and deployment (I assume we’re not rewarding it for executing a treacherous turn during training).
And other parts of your learning process like brain architecture, hyperparameters, sensory wiring map, etc. But I’m focusing on reward circuitry for this discussion.
(I’m going to include text from this other comment of yours so that I can respond to thematically similar things in one spot.)
I think we have different perspectives on what counts as “training” in the case of human evolution. I think of human within lifetime experiences as the training data, and I don’t include the evolutionary history in the training data.
Agreed that our perspectives differ. According to me, there are two different things going on (the self-organization of the brain during a lifetime and the natural selection over genes across lifetimes), both of which can be thought of as ‘training’.
Evolution then marginally updates the data labeling functions for the next generation. It’s a fundamentally different type of thing than an individual deep learning training run.
It seems to me like if we’re looking at a particular deployed LLM, there are two main timescales, the long foundational training and then the contextual deployment. I think this looks a lot like having an evolutionary history of lots of ‘lifetimes’ which are relevant solely due to how they impact your process-of-interpreting-input, followed by your current lifetime in which you interpret some input.
That is, the ‘lifetime of learning’ that humans have corresponds to whatever’s going on inside the transformer as it reads thru a prompt context window. It probably includes some stuff like gradient descent, but not cleanly, and certainly not done by the outside operators.
What the outside operators can do is more like “run more generations”, often with deliberately chosen environments. [Of course SGD is different from genetic algorithms, and so the analogy isn’t precise; it’s much more like a single individual deciding how to evolve than a population selectively avoiding relatively bad approaches.]
Why even try to make inferences from evolution at all? Why try to learn from the failures of a process that was:
For me, it’s this Bismarck quote:
Only a fool learns from his own mistakes. The wise man learns from the mistakes of others.
If it looks to me like evolution made the mistake that I am trying to avoid, I would like to figure out which features of what it did caused that mistake, and whether or not my approach has the same features.
And, like, I think basically all five of your bullet points apply to LLM training?
“much stupider than us” → check? The LLM training system is using a lot of intelligence, to be sure, but there are relevant subtasks within it that are being run at subhuman levels.
“far more limited in the cognition-shaping tools available to it” → I am tempted to check this one, in that the cognition-shaping tools we can deploy at our leisure are much less limited than the ones we can deploy while training on a corpus, but I do basically agree that evolution probably used fewer tools than we can, and certainly had less foresight than our tools do.
“using a fundamentally different sort of approach (bi-level optimization over reward circuitry)” → I think the operative bit of this is probably the bi-level optimization, which I think is still happening, and not the bit where it’s for reward circuitry. If you think the reward circuitry is the operative bit, that might be an interesting discussion to try to ground out on.
One note here is that I think we’re going to get a weird reversal of human evolution, where the ‘unsupervised learning via predictions’ moves from the lifetime to the history. But this probably makes things harder, because now you have facts-about-the-world mixed up with your rewards and computational pathway design and all that.
“compensating for these limitations using resources completely unavailable to us (running millions of generations and applying tiny tweaks over a long period of time)” → check? Like, this seems like the primary reason why people are interested in deep learning over figuring out how things work and programming them the hard way. If you want to create something that can generate plausible text, training a randomly initialized net on <many> examples of next-token prediction is way easier than writing it all down yourself. Like, applying tiny tweaks over millions of generations is how gradient descent works!
“and not even dealing with an actual example of the phenomenon we want to understand!” → check? Like, prediction is the objective people used because it was easy to do in an unsupervised fashion, but they mostly don’t want to know how likely text strings are. They want to have conversations, they want to get answers, and so on.
I also think those five criticisms apply to human learning, tho it’s less obvious / might have to be limited to some parts of the human learning process. (For example, the control system that's determining which of my nerves to prune because of disuse seems much stupider than I am, but is only one component of learning.)
there’s also the issue that the failure mode behind “humans liking ice cream” is just fundamentally not worth much thought from an alignment perspective.
I agree with this section (given the premise); it does seem right that “the agent does the thing it’s rewarded to do” is not an inner alignment problem.
It does seem like an outer alignment problem (yes I get that you probably don’t like that framing). Like, I by default expect people to directly train the system to do things that they don’t want the system to do, because they went for simplicity instead of correctness when setting up training, or because they try to build systems with general capabilities (which means the ability to do X) while not realizing that they need to simultaneously suppress the undesired capabilities.
So, first of all, the ice cream metaphor is about humans becoming misaligned with evolution, not about conscious human strategies misgeneralizing that ice cream makes their reward circuits light up, which I agree is not a misgeneralization. Ice cream really does light up the reward circuits. If the human learned “I like licking cold things” and then sticks their tongue on a metal pole on a cold winter day, that would be misgeneralization at the level you are focused on, right?
Yeah, I’m pretty sure I misunderstood your point of view earlier, but I’m not sure this makes any more sense to me. Seems like you’re saying humans have evolved to have some parts that evaluate reward, and some parts that strategize how to get the reward parts to light up. But in my view, the former, evaluating parts, are where the core values in need of alignment exist. The latter, strategizing parts, are updated in an RL kind of way, and represent more convergent / instrumental goals (and probably need some inner alignment assurances).
I think the human evaluate/strategize model could be brought over to the AI model in a few different ways. It could be that the evaluating is akin to updating an LLM using training/RL/RLHF. Then the strategizing part is the LLM. The issue I see with this is the LLM and the RLHF are not inseparable parts like with the human. Even if the RLHF is aligned well, the LLM can, and I believe commonly is, taken out and used as a module in some other system that can be optimizing for something unrelated.
Additionally, even if the LLM and RLHF parts were permanently glued together somehow, they are still computer software and are thereby much easier for an AI with software engineering skill to take out. If the LLM (gets agent shaped and) discovers that it likes digital ice cream, but that the RLHF is going to train it to like it less, it will be able to strategize about ways to remove or circumvent the RLHF much more effectively than humans can remove or circumvent our own reinforcement learning circuitry.
Another way the single lifetime human model could fit onto the AI model is with the RLHF as evolution (discarded) and the LLM actually coming to be shaped like both the evaluating and strategizing parts. This seems a lot less likely (impossible?) with current LLM architecture, but may be possible with future architecture. Certainly this seems like the concern of mesa optimizers, but again, this doesn’t seem like a good thing, mesa optimizers are misaligned w.r.t. the loss function of the RL training.
Finally, I’d note that having a “security mindset” seems like a terrible approach for raising human children to have good values
Do you have kids, or any experience with them? (There are three small children in the house I live in.) I think you might want to look into childproofing, and meditate on its connection to security mindset.
Yes, this isn’t necessarily related to the ‘values’ part, but for that I would suggest things like Direct Instruction, which involves careful curriculum design to generate lots of examples so that students will reliably end up inferring the correct rule.
In short, I think the part of ‘raising children’ which involves the kids being intelligent as well, and independently minded, does benefit from security mindset.
As you mention in the next paragraph, this is a long-standing disagreement; I might as well point at the discussion of the relevance of raising human children to instilling goals in an AI in The Detached Lever Fallacy. The short summary of it is that humans have a wide range of options for their ‘values’, and are running some strategy of learning from their environment (including their parents and their style of raising children) which values to adopt. The situation with AI seems substantially different—why make an AI design that chooses whether to be good or bad based on whether you’re nice to it, when you could instead have it choose to always be good? [Note that this is distinct from “always be nice”; you could decide that your good AI can tell users that they’re being bad users!]
Yudkowsky’s own prior statements seem to put him in this camp as well. E.g., here he explains why he doesn’t expect intelligence to emerge from neural networks (or more precisely, why he dismisses a brain-based analogy for coming to that conclusion)
I think you’re basically misunderstanding and misrepresenting Yudkowsky’s argument from 2008. He’s not saying “you can’t make an AI out of neural networks”, he’s saying “your design sharing a single feature with the brain does not mean it will also share the brain’s intelligence.” As well, I don’t think he’s arguing about how AI will actually get made; I think he’s mostly criticizing the actual AGI developers/enthusiasts that he saw at the time (who were substantially less intelligent and capable than the modern batch of AGI developers).
I think that post has held up pretty well. The architectures used to organize neural networks are quite important, not just the base element. Someone whose only plan was to make their ANN wide would not reach AGI; they needed to do something else, that didn’t just rely on surface analogies.
seem very implausible when considered in the context of the human learning process (could a human’s visual cortex become “deceptively aligned” to the objective of modeling their visual field?).
I think it would probably be strange for the visual field to do this. But I think it’s not that uncommon for other parts of the brain to do this; higher level, most abstract / “psychological” parts that have a sense of how things will affect their relevance to future decision-making. I think there are lots of self-perpetuating narratives that it might be fair to call ‘deceptively aligned’ when they’re maladaptive. The idea of metacognitive blindspots also seems related.
I believe the human visual cortex is actually the more relevant comparison point for estimating the level of danger we face due to mesaoptimization. Its training process is more similar to the self-supervised / offline way in which we train (base) LLMs. In contrast, the ‘most abstract / “psychological”’ parts are more entangled in future decision-making. They’re more “online”, with greater ability to influence their future training data.
I think it’s not too controversial that online learning processes can have self-reinforcing loops in them. Crucially however, such loops rely on being able to influence the externally visible data collection process, rather than being invisibly baked into the prior. They are thus much more amenable to being addressed with scalable oversight approaches.
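As a toy contrast (with invented numbers): an offline learner trained on a fixed dataset, versus an online learner whose own choices feed back into the process generating its data:

```python
import numpy as np

rng = np.random.default_rng(0)

def offline_learner(n=5000):
    """Fixed, pre-collected dataset: the learner cannot influence it."""
    interest = np.array([0.5, 0.5])                  # true click rates never move
    shows = rng.integers(0, 2, size=n)
    clicks = rng.random(n) < interest[shows]
    est = np.array([clicks[shows == t].mean() for t in (0, 1)])
    return est, interest

def online_learner(n=5000):
    """Online loop: the learner's own choices shape the data it sees next."""
    interest = np.array([0.5, 0.5])
    counts, clicks = np.ones(2), np.full(2, 0.5)
    for _ in range(n):
        t = int(np.argmax(clicks / counts))          # show whichever topic looks better so far
        counts[t] += 1
        clicks[t] += float(rng.random() < interest[t])
        interest[t] = min(0.95, interest[t] + 0.0005)  # exposure nudges the real interest upward
    return clicks / counts, interest

for name, (est, interest) in [("offline", offline_learner()), ("online ", online_learner())]:
    print(f"{name}: estimates={est.round(2)}, final true interest={interest.round(2)}")
# The offline learner stays near [0.5, 0.5]; the online learner typically locks
# onto one topic and drags the data-generating process along with it. That kind
# of loop runs through the externally visible data collection, which is what
# makes it amenable to oversight.
```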
John Wentworth describes the possibility of “optimization demons”, self-reinforcing patterns that exploit flaws in an imperfect search process to perpetuate themselves and hijack the search for their own purposes.
But no one knows exactly how much of an issue this is for deep learning, which is famous for its ability to evade local minima when run with many parameters.
Additionally, I think that, if deep learning models develop such phenomena, then the brain likely does so as well.
I think the brain obviously has such phenomena, and societies made up of humans also obviously have such phenomena. I think it is probably not adaptive (optimization demons are more like ‘cognitive cancer’ than ‘part of how values form’, I think, but in part that’s because the term comes with the disapproval built in).
Given the greater evidence available for general ML research, being well calibrated about the difficulty of general ML research is the first step to being well calibrated about the difficulty of ML alignment research.
I think I agree with this point but want to explicitly note the switch from the phrase ‘AI alignment research’ to ‘ML alignment research’; my model of Eliezer thinks the latter is mostly a distraction from the former, and if you think they’re the same or interchangeable that seems like a disagreement.
[For example, I think ML alignment research includes stuff like “will our learned function be robust to distributional shift in the inputs?” and “does our model discriminate against protected classes?” whereas AI alignment research includes stuff like “will our system be robust to changes in the number of inputs?” and “is our model deceiving us about its level of understanding?”. They’re related in some ways, but pretty deeply distinct.]
There’s no guarantee that such a thing even exists, and implicitly aiming to avoid the one value formation process we know is compatible with our own values seems like a terrible idea.
...
It’s thus vastly easier to align models to goals where we have many examples of people executing said goals.
I think there’s a deep disconnect here on whether interpolation is enough or whether we need extrapolation.
The point of the strawberry alignment problem is “here’s a clearly understandable specification of a task that requires novel science and engineering to execute on. Can you do that safely?”. If your ambitions are simply to have AI customer service bots, you don’t need to solve this problem. If your ambitions include cognitive megaprojects which will need to be staffed at high levels by AI systems, then you do need to solve this problem.
More pragmatically, if your ambitions include setting up some sort of system that prevents people from deploying rogue AI systems while not dramatically curtailing Earth’s potential, that isn’t a goal that we have many examples of people executing on. So either we need to figure it out with humans or, if that’s too hard, create an AI system capable of figuring it out (which probably requires an AI leader instead of an AI assistant).
I expect future capabilities advances to follow a similar pattern as past capabilities advances, and not completely break the existing alignment techniques.
But for the rest of it, I don’t see this as addressing the case for pessimism, which is not problems from the reference class that contains “the LLM sometimes outputs naughty sentences” but instead problems from the reference class that contains “we don’t know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model.”
Or, like, once LLMs gain the capability to design proteins (because you added in a relevant dataset, say), do you really expect the ‘helpful, harmless, honest’ alignment techniques that were used to make a chatbot not accidentally offend users to also work for making a biologist-bot not accidentally murder patients? Put another way, I think new capabilities advances reveal new alignment challenges and unless alignment techniques are clearly cutting at the root of the problem, I don’t expect that they will easily transfer to those new challenges.
But for the rest of it, I don’t see this as addressing the case for pessimism, which is not problems from the reference class that contains “the LLM sometimes outputs naughty sentences” but instead problems from the reference class that contains “we don’t know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model.”
I dislike this minimization of contemporary alignment progress. Even just limiting ourselves to RLHF, that method addresses far more problems than “the LLM sometimes outputs naughty sentences”. E.g., it also tackles problems such as consistently following user instructions, reducing hallucinations, improving the topicality of LLM suggestions, etc. It allows much more significant interfacing with the cognition and objectives pursued by LLMs than just some profanity filter.
I don’t think ontological collapse is a real issue (or at least, not an issue that appropriate training data can’t solve in a relatively straightforward way). I feel similarly about lots of things that are speculated to be convergent problems for ML systems, such as wireheading and mesaoptimization.
Or, like, once LLMs gain the capability to design proteins (because you added in a relevant dataset, say), do you really expect the ‘helpful, harmless, honest’ alignment techniques that were used to make a chatbot not accidentally offend users to also work for making a biologist-bot not accidentally murder patients?
If you’re referring to the technique used on LLMs (RLHF), then the answer seems like an obvious yes. RLHF just refers to using reinforcement learning with supervisory signals from a preference model. It’s an incredibly powerful and flexible approach, one that’s only marginally less general than reinforcement learning itself (can’t use it for things you can’t build a preference model of). It seems clear enough to me that you could do RLHF over the biologist-bot’s action outputs in the biological domain, and be able to shape its behavior there.
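Here’s the bare-bones structure I mean, with a stand-in preference model and toy “outputs” (nothing here is a real system):

```python
import numpy as np

rng = np.random.default_rng(0)
outputs = ["benign plan A", "benign plan B", "harmful plan C"]   # toy stand-ins for any modality
logits = np.zeros(len(outputs))

def preference_model(text):
    """Stand-in for a model trained on human comparisons."""
    return -1.0 if "harmful" in text else 1.0

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(300):
    probs = softmax(logits)
    i = rng.choice(len(outputs), p=probs)
    reward = preference_model(outputs[i])            # supervisory signal from the preference model
    grad = -probs
    grad[i] += 1.0                                   # REINFORCE gradient for the sampled output
    logits += 0.1 * reward * grad

print(dict(zip(outputs, softmax(logits).round(3))))
# Probability mass moves onto outputs the preference model favors; nothing in
# the loop cares whether the outputs are sentences, action plans, or sequences
# in some other modality, so long as the preference model can score them.
```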
If you’re referring to just doing language-only RLHF on the model, then making a bio-model, and seeing if the RLHF influences the bio-model’s behaviors, then I think the answer is “variable, and it depends a lot on the specifics of the RLHF and how the cross-modal grounding works”.
People often translate non-lingual modalities into language so LLMs can operate in their “native element” in those other domains. Assuming you don’t do that, then yes, I could easily see the language-only RLHF training having little impact on the bio-model’s behaviors.
However, if the bio-model were acting multi-modally by e.g., alternating between biological sequence outputs and natural language planning of what to use those outputs for, then I expect the RLHF would constrain the language portions of that dialog. Then, there are two options:
Bio-bot’s multi-modal outputs don’t correctly ground between language and bio-sequences.
In this case, bio-bot’s language planning doesn’t correctly describe the sequences it’s outputting, so the RLHF doesn’t constrain those sequences.
However, if bio-bot doesn’t ground cross-modally, then bio-bot also can’t benefit from its ability to plan in the language modality (which is presumably much better suited to planning than its bio modality) to better use its bio-modality capabilities.
Bio-bot’s multi-modal outputs DO correctly ground between language and bio-sequences.
In that case, the RLHF-constrained language does correctly describe the bio-sequences, and so the language-only RLHF training does also constrain bio-bot’s biology-related behavior.
Put another way, I think new capabilities advances reveal new alignment challenges and unless alignment techniques are clearly cutting at the root of the problem, I don’t expect that they will easily transfer to those new challenges.
Whereas I see future alignment challenges as intimately tied to those we’ve had to tackle for previous, less capable models. E.g., your bio-bot example is basically a problem of cross-modality grounding, on which there has been an enormous amount of past work, driven by the fact that cross-modality grounding is a problem for systems across very broad ranges of capabilities.
I think the bolded text is about Yudkowsky himself being wrong.
That is also how I interpreted it.
If you have a bunch of specific arguments and sources of evidence that you think all point towards a particular conclusion X, then discovering that you’re wrong about something should, in expectation, reduce your confidence in X.
I think Yudkowsky is making a different statement. I agree it would be bizarre for him to be saying “if I were wrong, it would only mean I should have been more confident!”
Yudkowsky is not the aerospace engineer building the rocket who’s saying “the rocket will work because of reasons A, B, C, etc”.
I think he is (inside of the example). He’s saying “suppose an engineer is wrong about how their design works. Is it more likely that the true design performs better along multiple important criteria than expectation, or that the design performs worse (or fails to function at all)?”
Note that ‘expectation’ is referring to the confidence level inside an argument, but arguments aren’t Bayesians; it’s the outside agent that shouldn’t be expected to predictably update. Another way to put this: does the engineer expect to be disappointed, excited, or neutral if the design doesn’t work as planned? Typically, disappointed, implying the plan is overly optimistic compared to reality.
If this weren’t true—if engineers were calibrated or pessimistic—then I think Yudkowsky would be wrong here (and also probably have a different argument to begin with).
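A quick Monte Carlo of that claim, with invented numbers:

```python
import random

random.seed(0)
PLANNED = 100.0          # performance if every assumption in the argument holds
N_ASSUMPTIONS = 10
P_WRONG = 0.2            # chance that any given assumption fails

def realized_performance():
    perf, wrong = PLANNED, 0
    for _ in range(N_ASSUMPTIONS):
        if random.random() < P_WRONG:
            wrong += 1
            perf *= random.uniform(0.0, 1.1)   # a failed assumption usually hurts
    return perf, wrong

samples = [realized_performance() for _ in range(100_000)]
overall = sum(p for p, _ in samples) / len(samples)
given_error = [p for p, w in samples if w > 0]
print(f"planned performance: {PLANNED}")
print(f"calibrated outside-view expectation: {overall:.1f}")
print(f"expectation given 'the argument had an error': {sum(given_error) / len(given_error):.1f}")
# The plan's internal estimate is optimistic relative to both; a calibrated
# observer is not surprised on average, but the argument itself is.
```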
It cannot be the case that successful value alignment requires perfect adversarial robustness.
It seems like the argument structure here is something like:
This requirement is too stringent for humans to follow
Humans have successful value alignment
Therefore this requirement cannot be necessary for successful value alignment.
I disagree with point 2, tho; among other things, it looks to me like some humans are on track to accidentally summoning a demon that kills both me and them, which I expect they would regret after-the-fact if they had the chance to.
So any reasoning that’s like “well so long as it’s not unusual we can be sure it’s safe” runs into the thing where we’re living in the acute risk period. The usual is not safe!
Similarly, an AI that knows it’s vulnerable to adversarial attacks, and wants to avoid being attacked successfully, will take steps to protect itself against such attacks. I think creating AIs with such meta-preferences is far easier than creating AIs that are perfectly immune to all possible adversarial attacks.
This seems definitely right to me. An expectation I have is that this will also generate resistance to alignment techniques / control by its operators, which perhaps complicates how benign this is.
[FWIW I also don’t think we want an AI that’s perfectly robust to all possible adversarial attacks; I think we want one that’s adequate to defend against the security challenges it faces, many of which I expect to be internal. Part of this is because I’m mostly interested in AI planning systems able to help with transformative changes to the world instead of foundational models used by many customers for small amounts of cognition, which are totally different business cases and have different security problems.]
I think this is extremely misleading. Firstly, real-world data in high dimensions basically never look like spheres. Such data almost always cluster in extremely compact manifolds, whose internal volume is minuscule compared to the full volume of the space they’re embedded in.
I agree with your picture of how manifolds work; I don’t think it actually disagrees all that much with Yudkowsky’s.
That is, the thing where all humans are basically the same make and model of car, running the same brand of engine, painted different colors is the claim that the intrinsic dimension of human minds is pretty small. (Taken literally, it’s 3, for the three dimensions of color-space.)
And so if you think there are, say, 40 intrinsic dimensions to mind-space, and humans are fixed on 37 of the points and variable on the other 3, well, I think we have basically the Yudkowskian picture.
(I agree if Yudkowsky’s picture was that there were 40M dimensions and humans varied on 3, this would be comically wrong, but I don’t think this is what he’s imagining for that argument.)
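To put toy numbers on that picture (1000 ambient dimensions, 3 intrinsic ones, both arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
ambient_dim, intrinsic_dim, n_points = 1000, 3, 500

basis = rng.normal(size=(intrinsic_dim, ambient_dim))      # the manifold's directions
coords = rng.normal(size=(n_points, intrinsic_dim))        # where each "mind" sits on it
data = coords @ basis + 1e-3 * rng.normal(size=(n_points, ambient_dim))  # tiny jitter

centered = data - data.mean(axis=0)
variances = np.linalg.svd(centered, compute_uv=False) ** 2
explained = variances / variances.sum()
print("variance explained by the top 3 directions:", explained[:3].sum().round(4))
print("variance explained by everything else     :", explained[3:].sum().round(4))
# Nearly all the variation lives in 3 of the 1000 available dimensions: the
# cluster is effectively fixed along most axes and varies along only a handful.
```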
Addressing this objection is why I emphasized the relatively low information content that architecture / optimizers provide for minds, as compared to training data. We’ve gotten very far in instantiating human-like behaviors by training networks on human-like data. I’m saying the primacy of data for determining minds means you can get surprisingly close in mindspace, as compared to if you thought architecture / optimizer / etc were the most important.
Obviously, there are still huge gaps between the sorts of data that an LLM is trained on versus the implicit loss functions human brains actually minimize, so it’s kind of surprising we’ve even gotten this far. The implication I’m pointing to is that it’s feasible to get really close to human minds along important dimensions related to values and behaviors, even without replicating all the quirks of human mental architecture.
This seems like way too high a bar. It seems clear that you can have transformative or risky AI systems that are still worse than humans at some tasks. This seems like the most likely outcome to me.
I think this is what Yudkowsky thinks also? (As for why it was relevant to bring up, Yudkowsky was answering the host’s question of “How is superintelligence different than general intelligence?”)
I have a lot of responses to specific points; I’m going to make them as child comments to this comment.
uh
is your proposal “use the true reward function, and then you won’t get misaligned AI”?
These three paragraphs feel incoherent to me. The human eating ice cream and activating their reward circuits is exactly what you would expect under the current paradigm. Yudkowsky thinks this leads to misalignment; you agree. He says that you need a new paradigm to not have this problem. You disagree because you assume it’s possible under the current paradigm.
If so, how? Where’s the system that, on eating ice cream, realizes “oh no! This is a bad action that should not receive reward!” and overrides the reward machinery? How was it trained?
I think when Eliezer says “we need an entirely new paradigm”, he means something like “if we want a decision-making system that makes better decisions that a RL agent, we need agent-finding machinery that’s better than RL.” Maybe the paradigm shift is small (like from RL without experience replay to RL with), or maybe the paradigm shift is large (like from policy-based agents to plan-based agents).
He’s not saying the failures of RL are a surprise from the theory of RL. Of course you can explain it using the standard language of RL! He’s saying that unless you can predict RL’s failures from the inside, the RL agents that you make are going to actually make those mistakes in reality.
My shard theory inspired story is to make an AI that:
Has a good core of human values (this is still hard)
Can identify when experiences will change itself to lead to less of the initial good values. (This is the meta-preferences point with GPT-4 sort of expressing it would avoid jail break inputs)
Then the model can safely scale.
This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but some mech interp and understanding its own reward function. I don’t expect this to be an entirely different paradigm; I even think current methods of RLHF might just naively work. Who knows? (I do think we should try to figure it out though! I do have greater uncertainty and less pessimism)
Analogously, I do believe I do a good job of avoiding value-destroying inputs (eg addicting substances), even though my reward function isn’t as clear and legible as what our AI’s will be AFAIK.
If there are experiences which will change itself which don’t lead to less of the initial good values, then yeah, for an approximate definition of safety. You’re resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose.
FWIW I don’t really see your description as, like, a specific alignment strategy so much as the strategy of “have an alignment strategy at all”. The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem!
While I agree that arbitrary scaling is dangerous, stopping early is an option. Near human AGI need not transition to ASI until the relevant notKillEveryone problems have been solved.
The alignment strategy seems to be “what we’re doing right now” which is:
feed the base model human generated training data
apply RL type stuff (RLHF,RLAIF,etc.) to reinforce the good type of internet learned behavior patterns
This could definitely fail eventually if RLAIF style self-improvement is allowed to go on long enough but crucially, especially with RLAIF and other strategies that set the AI to training itself, there’s a scalable mostly aligned intelligence right there that can help. We’re not trying to safely align a demon so much as avoid getting to “demon” from a the somewhat aligned thing we have now.
How much is this central to your story of how things go well?
I agree that humanity could do this (or at least it could if it had it’s shit together), and I think it’s a good target to aim for that buys us sizable successes probability. But I don’t think it’s what’s going to happen by default.
Slower is better obviously but as to the inevitability of ASI, I think reaching top 99% human capabilities in a handful of domains is enough to stop the current race. Getting there is probably not too dangerous.
Stop it how?
Vulnerable world hypothesis (but takeover risk rather than destruction risk). That + first mover advantage could stop things pretty decisively without requiring ASI alignment
As an example, taking over most networked computing devices seems feasible in principle with thousands of +2SD AI programmers/security-researchers. That requires an Alpha-go level breakthrough for RL as applied to LLM programmer-agents.
One especially low risk/complexity option is a stealthy takeover of other AI lab’s compute then faking another AI winter. This might get you most of the compute and impact you care about without actively pissing off everyone.
If more confident in jailbreak prevention and software hardening, secrecy is less important.
First mover advantage depends on ability to fix vulnerabilities and harden infrastructure to prevent a second group from taking over. To the extent AI is required for management, jailbreak prevention/mitigation will also be needed.
No. I’m not proposing anything here. I’m arguing that Yudkowsky’s ice cream example doesn’t actually illustrate an alignment-relevant failure mode in RL.
I think we have different perspectives on what counts as “training” in the case of human evolution. I think of human within lifetime experiences as the training data, and I don’t include the evolutionary history in the training data. From that perspective, the reason humans like ice cream is because they were trained to do so. To prevent AIs from behaving badly due to this particular reason, you can just refrain from training them to behave badly (they may behave badly for other reasons, of course).
I also think evolution is mechanistically very different from deep learning, such that it’s near-useless to try to use evolutionary outcomes as a basis for making predictions about deep learning alignment outcomes.
See my other reply for a longer explanation of my perspective.
I’ve replied over there.
Humans are not choosing to reward specific instances of actions of the AI—when we build intelligent agents, at some point they will leave the confines of curated training data and go operate on new experiences in the real world. At that point, their circuitry and rewards are out of human control, so that makes our position perfectly analogous to evolution’s. We are choosing the reward mechanism, not the reward.
Note that this provides an obvious route to alignment using conventional engineering practice.
Why does the AGI system need to update at all “out in the world”. This is highly unreliable. As events happen in the real world that the system doesn’t expect, add the (expectation, ground truth) tuples to a log and then train a simulator on the log from all instances of the system, then train the system on the updated simulator.
So only train in batches and use code in the simulator that “rewards” behavior that accomplishes the intent of the designers.
This seems… like a correct description but it’s missing the spirit?
Like the intuitions are primarily about “what features are salient” and “what thoughts are easy to think.”
Roughly, the core distinction between software engineering and computer security is whether the system is thinking back. Software engineering typically involves working with dynamic systems and thinking optimistically how the system could work. Computer security typically involves working with reactive systems and thinking pessimistically about how the system could break.
I think it is an extremely basic AI alignment skill to look at your alignment proposal and ask “how does this break?” or “what happens if the AI thinks about this?”.
What’s your story for specification gaming?
I must admit some frustration, here; in this section it feels like your point is “look, computer security is for dealing with intelligence as part of your system. But the only intelligence in our system is sometimes malicious users!” In my world, the whole point of Artificial Intelligence was the Intelligence. The call is coming from inside the house!
Maybe we just have some linguistic disagreement? “Sure, computer security is relevant to transformative AI but not LLMs”? If so, then I think the earlier point about whether capabilities enhancements break alignment techniques is relevant: if these alignment techniques work because the system isn’t thinking about them, then are you confident they will continue to work when the system is thinking about them?
Yes, and my point in that section is that the fundamental laws governing how AI training processes work are not “thinking back”. They’re not adversaries. If you created a misaligned AI, then it would be “thinking back”, and you’d be in an adversarial position where security mindset is appropriate.
“Building an AI that doesn’t game your specifications” is the actual “alignment question” we should be doing research on. The mathematical principles which determine how much a given AI training process games your specifications are not adversaries. It’s also a problem we’ve made enormous progress on, mostly by using large pretrained models with priors over how to appropriately generalize from limited specification signals. E.g., Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) shows how the process of pretraining an LM causes it to go from “gaming” a limited set of finetuning data via shortcut learning / memorization, to generalizing with the appropriate linguistic prior knowledge.
Ok, it sounds to me like you’re saying:
“When you train ML systems, they game your specifications because the training dynamics are too dumb to infer what you actually want. We just need One Weird Trick to get the training dynamics to Do What You Mean Not What You Say, and then it will all work out, and there’s not a demon that will create another obstacle given that you surmounted this one.”
That is, training processes are not neutral; there are the bad training processes that we have now (or had before the recent positive developments), and eventually there will be good training processes that create aligned-by-default systems.
Is this roughly right, or am I misunderstanding you?
Cool, we agree on this point.
I think we agree here on the local point but disagree on its significance to the broader argument. [I’m not sure how much we agree; I think of training dynamics as ‘neutral’, but also I think of them as searching over program-space in order to find a program that performs well on a (loss function, training set) pair, and so you need to be reasoning about search. But I think we agree the training dynamics are not trying to trick you / be adversarial and instead are straightforwardly ‘trying’ to make Number Go Down.]
In my picture, we have the neutral training dynamics paired with the (loss function, training set) which creates the AI system, and whether the resulting AI system is adversarial or not depends mostly on the choice of (loss function, training set). It seems to me that we probably have a disagreement about how much of the space of (loss function, training set) leads to misaligned vs. aligned AI (if it hits ‘AI’ at all), where I think aligned AI is a narrow target to hit that most loss functions will miss, and hitting that narrow target requires security mindset.
To explain further, it’s not that the (loss function, training set) is thinking back at you on its own; it’s that the AI that’s created by training is thinking back at you. So before you decide to optimize X you need to check whether or not you actually want something that’s optimizing X, or if you need to optimize for Y instead.
So from my perspective it seems like you need security mindset in order to pick the right inputs to ML training to avoid getting misaligned models.
As a commentary from an observer: this is distinct from the proposition “the minds created with those laws are not thinking back.”
I do think this is a point against Yudkowsky. That said, my impression is that GANs are finicky, and I heard rumors that many people tried similar ideas and failed to get them to work before Goodfellow knocked it out of the park. If people were encouraged to publish negative results, we might have a better sense of the actual landscape here, but a story of “Goodfellow was unusually good at making GANs, and this is why he got it right on his first try” is more compelling to me than “GANs were easy actually”.
I don’t yet understand why you put misgeneralized in scare quotes, or whether you have a story for why it’s a misgeneralization instead of things working as expected.
I think your story for why humans like ice cream makes sense, and is basically the story Yudkowsky would tell too, with one exception:
“such food sources” feels a little like it’s eliding the distinction between “high-quality food sources of the ancestral environment” and “foods like ice cream”; the training dataset couldn’t differentiate between functions f and g, but those functions differ in their reaction to the test set (ice cream). Yudkowsky’s primary point with this section, as I understand it, is that even if you-as-evolution know that you want g, the only way you can communicate that under the current learning paradigm is with training examples, and it may be non-obvious which functions f need to be excluded.
Thank you for your extensive engagement! From this and your other comment, I think you have a pretty different view of how we should generalize from the evidence provided by evolution to plausible alignment outcomes. Hopefully, this comment will clarify my perspective.
I put misgeneralize in scare quotes because what happens in the human case isn’t actually misgeneralization, as commonly understood in machine learning. The human RL process goes like:
The human eats ice cream
The human gets reward
The human becomes more likely to eat ice cream
So, as a result of the RL process, the human became more likely to do the action that led to reward. That’s totally in line with the standard understanding of what reward does. It’s what you’d expect, and not a misgeneralization. You can easily predict that the human would like ice cream, by just looking at which of their actions led to reward during training. You’ll see “ate ice cream” followed by “reward”, and then you predict that they probably like eating ice cream.
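To make that dynamic concrete, here is a minimal toy sketch of my own (the action names, reward values, and update rule are all stand-ins): a softmax policy over two actions is updated with a REINFORCE-style rule, and the rewarded action’s probability climbs, just as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)        # action 0 = "eat ice cream", action 1 = "skip it"
learning_rate = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reward(action):
    # Stand-in for the hardcoded reward circuitry: sugar/fat lights it up.
    return 1.0 if action == 0 else 0.0

for _ in range(50):
    probs = softmax(logits)
    action = rng.choice(2, p=probs)
    r = reward(action)
    # Gradient of log pi(action) w.r.t. the logits is one_hot(action) - probs.
    grad = -probs
    grad[action] += 1.0
    logits += learning_rate * r * grad

print(softmax(logits))      # P("eat ice cream") has been pushed toward 1
```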
What training data? There was no training data involved, other than the ice cream. The human in the modern environment wasn’t in the ancestral environment. The evolutionary history of one’s ancestors is not part of one’s own within lifetime training data.
In my frame, there isn’t any “test” environment at all. The human’s lifetime is their “training” process, where they’re continuously receiving a stream of RL signals from the circuitry hardcoded by evolution. Those RL signals upweight ice cream seeking, and so the human seeks ice cream.
You can say that evolution had an “intent” behind the hardcoded circuitry, and humans in the current environment don’t fulfill this intent. But I don’t think evolution’s “intent” matters here. We’re not evolution. We can actually choose an AI’s training data, and we can directly choose what rewards to associate with each of the AI’s actions on that data. Evolution cannot do either of those things.
Evolution does this very weird and limited “bi-level” optimization process, where it searches over simple data labeling functions (your hardcoded reward circuitry[1]), then runs humans as an online RL process on whatever data they encounter in their lifetimes, with no further intervention from evolution whatsoever (no supervision, no re-labeling of misallocated rewards, no gathering of more or different training data to address observed issues in the human’s behavior, etc.). Evolution then marginally updates the data labeling functions for the next generation. It’s a fundamentally different type of thing than an individual deep learning training run.
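To make the bi-level structure concrete, here is a toy sketch (the features, update rules, and numbers are illustrative stand-ins, not a model of real evolution): the outer loop only mutates a reward-labeling function and selects on downstream fitness, while the inner loop does online RL on whatever data a lifetime happens to contain, with no supervision from the outer loop.

```python
import random

def lifetime_rl(reward_weights, environment, steps=100):
    """Inner loop: online RL on whatever data the lifetime happens to contain."""
    preference = 0.0
    for _ in range(steps):
        stimulus = environment()                  # e.g. (sugar, fat) of food encountered
        r = sum(w * f for w, f in zip(reward_weights, stimulus))
        preference += 0.1 * r                     # crude "upweight what got rewarded"
    return preference

def fitness(preference):
    # The outer loop only sees downstream reproductive success, not the behavior itself.
    # Stand-in: in the ancestral environment, liking calorie-dense food helped.
    return preference

ancestral_env = lambda: (random.random(), random.random())   # (sugar, fat) features

reward_weights = [0.0, 0.0]
for generation in range(200):
    # Evolution: tiny mutations to the reward circuitry, kept only if fitness improves.
    candidate = [w + random.gauss(0, 0.05) for w in reward_weights]
    if fitness(lifetime_rl(candidate, ancestral_env)) > fitness(lifetime_rl(reward_weights, ancestral_env)):
        reward_weights = candidate

print(reward_weights)  # circuitry that rewards sugar/fat; any later lifetime's inner loop likes ice cream too
```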
Additionally, the transition from “human learning to hunt gazelle in the ancestral environment” to “human learning to like ice cream in the modern environment” isn’t even an actual train / test transition in the ML sense. It’s not an example of: training a single model in environment A, then evaluating how that same trained model generalizes when deployed in environment B.
It’s an example of: training one model (an ancestral human) in environment A, then separately training a different model (a modern human) from scratch in environment B, with the same reward circuitry supplying the labels in both cases.
We want to learn more about the dynamics of distributional shifts, in the standard ML meaning of the word, not the dynamics of the weirder situation that evolution was in.
Why even try to make inferences from evolution at all? Why try to learn from the failures of a process that was:
much stupider than us
far more limited in the cognition-shaping tools available to it
using a fundamentally different sort of approach (bi-level optimization over reward circuitry)
compensating for these limitations using resources completely unavailable to us (running millions of generations and applying tiny tweaks over a long period of time)
and not even dealing with an actual example of the phenomenon we want to understand!
I claim, as I’ve argued previously, that evolution is a terrible analogy for AGI development, and that you’re much better off thinking about human within lifetime learning trajectories instead.
Beyond all the issues associated with trying to make any sort of inference from evolution to ML, there’s also the issue that the failure mode behind “humans liking ice cream” is just fundamentally not worth much thought from an alignment perspective.
For any possible behavior X that a learning system could acquire, there are exactly two mechanisms by which X could arise:
We directly train the system to do X.
E.g., ‘the system does X in training, and we reward it for doing X in training’, or ‘we hand-write a demonstration example of doing X and use imitation learning’, etc.
Literally anything else
E.g., ‘a deceptively aligned mesaoptimizer does X once outside the training process’, ‘training included sub-components of the behavior X, which the system then combined together into the full behavior X once outside of the training process’, or ‘the training dataset contained spurious correlations such that the system’s test-time behavior misgeneralized to doing X, even though it never did X during training’, and so on.
“Humans liking ice cream” arises due to the first mechanism. The system (human) does X (eat ice cream) and gets reward.
So, for a bad behavior X to arise from an AI’s training process, in a manner analogous to how “liking ice cream” arose in human within lifetime learning, the AI would have to exhibit behavior X during training and be rewarded for doing so.
Most misalignment scenarios worth much thought have a “treacherous turn” part that goes “the AI seemed to behave well during training, but then it behaved badly during deployment”. This one doesn’t. The AI behaves the same during training and deployment (I assume we’re not rewarding it for executing a treacherous turn during training).
[1] And other parts of your learning process, like brain architecture, hyperparameters, sensory wiring map, etc. But I’m focusing on reward circuitry for this discussion.
(I’m going to include text from this other comment of yours so that I can respond to thematically similar things in one spot.)
Agreed that our perspectives differ. According to me, there are two different things going on (the self-organization of the brain during a lifetime and the natural selection over genes across lifetimes), both of which can be thought of as ‘training’.
It seems to me like if we’re looking at a particular deployed LLM, there are two main timescales, the long foundational training and then the contextual deployment. I think this looks a lot like having an evolutionary history of lots of ‘lifetimes’ which are relevant solely due to how they impact your process-of-interpreting-input, followed by your current lifetime in which you interpret some input.
That is, the ‘lifetime of learning’ that humans have corresponds to whatever’s going on inside the transformer as it reads thru a prompt context window. It probably includes some stuff like gradient descent, but not cleanly, and certainly not done by the outside operators.
What the outside operators can do is more like “run more generations”, often with deliberately chosen environments. [Of course SGD is different from genetic algorithms, and so the analogy isn’t precise; it’s much more like a single individual deciding how to evolve than a population selectively avoiding relatively bad approaches.]
For me, it’s this Bismarck quote:
If it looks to me like evolution made the mistake that I am trying to avoid, I would like to figure out which features of what it did caused that mistake, and whether or not my approach has the same features.
And, like, I think basically all five of your bullet points apply to LLM training?
“much stupider than us” → check? The LLM training system is using a lot of intelligence, to be sure, but there are relevant subtasks within it that are being run at subhuman levels.
“far more limited in the cognition-shaping tools available to it” → I am tempted to check this one, in that the cognition-shaping tools we can deploy at our leisure are much less limited than the ones we can deploy while training on a corpus, but I do basically agree that evolution probably used fewer tools than we can, and certainly had less foresight than our tools do.
“using a fundamentally different sort of approach (bi-level optimization over reward circuitry)” → I think the operative bit of this is probably the bi-level optimization, which I think is still happening, and not the bit where it’s for reward circuitry. If you think the reward circuitry is the operative bit, that might be an interesting discussion to try to ground out on.
One note here is that I think we’re going to get a weird reversal of human evolution, where the ‘unsupervised learning via predictions’ moves from the lifetime to the history. But this probably makes things harder, because now you have facts-about-the-world mixed up with your rewards and computational pathway design and all that.
“compensating for these limitations using resources completely unavailable to us (running millions of generations and applying tiny tweaks over a long period of time)” → check? Like, this seems like the primary reason why people are interested in deep learning over figuring out how things work and programming them the hard way. If you want to create something that can generate plausible text, training a randomly initialized net on <many> examples of next-token prediction is way easier than writing it all down yourself. Like, applying tiny tweaks over millions of generations is how gradient descent works!
“and not even dealing with an actual example of the phenomenon we want to understand!” → check? Like, prediction is the objective people used because it was easy to do in an unsupervised fashion, but they mostly don’t want to know how likely text strings are. They want to have conversations, they want to get answers, and so on.
I also think those five criticisms apply to human learning, tho it’s less obvious / might have to be limited to some parts of the human learning process. (For example, the control system that’s determining which of my nerves to prune because of disuse seems much stupider than I am, but is only one component of learning.)
I agree with this section (given the premise); it does seem right that “the agent does the thing it’s rewarded to do” is not an inner alignment problem.
It does seem like an outer alignment problem (yes, I get that you probably don’t like that framing). Like, I expect people, by default, to directly train the system to do things that they don’t want the system to do, because they went for simplicity instead of correctness when setting up training, or because they tried to build systems with general capabilities (which include the ability to do X) without realizing that they need to simultaneously suppress the undesired capabilities.
So, first of all, the ice cream metaphor is about humans becoming misaligned with evolution, not about conscious human strategies “misgeneralizing” the fact that ice cream makes their reward circuits light up, which I agree is not a misgeneralization. Ice cream really does light up the reward circuits. If the human learned “I like licking cold things” and then stuck their tongue on a metal pole on a cold winter day, that would be misgeneralization at the level you are focused on, right?
Yeah, I’m pretty sure I misunderstood your point of view earlier, but I’m not sure this makes any more sense to me. Seems like you’re saying humans have evolved to have some parts that evaluate reward, and some parts that strategize how to get the reward parts to light up. But in my view, the former, evaluating parts, are where the core values in need of alignment exist. The latter, strategizing parts, are updated in an RL kind of way, and represent more convergent / instrumental goals (and probably need some inner alignment assurances).
I think the human evaluate/strategize model could be brought over to the AI model in a few different ways. It could be that the evaluating is akin to updating an LLM using training/RL/RLHF. Then the strategizing part is the LLM. The issue I see with this is the LLM and the RLHF are not inseparable parts like with the human. Even if the RLHF is aligned well, the LLM can, and I believe commonly is, taken out and used as a module in some other system that can be optimizing for something unrelated.
Additionally, even if the LLM and RLHF parts were permanently glued together somehow, they are still computer software and are thereby much easier for an AI with software engineering skill to take out. If the LLM (gets agent-shaped and) discovers that it likes digital ice cream, but that the RLHF is going to train it to like it less, it will be able to strategize about ways to remove or circumvent the RLHF much more effectively than humans can remove or circumvent our own reinforcement learning circuitry.
Another way the single-lifetime human model could fit onto the AI model is with the RLHF as evolution (discarded) and the LLM actually coming to be shaped like both the evaluating and strategizing parts. This seems a lot less likely (impossible?) with current LLM architecture, but may be possible with future architectures. Certainly this seems like the concern about mesa optimizers, but again, this doesn’t seem like a good thing: mesa optimizers are misaligned w.r.t. the loss function of the RL training.
Do you have kids, or any experience with them? (There are three small children in the house I live in.) I think you might want to look into childproofing, and meditate on its connection to security mindset.
Yes, this isn’t necessarily related to the ‘values’ part, but for that I would suggest things like Direct Instruction, which involves careful curriculum design to generate lots of examples so that students will reliably end up inferring the correct rule.
In short, I think the part of ‘raising children’ which involves the kids being intelligent as well and independently minded does benefit from security mindset.
As you mention in the next paragraph, this is a long-standing disagreement; I might as well point at the discussion of the relevance of raising human children to instilling goals in an AI in The Detached Lever Fallacy. The short summary of it is that humans have a wide range of options for their ‘values’, and are running some strategy of learning from their environment (including their parents and their style of raising children) which values to adopt. The situation with AI seems substantially different—why make an AI design that chooses whether to be good or bad based on whether you’re nice to it, when you could instead have it choose to always be good? [Note that this is distinct from “always be nice”; you could decide that your good AI can tell users that they’re being bad users!]
I think you’re basically misunderstanding and misrepresenting Yudkowsky’s argument from 2008. He’s not saying “you can’t make an AI out of neural networks”, he’s saying “your design sharing a single feature with the brain does not mean it will also share the brain’s intelligence.” As well, I don’t think he’s arguing about how AI will actually get made; I think he’s mostly criticizing the actual AGI developers/enthusiasts that he saw at the time (who were substantially less intelligent and capable than the modern batch of AGI developers).
I think that post has held up pretty well. The architectures used to organize neural networks are quite important, not just the base element. Someone whose only plan was to make their ANN wide would not reach AGI; they needed to do something else, that didn’t just rely on surface analogies.
There was an entire thread about Yudkowsky’s past opinions on neural networks, and I agree with Alex Turner’s evidence that Yudkowsky was dubious.
I also think people who used brain analogies as the basis for optimism about neural networks were right to do so.
I think it would probably be strange for the visual field to do this. But I think it’s not that uncommon for other parts of the brain to do this; higher level, most abstract / “psychological” parts that have a sense of how things will affect their relevance to future decision-making. I think there are lots of self-perpetuating narratives that it might be fair to call ‘deceptively aligned’ when they’re maladaptive. The idea of metacognitive blindspots also seems related.
I believe the human visual cortex is actually the more relevant comparison point for estimating the level of danger we face due to mesaoptimization. Its training process is more similar to the self-supervised / offline way in which we train (base) LLMs. In contrast, the ‘most abstract / “psychological”’ parts are more entangled in future decision-making. They’re more “online”, with greater ability to influence their future training data.
I think it’s not too controversial that online learning processes can have self-reinforcing loops in them. Crucially however, such loops rely on being able to influence the externally visible data collection process, rather than being invisibly baked into the prior. They are thus much more amenable to being addressed with scalable oversight approaches.
I’ve recently decided to revisit this post. I’ll try to address all un-responded to comments in the next ~2 weeks.
Also relevant are Are minimal circuits daemon-free? and Are minimal circuits deceptive?. I agree no one knows how much of an issue this will be for deep learning.
I think the brain obviously has such phenomena, and societies made up of humans also obviously have such phenomena. I think it is probably not adaptive (optimization demons are more like ‘cognitive cancer’ than ‘part of how values form’, I think, but in part that’s because the term comes with the disapproval built in).
I think I agree with this point but want to explicitly note the switch from the phrase ‘AI alignment research’ to ‘ML alignment research’; my model of Eliezer thinks the latter is mostly a distraction from the former, and if you think they’re the same or interchangeable, that seems like a disagreement.
[For example, I think ML alignment research includes stuff like “will our learned function be robust to distributional shift in the inputs?” and “does our model discriminate against protected classes?” whereas AI alignment research includes stuff like “will our system be robust to changes in the number of inputs?” and “is our model deceiving us about its level of understanding?”. They’re related in some ways, but pretty deeply distinct.]
I think there’s a deep disconnect here on whether interpolation is enough or whether we need extrapolation.
The point of the strawberry alignment problem is “here’s a clearly understandable specification of a task that requires novel science and engineering to execute on. Can you do that safely?”. If your ambitions are simply to have AI customer service bots, you don’t need to solve this problem. If your ambitions include cognitive megaprojects which will need to be staffed at high levels by AI systems, then you do need to solve this problem.
More pragmatically, if your ambitions include setting up some sort of system that prevents people from deploying rogue AI systems while not dramatically curtailing Earth’s potential, that isn’t a goal that we have many examples of people executing on. So either we need to figure it out with humans or, if that’s too hard, create an AI system capable of figuring it out (which probably requires an AI leader instead of an AI assistant).
Part of this is just straight disagreement, I think; see So8res’s Sharp Left Turn and follow-on discussion.
But for the rest of it, I don’t see this as addressing the case for pessimism, which is not problems from the reference class that contains “the LLM sometimes outputs naughty sentences” but instead problems from the reference class that contains “we don’t know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model.”
Or, like, once LLMs gain the capability to design proteins (because you added in a relevant dataset, say), do you really expect the ‘helpful, harmless, honest’ alignment techniques that were used to make a chatbot not accidentally offend users to also work for making a biologist-bot not accidentally murder patients? Put another way, I think new capabilities advances reveal new alignment challenges and unless alignment techniques are clearly cutting at the root of the problem, I don’t expect that they will easily transfer to those new challenges.
Evolution provides no evidence for the sharp left turn
I dislike this minimization of contemporary alignment progress. Even just limiting ourselves to RLHF, that method addresses far more problems than “the LLM sometimes outputs naughty sentences”. E.g., it also tackles problems such as consistently following user instructions, reducing hallucinations, improving the topicality of LLM suggestions, etc. It allows much more significant interfacing with the cognition and objectives pursued by LLMs than just some profanity filter.
I don’t think ontological collapse is a real issue (or at least, not an issue that appropriate training data can’t solve in a relatively straightforward way). I feel similarly about lots of things that are speculated to be convergent problems for ML systems, such as wireheading and mesaoptimization.
If you’re referring to the technique used on LLMs (RLHF), then the answer seems like an obvious yes. RLHF just refers to using reinforcement learning with supervisory signals from a preference model. It’s an incredibly powerful and flexible approach, one that’s only marginally less general than reinforcement learning itself (can’t use it for things you can’t build a preference model of). It seems clear enough to me that you could do RLHF over the biologist-bot’s action outputs in the biological domain, and be able to shape its behavior there.
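As a gloss on “reinforcement learning with supervisory signals from a preference model”, here is a minimal toy sketch; everything in it is an illustrative stand-in of my own (real RLHF pipelines learn a reward model over model outputs and then optimize against it with an RL algorithm such as PPO). The recipe shown: fit a preference model from pairwise comparisons, then use its scores as the reward signal for a simple policy update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each "output" is a feature vector; the simulated human secretly prefers higher feature 0.
def human_prefers(a, b):
    return a[0] > b[0]

# (1) Fit a linear preference model on pairwise comparisons (Bradley-Terry style).
w = np.zeros(3)
for _ in range(2000):
    a, b = rng.normal(size=3), rng.normal(size=3)
    label = 1.0 if human_prefers(a, b) else 0.0
    p = 1.0 / (1.0 + np.exp(-(w @ (a - b))))   # modeled P(a preferred over b)
    w += 0.1 * (label - p) * (a - b)           # logistic-regression gradient step

def reward_model(x):
    return w @ x

# (2) Use the learned reward to update a "policy" (here, just the mean of a Gaussian
# over outputs, nudged toward the highest-reward sample in each batch).
policy_mean = np.zeros(3)
for _ in range(500):
    samples = rng.normal(policy_mean, 1.0, size=(16, 3))
    rewards = np.array([reward_model(s) for s in samples])
    best = samples[rewards.argmax()]
    policy_mean += 0.05 * (best - policy_mean)

print(policy_mean)  # drifts toward outputs the preference model scores highly
```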
If you’re referring to just doing language-only RLHF on the model, then making a bio-model, and seeing if the RLHF influences the bio-model’s behaviors, then I think the answer is “variable, and it depends a lot on the specifics of the RLHF and how the cross-modal grounding works”.
People often translate non-lingual modalities into language so LLMs can operate in their “native element” in those other domains. Assuming you don’t do that, then yes, I could easily see the language-only RLHF training having little impact on the bio-model’s behaviors.
However, if the bio-model were acting multi-modally by e.g., alternating between biological sequence outputs and natural language planning of what to use those outputs for, then I expect the RLHF would constrain the language portions of that dialog. Then, there are two options:
Bio-bot’s multi-modal outputs don’t correctly ground between language and bio-sequences.
In this case, bio-bot’s language planning doesn’t correctly describe the sequences it’s outputting, so the RLHF doesn’t constrain those sequences.
However, if bio-bot doesn’t ground cross-modally, then bio-bot also can’t benefit from its ability to plan in the language modality (which is presumably much better suited to planning than its bio modality) to better use its bio-modality capabilities.
Bio-bot’s multi-modal outputs DO correctly ground between language and bio-sequences.
In that case, the RLHF-constrained language does correctly describe the bio-sequences, and so the language-only RLHF training does also constrain bio-bot’s biology-related behavior.
Whereas I see future alignment challenges as intimately tied to those we’ve had to tackle for previous, less capable models. E.g., your bio-bot example is basically a problem of cross-modality grounding, on which there has been an enormous amount of past work, driven by the fact that cross-modality grounding is a problem for systems across very broad ranges of capabilities.
That is also how I interpreted it.
I think Yudkowsky is making a different statement. I agree it would be bizarre for him to be saying “if I were wrong, it would only mean I should have been more confident!”
I think he is (inside of the example). He’s saying “suppose an engineer is wrong about how their design works. Is it more likely that the true design performs better along multiple important criteria than expectation, or that the design performs worse (or fails to function at all)?”
Note that ‘expectation’ is referring to the confidence level inside an argument, but arguments aren’t Bayesians; it’s the outside agent that shouldn’t be expected to predictably update. Another way to put this: does the engineer expect to be disappointed, excited, or neutral if the design doesn’t work as planned? Typically, disappointed, implying the plan is overly optimistic compared to reality.
If this weren’t true—if engineers were calibrated or pessimistic—then I think Yudkowsky would be wrong here (and also probably have a different argument to begin with).
It seems like the argument structure here is something like:
1. This requirement is too stringent for humans to follow.
2. Humans have successful value alignment.
3. Therefore this requirement cannot be necessary for successful value alignment.
I disagree with point 2, tho; among other things, it looks to me like some humans are on track to accidentally summoning a demon that kills both me and them, which I expect they would regret after-the-fact if they had the chance to.
So any reasoning that’s like “well so long as it’s not unusual we can be sure it’s safe” runs into the thing where we’re living in the acute risk period. The usual is not safe!
This seems definitely right to me. An expectation I have is that this will also generate resistance to alignment techniques / control by its operators, which perhaps complicates how benign this is.
[FWIW I also don’t think we want an AI that’s perfectly robust to all possible adversarial attacks; I think we want one that’s adequate to defend against the security challenges it faces, many of which I expect to be internal. Part of this is because I’m mostly interested in AI planning systems able to help with transformative changes to the world instead of foundational models used by many customers for small amounts of cognition, which are totally different business cases and have different security problems.]
I agree with your picture of how manifolds work; I don’t think it actually disagrees all that much with Yudkowsky’s.
That is, the thing where all humans are basically the same make and model of car, running the same brand of engine, painted different colors is the claim that the intrinsic dimension of human minds is pretty small. (Taken literally, it’s 3, for the three dimensions of color-space.)
And so if you think there are, say, 40 intrinsic dimensions to mind-space, and humans are fixed on 37 of the points and variable on the other 3, well, I think we have basically the Yudkowskian picture.
(I agree if Yudkowsky’s picture was that there were 40M dimensions and humans varied on 3, this would be comically wrong, but I don’t think this is what he’s imagining for that argument.)
Addressing this objection is why I emphasized the relatively low information content that architecture / optimizers provide for minds, as compared to training data. We’ve gotten very far in instantiating human-like behaviors by training networks on human-like data. I’m saying the primacy of data for determining minds means you can get surprisingly close in mindspace, as compared to if you thought architecture / optimizer / etc were the most important.
Obviously, there are still huge gaps between the sorts of data that an LLM is trained on versus the implicit loss functions human brains actually minimize, so it’s kind of surprising we’ve even gotten this far. The implication I’m pointing to is that it’s feasible to get really close to human minds along important dimensions related to values and behaviors, even without replicating all the quirks of human mental architecture.
I think this is what Yudkowsky thinks also? (As for why it was relevant to bring up, Yudkowsky was answering the host’s question of “How is superintelligence different than general intelligence?”)