I think that alignment generalizes further than capabilities by default, contra you and Nate Soares, for these reasons:
Reading the linked post, it seems like the argument is that discrimination is generally easier than generation --> AIs will understand human values --> alignment generalizes further than capabilities. It’s the second step I dispute. I actually don’t know what people mean when they say capabilities will generalize further than alignment, but I know that basically all of those people expect the AGIs in question to understand human values quite well, thank you very much (and just not care). Can you say more about what you mean by alignment generalizing further than capabilities?
Good question.
What I mean by alignment generalizing further than capabilities is two things. First, it’s easier to get an AI that internalizes what a human values, and either has a pointer to the human’s goals or outright values what the human values internally, than it is to get an AI that is very, very capable in the sense often discussed on LW, like creating nanotech or fully automating AI research and robotics research. Second, it’s easier to generalize from old alignment situations to new, out-of-distribution alignment situations than it is to generalize from old tasks, like assisting humans with AI research, to new tasks, like fully automating all AI and robotics research.
Another way to say it: it’s easy to get an AI to value what we value, but much harder to get it to actually implement what we value in real life.
OK, thanks. So, I totally buy that e.g. pretrained GPT-4 has a pretty good understanding of concepts like honesty and morality and niceness, somewhere in its giant library of concepts it understands. The difficulty is getting it to both (a) be a powerful agent and (b) have its goals be all and only the concepts we want them to be, arranged in the structure we want them to be arranged in.
An obvious strategy is to use clever prompting, and maybe some other techniques like weak-to-strong generalization (W2SG), to get a reward model that looks at the agent’s outputs and classifies them by concepts like honesty, niceness, etc., and then use that reward model to train the agent. (See Appendix G of the W2SG paper for more on this plan.)
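Concretely, that strategy would look something like this sketch (everything here is a stand-in for illustration, not a real pipeline: `judge_model` is whatever prompted classifier or W2SG-trained reward model you’d actually use, and `sample` is the agent’s policy):

```python
# Sketch: prompt a "judge" model to act as a concept classifier / reward model,
# then use its scores to decide which agent outputs to reinforce.
# All model calls are stand-ins; swap in a real LLM API or a W2SG-trained reward model.

from typing import Callable, List

CONCEPTS = ["honesty", "niceness", "harm-avoidance"]

def judge_model(prompt: str) -> float:
    """Stand-in for a prompted classifier: returns a score in [0, 1]."""
    raise NotImplementedError("Replace with a real model call.")

def concept_reward(task: str, response: str) -> float:
    """Average the judge's per-concept scores into a single scalar reward."""
    scores = []
    for concept in CONCEPTS:
        prompt = (
            f"Rate from 0 to 1 how well this response exhibits {concept}.\n"
            f"Task: {task}\nResponse: {response}\nScore:"
        )
        scores.append(judge_model(prompt))
    return sum(scores) / len(scores)

def best_of_n_step(task: str, sample: Callable[[str], str], n: int = 8) -> str:
    """One reward-shaping step: sample n candidates, keep the highest-reward one.
    A real setup would instead feed the (response, reward) pairs into RLHF/PPO."""
    candidates: List[str] = [sample(task) for _ in range(n)]
    return max(candidates, key=lambda r: concept_reward(task, r))
```

The load-bearing assumption is that the judge’s concept scores stay trustworthy as the agent gets stronger than the judge, which is the part W2SG is supposed to help with.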
However, for various reasons I don’t expect this to work. But it might.
Are you basically saying: This’ll probably work?
(In general I like the builder-breaker dialectic in which someone proposes a fairly concrete AGI alignment scheme and someone else says how they think it might go wrong)
I’m not actually proposing this alignment plan.
My alignment plan would be (a rough sketch of the loop follows the steps):
Step 1: Create reasonably large datasets about human values that encode what humans value in a lot of situations.
Step 2: Use a reward model to curate the dataset and select the best human data to feed to the AI.
Step 3: Put in the curated data before it can start to conceive of deceptive alignment, to bias it towards human friendliness.
Step 4: Repeat until the training data is done.
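Here is that rough sketch, purely as an illustration of the loop; the reward model, the data sources, and the thresholds are all placeholders I’m assuming, not a worked-out pipeline:

```python
# Sketch of steps 1-4: collect or generate human-values data, score it with a
# reward model, and front-load the curated subset into the training stream early,
# before the later (more capable) stages of training.
# Every component here is a placeholder, not a real pipeline.

from typing import Iterable, List

def collect_values_data(n_examples: int) -> List[str]:
    """Step 1 stand-in: human-written or synthetic examples of valued behavior."""
    raise NotImplementedError("Replace with a real data source.")

def reward_model_score(example: str) -> float:
    """Step 2 stand-in: a reward model scoring how well an example reflects human values."""
    raise NotImplementedError("Replace with a real reward model.")

def curate(examples: Iterable[str], threshold: float = 0.8) -> List[str]:
    """Step 2: keep only the examples the reward model rates highly."""
    return [ex for ex in examples if reward_model_score(ex) >= threshold]

def build_training_stream(base_corpus: List[str], n_values: int = 10_000) -> List[str]:
    """Steps 3-4: put curated values data at the *start* of the stream (early in
    training, before the model could plausibly model deceptive alignment), then
    continue with the ordinary corpus."""
    curated = curate(collect_values_data(n_values))
    return curated + base_corpus
```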
I also like the direction of using chain-of-thought (CoT) interpretability on LLMs.
Cool, thanks!
How does this fit in to the rest of the training, the training that makes it an AGI? Is that separate, or is the idea that your AGI training dataset will be curated to contain e.g. only stories about AIs behaving ethically, and only demonstrations of AIs behaving ethically, in lots of diverse situations?
How do you measure ‘before it can start to conceive of deceptive alignment?’
How is this different from just “use HFDT” or “use RLHF/constitutional AI”?
I also like CoT interpretability.
Note that I talk more about how to align an AI in this comment:
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ
But the big difference from RLHF, and maybe constitutional AI, is that this is done during training itself, as opposed to being something you add in post-training.
Re this:
“or is the idea that your AGI training dataset will be curated to contain e.g. only stories about AIs behaving ethically, and only demonstrations of AIs behaving ethically, in lots of diverse situations?”
No, but I do want that kind of data to be at least 0.1-1% of the training dataset to make the plan work.
This definitely requires much better abilities to automate dataset creation, but I think that will be aided at least in part by capabilities work happening by default, because synthetic dataset creation is something the big labs desperately want in order to increase capabilities.
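As a toy illustration of hitting that 0.1-1% target, here’s a sketch (the 0.5% figure, the generator, and the function names are all assumptions for the sake of the example, not anything a lab actually does):

```python
# Toy sketch: hit a target fraction of values/ethics data in the overall training
# mix, padding with synthetic examples if the human-written pool is too small.
# Numbers and generators are illustrative only.

import random
from typing import Callable, List

def mix_with_values_data(
    base_corpus: List[str],
    values_pool: List[str],
    synth_generator: Callable[[], str],
    target_fraction: float = 0.005,  # 0.5%, inside the 0.1-1% range
    seed: int = 0,
) -> List[str]:
    """Return a shuffled corpus where ~target_fraction of examples are values data."""
    n_total = len(base_corpus)
    # Solve f = n_values / (n_total + n_values) for n_values.
    n_values_needed = int(target_fraction * n_total / (1 - target_fraction))
    values_data = list(values_pool[:n_values_needed])
    # If there isn't enough human-written data, fall back to synthetic generation,
    # which is the part that leans on the labs' synthetic-data capabilities work.
    while len(values_data) < n_values_needed:
        values_data.append(synth_generator())
    mixed = base_corpus + values_data
    random.Random(seed).shuffle(mixed)
    return mixed
```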
This is where I’d compare it to @RogerDearnaley’s “A Bitter Lesson” approach to alignment. While I disagree with Roger Dearnaley about what humans are like and about how complicated and fragile human values are (which influences my strategies), I’d say my plan is kind of a reinvention of his approach, but with more automation:
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1#A__Bitter_Lesson__Motivated_Approach_to_Alignment
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1#Adding_Minimal_Necessary_Complexity
This could also be framed in one of two ways: either we can tractably make the assumption below false via non-behavioral safety strategies, or the assumption below is probably false already, in that telling the truth would mostly lead to the optimal reward (because truth-telling is a far more robust and simple reward target than a lot of other choices, and the humans ranking the datasets are less biased than people think, since a lot of biases wash out more easily with data than people expected):
https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#While_humans_are_in_control__Alex_would_be_incentivized_to__play_the_training_game_
I also talk more about a plan to create aligned AI in this comment, so go check it out if you want more information:
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ