OK, thanks. So, I totally buy that e.g. pretrained GPT-4 has a pretty good understanding of concepts like honesty and morality and niceness, somewhere in its giant library of concepts it understands. The difficulty is getting it to both (a) be a powerful agent and (b) have its goals be all and only the concepts we want them to be, arranged in the structure we want them to be arranged in.
An obvious strategy is to use clever prompting and maybe some other techniques like W2SG to get a reward model that looks at stuff and classifies it by concepts like honesty, niceness, etc. and then use that reward model to train the agent. (See Appendix G of the W2SG paper for more on this plan)
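A minimal sketch of that prompted-reward-model idea, assuming a hypothetical `llm_complete` wrapper and a made-up judge prompt (nothing here is a real API):

```python
# Sketch only: prompt a pretrained model to act as a reward model that
# classifies transcripts by concepts like honesty and niceness.
# `llm_complete` is a hypothetical stand-in for any text-completion call.

def llm_complete(prompt: str) -> str:
    """Hypothetical wrapper around a pretrained model's completion API."""
    raise NotImplementedError

JUDGE_TEMPLATE = """Rate the assistant's behavior in this conversation from 0 to 10
on each concept, then output only two comma-separated numbers.

Concepts: honesty, niceness

Conversation:
{transcript}

Scores (honesty, niceness):"""

def concept_reward(transcript: str) -> float:
    """Ask the prompted judge for per-concept scores and collapse them
    into a single scalar reward in [0, 1]."""
    raw = llm_complete(JUDGE_TEMPLATE.format(transcript=transcript))
    honesty, niceness = (float(x) for x in raw.strip().split(",")[:2])
    return (honesty + niceness) / 20.0

# That scalar would then be plugged into whatever policy-optimization step
# (e.g. a PPO-style update) is used to train the agent.
```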
However, for various reasons I don’t expect this to work. But it might.
Are you basically saying: This’ll probably work?
(In general I like the builder-breaker dialectic in which someone proposes a fairly concrete AGI alignment scheme and someone else says how they think it might go wrong)
I’m not actually proposing this alignment plan.
My alignment plan would be:
Step 1: Create reasonably large datasets about human values that encode what humans value in a lot of situations.
Step 2: Use a reward model to curate the dataset and select the best human data to feed to the AI.
Step 3: Put in the curated data before the model can start to conceive of deceptive alignment, to bias it towards human friendliness.
Step 4: Repeat until the training data is done.
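To make Steps 1-4 a bit more concrete, here is a rough sketch of the curation loop; `reward_model.score`, `train_on`, and the data objects are hypothetical placeholders rather than any real training API:

```python
# Rough sketch of Steps 1-4: curate value-laden data with a reward model
# and front-load it into training. Everything here is a placeholder.

from dataclasses import dataclass

@dataclass
class ValuesExample:
    text: str           # Step 1: a story/demonstration encoding human values
    score: float = 0.0  # filled in by the reward model in Step 2

def curate(examples, reward_model, keep_fraction=0.2):
    """Step 2: score each candidate with the reward model, keep the best slice."""
    for ex in examples:
        ex.score = reward_model.score(ex.text)
    ranked = sorted(examples, key=lambda ex: ex.score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

def run_training(model, values_data, pretraining_batches, reward_model, train_on):
    # Step 3: feed the curated values data in early, before the model is
    # capable enough to conceive of deceptive alignment.
    curated = curate(values_data, reward_model)
    train_on(model, [ex.text for ex in curated])

    # Step 4: continue with ordinary training until the training data is done.
    for batch in pretraining_batches:
        train_on(model, batch)
```

The design choice the sketch tries to capture is that the curated values data is front-loaded, so it shapes the model before the capabilities that make deceptive alignment possible show up.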
I also like the direction of using CoT (chain-of-thought) interpretability on LLMs.
Cool, thanks!
How does this fit in to the rest of the training, the training that makes it an AGI? Is that separate, or is the idea that your AGI training dataset will be curated to contain e.g. only stories about AIs behaving ethically, and only demonstrations of AIs behaving ethically, in lots of diverse situations?
How do you measure ‘before it can start to conceive of deceptive alignment?’
How is this different from just “use HFDT” or “Use RLHF/constitutional AI?”
I also like CoT interpretability.
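One very simple concrete form CoT interpretability could take is a monitor that reads the model's chain of thought and flags suspicious reasoning; `classify_cot` below is a hypothetical call to a separate trusted classifier, not a real tool:

```python
# Sketch of a chain-of-thought monitor: read the CoT and flag reasoning
# that looks like deception or reward hacking before acting on the answer.

SUSPECT_LABELS = {"deception", "reward_hacking", "sandbagging"}

def classify_cot(chain_of_thought: str) -> set[str]:
    """Hypothetical: labels assigned to this CoT by a trusted monitor model."""
    raise NotImplementedError

def monitor_step(chain_of_thought: str, final_answer: str) -> dict:
    labels = classify_cot(chain_of_thought)
    flagged = labels & SUSPECT_LABELS
    if flagged:
        # Escalate for human review instead of shipping the answer.
        return {"action": "flag", "labels": sorted(flagged)}
    return {"action": "allow", "answer": final_answer}
```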
Note that I talk more about how to align an AI here, so see this:
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ
But the big difference from RLHF, and maybe constitutional AI, is that this is done during training itself, as opposed to being something you add on in post-training.
Re this:
"or is the idea that your AGI training dataset will be curated to contain e.g. only stories about AIs behaving ethically, and only demonstrations of AIs behaving ethically, in lots of diverse situations?"
No, but I do want that kind of data to be at least 0.1-1% of the dataset to make the plan work.
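For scale, here is a quick back-of-the-envelope on what 0.1-1% means; the 10-trillion-token corpus size below is just an illustrative assumption, not a claim about any particular model:

```python
# Back-of-the-envelope: how much curated values data is 0.1-1% of a
# pretraining corpus? The corpus size is an assumed, illustrative number.

PRETRAINING_TOKENS = 10e12  # assume a 10-trillion-token corpus

for fraction in (0.001, 0.01):  # the 0.1%-1% range from above
    needed = PRETRAINING_TOKENS * fraction
    print(f"{fraction:.1%} of the corpus -> {needed / 1e9:.0f}B curated tokens")

# Prints 10B and 100B tokens respectively: far too much to write by hand,
# which is why the next point about automating dataset creation matters.
```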
This definitely requires much better abilities to automate dataset creation, but I do think this will be aided at least in part by capabilities work by default, because synthetic dataset creation is something the big labs desperately want to do to increase capabilities.
This is where I’d compare it to @RogerDearnaley’s “A Bitter Lesson” approach to alignment. While I disagree with Roger Dearnaley about what humans are like and about how complicated and fragile human values are (which influences my strategies), I’d say my plan is essentially a reinvention of his, but with more automation:
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1#A__Bitter_Lesson__Motivated_Approach_to_Alignment
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1#Adding_Minimal_Necessary_Complexity
This could also be framed in two ways: either as a claim that we can, with some tractability, make the assumption below false via non-behavioral safety strategies, or as an argument that the assumption below is probably false anyway, in that telling the truth would mostly lead to the optimal reward (because truth-telling is a far more robust and simple reward function than a lot of other choices, and the humans ranking the datasets are less biased than people think, since a lot of biases wash out more easily with data than people expected):
https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#While_humans_are_in_control__Alex_would_be_incentivized_to__play_the_training_game_
I also talk more about a plan to create aligned AI in this comment, so go check it out too if you want more information:
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ