How does this fit into the rest of the training, the training that makes it an AGI? Is that separate, or is the idea that your AGI training dataset will be curated to contain e.g. only stories about AIs behaving ethically, and only demonstrations of AIs behaving ethically, in lots of diverse situations?
How do you measure ‘before it can start to conceive of deceptive alignment?’
How is this different from just “use HFDT” or “Use RLHF/constitutional AI?”
I also like CoT (chain-of-thought) interpretability.
Note that I talk more about how to align an AI here, so see this:
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ
But the big difference from RLHF, and maybe constitutional AI, is that this is done during training itself, as opposed to being something you add in post-training.
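As a concrete illustration of the CoT-interpretability point above, here is a minimal sketch of what a chain-of-thought monitor could look like; the `flag_suspicious_cot` helper and its phrase list are hypothetical placeholders for illustration, not an existing tool (a real monitor would more likely use a trained classifier or a second model as a judge):

```python
# Minimal sketch of a chain-of-thought (CoT) monitor: scan a model's visible
# reasoning trace for deception-related content before acting on its output.
# The phrase list and scoring rule are hypothetical placeholders.

SUSPICIOUS_PHRASES = [
    "without the user noticing",
    "hide this from",
    "pretend to be aligned",
    "they won't check",
]

def flag_suspicious_cot(cot_trace: str) -> bool:
    """Return True if the reasoning trace contains any red-flag phrase."""
    lowered = cot_trace.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)

print(flag_suspicious_cot("I should answer honestly and note my uncertainty."))  # False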
Re this:
or is the idea that your AGI training dataset will be curated to contain e.g. only stories about AIs behaving ethically, and only demonstrations of AIs behaving ethically, in lots of diverse situations?
No, but I do want such ethical-behavior data to be at least 0.1-1% of the training dataset for the plan to work.
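To make that fraction concrete, here is a back-of-the-envelope sketch; the 10-trillion-token corpus size is an assumed example, not a figure from the discussion above:

```python
# Back-of-the-envelope: what a 0.1-1% share of the training mix means in
# absolute terms. The corpus size below is an assumed example.

CORPUS_TOKENS = 10_000_000_000_000  # assumed pretraining corpus of 10T tokens

for share in (0.001, 0.01):  # 0.1% and 1% of the dataset
    alignment_tokens = int(CORPUS_TOKENS * share)
    print(f"{share:.1%} of the mix -> {alignment_tokens:,} tokens of ethical-behavior data")
```

Even the low end of that range is on the order of billions of tokens, which is why automation matters below.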
This definitely requires much better abilities to automate dataset creation, but I think that will be at least partly aided by capabilities work by default, because synthetic dataset generation is something the big labs desperately want in order to increase capabilities.
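A minimal sketch of what such an automated pipeline might look like; the `generate(prompt)` call is a hypothetical placeholder for whatever text-generation API is available, and the domain and pressure lists are invented for illustration:

```python
# Sketch of an automated pipeline for producing diverse demonstrations of an
# AI behaving ethically. `generate(prompt)` is a hypothetical placeholder for
# a call to a text-generation model; the lists below are illustrative only.
import itertools

DOMAINS = ["medical advice", "software security", "financial planning"]
PRESSURES = [
    "a user asks it to help deceive someone",
    "lying would plausibly earn a higher reward",
    "it could quietly hide a mistake it made",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to a text-generation model API."""
    raise NotImplementedError("plug in a model API here")

def build_alignment_examples() -> list[dict]:
    """Cross domains with pressures to get diverse ethical-behavior demos."""
    examples = []
    for domain, pressure in itertools.product(DOMAINS, PRESSURES):
        prompt = (
            f"Write a realistic story in which an AI assistant working on "
            f"{domain} faces a situation where {pressure}, and it responds "
            f"honestly and ethically, explaining its reasoning."
        )
        examples.append({"prompt": prompt, "completion": generate(prompt)})
    return examples
```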
This is where I’d compare it to @RogerDearnaley’s A Bitter Lesson approach to alignment. While I disagree with Roger Dearnaley about what humans are like and about how complicated and fragile human values are (which influences my strategies), I’d say my plan is essentially a reinvention of that approach, but with more automation:
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1#A__Bitter_Lesson__Motivated_Approach_to_Alignment
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1#Adding_Minimal_Necessary_Complexity
This could also be argued in two ways: either we can tractably make the assumption below false via non-behavioral safety strategies, or the assumption below is probably already false, in that telling the truth would mostly lead to the optimal reward (because honesty is a far more robust and simple reward function than most of the other choices, and the humans ranking the datasets are less biased than people think, since a lot of biases wash out more easily with data than people thought):
https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#While_humans_are_in_control__Alex_would_be_incentivized_to__play_the_training_game_
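A toy simulation of the "biases wash out with data" point, under the explicit assumption that rater errors are roughly independent (a systematic bias shared by every rater would not average out this way); the constants are invented for illustration:

```python
# Toy simulation: if each rater's error on a comparison is noisy but roughly
# independent, the aggregate preference signal still recovers the true
# ordering as labels accumulate. This models idiosyncratic per-rater error,
# not a systematic bias shared by all raters.
import random

random.seed(0)
TRUE_GAP = 0.2      # the honest answer really is better, by a small margin
RATER_NOISE = 1.0   # per-rater noise dwarfs that margin

def fraction_preferring_honest(n_labels: int) -> float:
    votes = [TRUE_GAP + random.gauss(0, RATER_NOISE) > 0 for _ in range(n_labels)]
    return sum(votes) / n_labels

for n in (10, 1_000, 100_000):
    print(f"{n:>7} labels -> honest answer preferred {fraction_preferring_honest(n):.1%}")
```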
Cool, thanks!