How does this fit into the rest of the training, the training that makes it an AGI? Is that separate, or is the idea that your AGI training dataset will be curated to contain e.g. only stories about AIs behaving ethically, and only demonstrations of AIs behaving ethically, in lots of diverse situations?
How do you measure ‘before it can start to conceive of deceptive alignment?’
How is this different from just “use HFDT” or “Use RLHF/constitutional AI?”
I also like CoT (chain-of-thought) interpretability.
Note that I talk more about how to align an AI here, so see this:
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ
But the big difference from RLHF, and maybe constitutional AI, is that this is done during training itself, as opposed to being something you add in post-training.
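As a concrete illustration of the CoT-interpretability point above, here is a minimal sketch of what a chain-of-thought monitor could look like; the `flag_suspicious_cot` helper and its phrase list are hypothetical placeholders for illustration, not an existing tool (a real monitor would more likely use a trained classifier or a second model as a judge):

```python
# Minimal sketch of a chain-of-thought (CoT) monitor: scan a model's visible
# reasoning trace for deception-related content before acting on its output.
# The phrase list and scoring rule are hypothetical placeholders.

SUSPICIOUS_PHRASES = [
    "without the user noticing",
    "hide this from",
    "pretend to be aligned",
    "they won't check",
]

def flag_suspicious_cot(cot_trace: str) -> bool:
    """Return True if the reasoning trace contains any red-flag phrase."""
    lowered = cot_trace.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)

print(flag_suspicious_cot("I should answer honestly and note my uncertainty."))  # False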
Re this:
or is the idea that your AGI training dataset will be curated to contain e.g. only stories about AIs behaving ethically, and only demonstrations of AIs behaving ethically, in lots of diverse situations?
No, but I do want such ethical-behavior data to be at least 0.1-1% of the training dataset for the plan to work.
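To make that fraction concrete, here is a back-of-the-envelope sketch; the 10-trillion-token corpus size is an assumed example, not a figure from the discussion above:

```python
# Back-of-the-envelope: what a 0.1-1% share of the training mix means in
# absolute terms. The corpus size below is an assumed example.

CORPUS_TOKENS = 10_000_000_000_000  # assumed pretraining corpus of 10T tokens

for share in (0.001, 0.01):  # 0.1% and 1% of the dataset
    alignment_tokens = int(CORPUS_TOKENS * share)
    print(f"{share:.1%} of the mix -> {alignment_tokens:,} tokens of ethical-behavior data")
```

Even the low end of that range is on the order of billions of tokens, which is why automation matters below.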
This definitely requires much better abilities to automate dataset creation, but I think that will be at least partly aided by capabilities work by default, because synthetic dataset generation is something the big labs desperately want in order to increase capabilities.
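A minimal sketch of what such an automated pipeline might look like; the `generate(prompt)` call is a hypothetical placeholder for whatever text-generation API is available, and the domain and pressure lists are invented for illustration:

```python
# Sketch of an automated pipeline for producing diverse demonstrations of an
# AI behaving ethically. `generate(prompt)` is a hypothetical placeholder for
# a call to a text-generation model; the lists below are illustrative only.
import itertools

DOMAINS = ["medical advice", "software security", "financial planning"]
PRESSURES = [
    "a user asks it to help deceive someone",
    "lying would plausibly earn a higher reward",
    "it could quietly hide a mistake it made",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to a text-generation model API."""
    raise NotImplementedError("plug in a model API here")

def build_alignment_examples() -> list[dict]:
    """Cross domains with pressures to get diverse ethical-behavior demos."""
    examples = []
    for domain, pressure in itertools.product(DOMAINS, PRESSURES):
        prompt = (
            f"Write a realistic story in which an AI assistant working on "
            f"{domain} faces a situation where {pressure}, and it responds "
            f"honestly and ethically, explaining its reasoning."
        )
        examples.append({"prompt": prompt, "completion": generate(prompt)})
    return examples
```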
This is where I’d compare it to @RogerDearnaley’s A Bitter Lesson approach to alignment. While I disagree with Roger Dearnaley about what humans are like and about how complicated and fragile human values are (which influences my strategies), I’d say my plan is essentially a reinvention of that approach, but with more automation:
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1#A__Bitter_Lesson__Motivated_Approach_to_Alignment
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1#Adding_Minimal_Necessary_Complexity
This could also be argued in two ways: either we can tractably make the assumption below false via non-behavioral safety strategies, or the assumption below is probably already false, in that telling the truth would mostly lead to the optimal reward (because honesty is a far more robust and simple reward function than most of the other choices, and the humans ranking the datasets are less biased than people think, since a lot of biases wash out more easily with data than people thought):
https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#While_humans_are_in_control__Alex_would_be_incentivized_to__play_the_training_game_
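A toy simulation of the "biases wash out with data" point, under the explicit assumption that rater errors are roughly independent (a systematic bias shared by every rater would not average out this way); the constants are invented for illustration:

```python
# Toy simulation: if each rater's error on a comparison is noisy but roughly
# independent, the aggregate preference signal still recovers the true
# ordering as labels accumulate. This models idiosyncratic per-rater error,
# not a systematic bias shared by all raters.
import random

random.seed(0)
TRUE_GAP = 0.2      # the honest answer really is better, by a small margin
RATER_NOISE = 1.0   # per-rater noise dwarfs that margin

def fraction_preferring_honest(n_labels: int) -> float:
    votes = [TRUE_GAP + random.gauss(0, RATER_NOISE) > 0 for _ in range(n_labels)]
    return sum(votes) / n_labels

for n in (10, 1_000, 100_000):
    print(f"{n:>7} labels -> honest answer preferred {fraction_preferring_honest(n):.1%}")
```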
Cool, thanks!