I’m not going to repeat all of the literature on debate here, but as brief pointers:
Factored cognition gives an intuitive discussion of why we can hope to approximate exponentially-sized trees of arguments (which would be tremendously bigger than any argument between people)
AI safety via debate makes the same argument for debate formally (by showing that a polynomial-time judge can supervise PSPACE; PSPACE-complete problems typically involve exponentially-sized trees; see the toy sketch at the end of this comment)
Cross-examination is discussed here
This paper discusses the experiments you’d do to figure out what the human judge should be doing to make debate more effective
The comments on this post discuss several reasons not to anchor to human institutions. There are even more reasons not to anchor to disagreements between people, but a short search didn't turn up a place where they've been written up. Most centrally, disagreements between people tend to focus on getting each side to understand the other's position, whereas the theoretical story for debate does not require this.
(Also, the “arbitrary amounts of time and arbitrary amounts of explanation” was pretty central to my claim; human disagreements are way more bounded than that.)
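To make the complexity claim above concrete: here is a toy sketch of the core mechanism, written for this comment rather than taken from any of the linked papers. The setup is made up (a full binary XOR tree whose leaf values the judge can check directly), and both debaters' strategies are inlined for brevity. The point is only that the judge's work scales with the tree's depth even though the tree itself is exponentially large: a debater defending a false claim about a node must misstate at least one of its children to stay consistent, so the judge can chase the disagreement down to a single checkable leaf.

```python
import random

random.seed(0)
DEPTH = 16  # the argument tree has 2**DEPTH leaves; the judge visits ~DEPTH nodes

def leaf_value(index: int) -> int:
    """Ground truth at a leaf -- cheap for the judge to check directly."""
    return (index * 2654435761) & 1  # arbitrary deterministic bit

def true_value(depth: int, index: int) -> int:
    """True value of a subtree (XOR of its leaves). Exponential work: this is
    what the *debaters* are assumed able to compute, never the judge."""
    if depth == 0:
        return leaf_value(index)
    return true_value(depth - 1, 2 * index) ^ true_value(depth - 1, 2 * index + 1)

def judge(honest_claim: int, liar_claim: int, depth: int, index: int) -> str:
    """Resolve a disputed root claim with DEPTH rounds plus one leaf check.
    The honest debater reports true child values; the liar must report child
    values consistent with its (false) parent claim, so at least one is false."""
    if depth == 0:
        return "honest" if honest_claim == leaf_value(index) else "liar"
    h_left = true_value(depth - 1, 2 * index)
    h_right = true_value(depth - 1, 2 * index + 1)
    l_left = h_left ^ 1 if random.random() < 0.5 else h_left  # liar picks a child to misstate
    l_right = liar_claim ^ l_left  # forced by consistency with the parent claim
    if h_left != l_left:  # recurse wherever the debaters disagree
        return judge(h_left, l_left, depth - 1, 2 * index)
    return judge(h_right, l_right, depth - 1, 2 * index + 1)

truth = true_value(DEPTH, 0)
print(judge(truth, truth ^ 1, DEPTH, 0))  # -> "honest", on every run
```

The honest debater's dominance here leans entirely on the consistency requirement (child claims must combine to the parent claim), which is the same structural trick the PSPACE argument uses; the real protocols of course have to handle much messier claims than XOR over a binary tree.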
The scope of our argument seems to have grown beyond what a single comment thread is suitable for.
AI safety via debate came out two years before Writeup: Progress on AI Safety via Debate, so the latter post should be more up-to-date. I think that post does a good job of considering potential problems; the issue is that I think the noted problems and assumptions can't be handled well, that they make the approach very limited in what it can do for alignment, and that they aren't really dealt with by “Doubly-efficient debate”. I don't think such debate protocols are totally useless, but they're certainly not a “solution to alignment”.