I found this post very helpful, thanks! If I find time to try to form a more gears-level independent impression about alignment difficulty and possible alignment solutions, I’ll use this as my jumping-off point.
Separately, I think it would be cool if a bunch of people got together and played this game for a while and wrote up the results:
like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there’s a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you’re forced into exotic and unlikely training data, and you win if i’m either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.
Do you want to try playing this game together sometime?
Yes! Which side do you want to be on? Want to do it in person, or in this comment thread?
Or you guys could find a 1-2 hour window to show up and live-chat in a LW dialogue, then publish the results :-)