I found this post very helpful, thanks! If I find time to try to form a more gears-level independent impression about alignment difficulty and possible alignment solutions, I’ll use this as my jumping-off point.
Separately, I think it would be cool if a bunch of people got together and played this game for a while and wrote up the results:
like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there’s a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you’re forced into exotic and unlikely training data, and you win if i’m either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.
Do you want to try playing this game together sometime?
Yes! Which side do you want to be on? Want to do it in person, or in this comment thread?
Or you guys could find a 1-2 hour window to show up and live-chat in a LW dialogue, then publish the results :-)