I strongly disagree with the words “we train our kids”. I think kids learn via within-lifetime RL, where the reward function is installed by evolution inside the kid’s own brain. Parents and friends are characters in the kid’s training environment, but that’s very different from the way that “we train” a neural network, and very different from RLHF.
What does “Parents and friends are characters in the kid’s training environment” mean? Here’s an example. In principle, I could hire a bunch of human Go players on MTurk (for reward-shaping purposes we’d include MTurkers at every skill level, from people who have never played before all the way up to experts) and make a variant of AlphaZero that has no self-play at all: it’s 100% trained on play against humans, but is otherwise the same as the original AlphaZero. Then we can say “The MTurkers are part of the AlphaZero training environment”, but it would be very misleading to say “the MTurkers trained the AlphaZero model”. The MTurkers are certainly affecting the model, but the model is not imitating the MTurkers, nor is it doing what the MTurkers want, nor is it listening to the MTurkers’ advice. Instead, the model is learning to exploit weaknesses in the MTurkers’ play, including via weird out-of-the-box strategies that would never have occurred to the MTurkers themselves.
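To make that concrete, here’s a toy, self-contained sketch of the same structure (my own illustration: Nim stands in for Go, a scripted always-take-one heuristic stands in for the MTurkers, and the hyperparameters are arbitrary). The opponent only ever supplies moves inside the environment, the reward is just the game outcome computed by the rules, and the trained policy ends up exploiting the opponent’s weaknesses rather than imitating it or deferring to it.

```python
# Toy, self-contained stand-in for the AlphaZero-MTurk setup (my own
# illustration): Nim instead of Go, a scripted heuristic instead of MTurkers,
# tabular Monte-Carlo RL instead of AlphaZero's network + search.

import random
from collections import defaultdict

TAKE = [1, 2, 3]     # legal moves: remove 1-3 stones
START = 10           # stones at the start; whoever takes the last stone wins

def legal(stones):
    return [a for a in TAKE if a <= stones]

def opponent_move(stones):
    """The 'human' opponent: a weak fixed heuristic (always take one stone).
    It never provides labels, preferences, or advice -- only moves."""
    return 1

def train(episodes=20000, eps=0.1, lr=0.1):
    Q = defaultdict(float)                           # Q[(stones_left, action)]
    for _ in range(episodes):
        stones, agent_moves = START, []
        while True:
            # --- agent's turn ---
            acts = legal(stones)
            a = random.choice(acts) if random.random() < eps \
                else max(acts, key=lambda x: Q[(stones, x)])
            agent_moves.append((stones, a))
            stones -= a
            if stones == 0:                          # agent took the last stone
                reward = 1.0
                break
            # --- opponent's turn: just part of the environment ---
            stones -= opponent_move(stones)
            if stones == 0:                          # opponent took the last stone
                reward = -1.0
                break
        # Reward comes from the game outcome, not from the opponent's judgment.
        for s, a in agent_moves:                     # every-visit Monte-Carlo update
            Q[(s, a)] += lr * (reward - Q[(s, a)])
    return Q

Q = train()
for stones in range(1, START + 1):
    best = max(legal(stones), key=lambda a: Q[(stones, a)])
    print(f"{stones:2d} stones left -> agent takes {best}")
```

The printed policy is not an imitation of the always-take-one opponent; it is whatever happens to beat it.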
When you think “parents and friends are characters in the kid’s training environment”, I claim that this AlphaZero-MTurk mental image should be in your head just as much as the mental image of LLM-like self-supervised pretraining.
For more related discussion, see my posts “Thoughts on ‘AI is easy to control’ by Pope & Belrose” (sections 3 & 4) and “Heritability, Behaviorism, and Within-Lifetime RL”.
Yeah, this makes sense, thanks. I think I’ve read one or maybe both of your posts, which is probably why I started having second thoughts about my comment soon after posting it. :)
How is this very different from RLHF?
In RLHF, if you want the AI to do X, then you look at the two candidate outputs and give a thumbs-up to the one where it’s doing more X rather than less X. Very straightforward!
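Concretely, here is a toy sketch of how that thumbs-up turns into a training signal (my own illustration, not any particular lab’s pipeline; the scores below are made-up numbers): a Bradley-Terry-style loss on the preferred/rejected pair.

```python
# Toy sketch of the RLHF preference channel (illustration only; the scores
# are made-up numbers).  The labeler's thumbs-up on the "more X" option
# becomes a training signal almost directly, via a Bradley-Terry-style loss
# on the (chosen, rejected) pair.

import math

def preference_loss(score_chosen, score_rejected):
    """-log sigmoid(score_chosen - score_rejected): minimizing this pushes
    the reward model to score the thumbs-upped output higher."""
    return math.log(1.0 + math.exp(-(score_chosen - score_rejected)))

# Hypothetical reward-model scores for the two candidate outputs:
score_a, score_b = 0.3, 1.2
human_prefers_b = True          # the thumbs-up: output b does more X

chosen, rejected = (score_b, score_a) if human_prefers_b else (score_a, score_b)
print(f"preference loss = {preference_loss(chosen, rejected):.3f}")
# A gradient step on this loss raises the chosen score relative to the
# rejected one; the policy is then optimized against the learned reward model.
```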
By contrast, if the MTurkers want AlphaZero-MTurk to do X, then they have their work cut out for them. Their basic strategy would have to be: wait for AlphaZero-MTurk to do X, and then immediately throw the game (= start deliberately making really bad moves). But there are a bunch of reasons that might not work well, or at all:
(1) If AlphaZero-MTurk is already in a position where it can definitely win, then the MTurkers lose their ability to throw the game: if they start making deliberately bad moves, AlphaZero-MTurk’s win probability just goes from ≈100% to ≈100%. (The toy calculation sketched below puts numbers on this.)
(2) There’s a reward-shaping challenge: if AlphaZero-MTurk does something close to X but not quite X, should you throw the game or not? I guess you could start playing slightly worse, in proportion to how close the AI is to doing X, but it’s probably really hard to exercise such fine-grained control over your move quality.
(3) If X is a time-extended thing rather than a single move (e.g. “X = playing in a conservative style” or whatever), then what are you supposed to do?
(4) Maybe other things too.
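Here are the toy numbers for point (1) (my own illustration; throw_strength is a made-up parameter for how much deliberately bad play can raise the model’s win probability, and only the shape of the curve matters): with reward = P(win), the extra reward the MTurkers can attach to X by throwing the game is bounded by the remaining headroom, which vanishes as the model approaches certain victory.

```python
# Toy numbers for point (1) above (illustration only; throw_strength is a
# made-up parameter for how much the MTurkers can raise the model's win
# probability by deliberately playing badly).

def extra_reward_for_X(p_win, throw_strength=0.5):
    """With reward = P(win), the bonus the MTurkers can attach to doing X
    by throwing the game is bounded by the remaining headroom (1 - p_win)."""
    p_if_thrown = p_win + throw_strength * (1.0 - p_win)
    return p_if_thrown - p_win

for p in (0.5, 0.9, 0.99, 0.999):
    print(f"P(win) = {p:.3f}  ->  extra reward for doing X = {extra_reward_for_X(p):.4f}")
```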