ryan_greenblatt comments on Counting arguments provide no evidence for AI doom

ryan_greenblatt 28 Feb 2024 5:20 UTC
LW: 4 AF: 4

The key distinction in my view is whether the designers of the reward function intended for lies to be reinforced or not.

Hmm, I don’t think the intention is the key thing (at least with how I use the word and how I think Joe uses the word), I think the key thing is whether the reinforcement/reward process actively incentivizes bad behavior.

Overall, I use the term to mean basically the same thing as “deceptive alignment”. (But more specifically, I'm pointing at the definition in Joe’s report, which depends less on some notion of mesa-optimization and is a bit more precise IMO.)
- Matthew Barnett 28 Feb 2024 19:22 UTC
  LW: 2 AF: 1
  Hmm, I don’t think the intention is the key thing (at least with how I use the word and how I think Joe uses the word), I think the key thing is whether the reinforcement/reward process actively incentivizes bad behavior.
  I confusingly stated my point (and retracted my specific claim in the comment above). I think the rest of my comment basically holds, though. Here’s what I think is a clearer argument:
  - The term “schemer” evokes an image of someone who is lying to obtain power. It doesn’t particularly evoke a backstory for why the person became a liar in the first place.
  - There are at least two ways that AIs could arise that lie in order to obtain power:
    - The reward function could directly reinforce the behavior of lying to obtain power, at least at some point in the training process.
    - The reward function could have no defects (in the sense of not directly reinforcing harmful behavior), and yet an agent could nonetheless arise during training that lies in order to obtain power, simply because it is a misaligned inner optimizer (broadly speaking).
  - In both cases, one can imagine the AI eventually “playing the training game”, in the sense of having a complete understanding of its training process and deliberately choosing actions that yield high reward, according to its understanding of the training process.
  - Since both types of AIs are: (1) playing the training game, (2) lying in order to obtain power, it makes sense to call both of them “schemers”, as that simply matches the way the term is typically used.
    
    For example, Nora and Quintin started their post with, “AI doom scenarios often suppose that future AIs will engage in scheming— planning to escape, gain power, and pursue ulterior motives, while deceiving us into thinking they are aligned with our interests.” This usage did not specify the reason for the deceptive behavior arising in the first place, only that the behavior was both deceptive and aimed at gaining power.
  - Separately, I am currently confused about what it means for a behavior to be “directly reinforced” by a reward function, so I’m not completely confident in these arguments, or my own line of reasoning here. My best guess is that these are fuzzy terms that might be much less coherent than they initially appear if one tried to make these arguments more precise.
  - ryan_greenblatt 28 Feb 2024 19:32 UTC
    LW: 2 AF: 2
    
    Since both types of AIs are: (1) playing the training game, (2) lying in order to obtain power, it makes sense to call both of them “schemers”, as that simply matches the way the term is typically used.
    
    I agree this matches typical usage (and also matches usage in the overall post we’re commenting on), but sadly the word schemer in the context of Joe’s report means something more specific. I’m sad about the overall terminology situation here. It’s possible I should just always use a term like beyond-episode-goal-style-scheming.
    
    I agree this distinction is fuzzy, but I think there is likely to be an important distinction because, in the case where the behavior isn’t due to things well described as beyond-episode-goals, it should be much easier to study. See here for more commentary. There will of course be a spectrum here.