(Sorry in advance if this whole comment is stupid, I only read a bit of the report.)
As context, I think the kind of technical plan where we reward the AI for (apparently) being helpful is at least not totally doomed to fail. Maybe I’m putting words in people’s mouths, but I think even some pretty intense doomers would agree with the weak statement “such a plan might turn out OK for all we know” (e.g. Abram Demski talking about a similar situation here, Nate Soares describing a similar-ish thing as “maybe one nine” a.k.a. a mere 90% chance that it would fail). Of course, I would rather have a technical plan for which there’s a strong reason to believe it would actually work. :-P
Anyway, if that plan had a catastrophic safety failure (assuming proper implementation etc., and also assuming a situationally-aware AI), I think I would bet on a goal misgeneralization failure mode over a “schemer” failure mode. Specifically, such an AI could plausibly (IMO) wind up feeling motivated by any combination of “the idea of getting a reward signal”, or “the idea of the human providing a reward signal”, or “the idea of the human feeling pleased and motivated to provide a reward signal”, or “the idea of my output having properties X,Y,Z (which make it similar to outputs that have been rewarded in the past)”, or whatever else. None of those possible motivations would require “scheming”, if I understand that term correctly, because in all cases the AI would generally be doing things during training that it was directly motivated to do (as opposed to only instrumentally motivated). But some of those possible motivations are really bad because they would make the AI think that escaping from the box, launching a coup, etc., would be an awesome idea, given the opportunity.
(Incidentally, I’m having trouble fitting that concern into the Fig. 1 taxonomy. E.g. an AI with a pure wireheading motivation (“all I want is for the reward signal to be high”) is intrinsically motivated to get reward each episode as an end in itself, but it’s also intrinsically motivated to grab power given an opportunity to do so. So would such an AI be a “reward-on-the-episode seeker” or a “schemer”? Or both?? Sorry if this is a stupid question, I didn’t read the whole report.)
Agents that end up intrinsically motivated to get reward on the episode would be “terminal training-gamers/reward-on-the-episode seekers,” and not schemers, on my taxonomy. I agree that terminal training-gamers can also be motivated to seek power in problematic ways (I discuss this in the section on “non-schemers with schemer-like traits”), but I think that schemers proper are quite a bit scarier than reward-on-the-episode seekers, for reasons I describe here.
I don’t find goal misgeneralization vs. schemers to be as much of a dichotomy as this comment is making it out to be. While the two may be largely distinct for the first period of training, the current rollout method for state-of-the-art models seems to be: (1) give a model situational awareness and deploy it to the real world, (2) use this to identify alignment failures, (3) retrain the model, then repeat steps 2 and 3. If you consider all of this part of the training process (and I think that’s a fair characterization), a model that starts with goal misgeneralization quickly becomes a schemer too.
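To make that loop concrete, here is a minimal sketch of the deploy → identify failures → retrain cycle described above. It is purely illustrative and not drawn from any of the commenters: `deploy_and_collect_failures` and `retrain_on` are hypothetical stand-ins for whatever deployment monitoring and fine-tuning pipeline a lab actually uses.

```python
from typing import Any, List


def deploy_and_collect_failures(model: Any) -> List[str]:
    """Hypothetical stand-in: deploy the model and log behaviors flagged as alignment failures."""
    return []  # in practice, produced by monitoring the deployed model


def retrain_on(model: Any, failures: List[str]) -> Any:
    """Hypothetical stand-in: fine-tune the model against the collected failure cases."""
    return model


def iterative_deployment_training(model: Any, num_rounds: int = 5) -> Any:
    """Treat deployment as part of training: repeat steps (2) and (3) above."""
    for _ in range(num_rounds):
        failures = deploy_and_collect_failures(model)  # step (2): find alignment failures in the wild
        if not failures:
            break  # nothing flagged this round
        model = retrain_on(model, failures)            # step (3): retrain against those failures
    return model
```

On this framing, the point of the sketch is simply that the "training process" never really ends, so a goal-misgeneralized model keeps getting gradient pressure long after it has become situationally aware.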