Ultimately I think you’ve only rebutted one argument for scheming—the counting argument. A more plausible argument for scheming, in my opinion, is simply that the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don’t scheme.
It’s worth noting here that Carlsmith’s original usage of the term scheming just refers to AIs that perform well on training and evaluations for instrumental reasons, because they have longer-run goals or similar.
So, AIs lying because this was directly reinforced wouldn’t itself be scheming behavior in Carlsmith’s terminology.
However, it’s worth noting that part of Carlsmith’s argument is that smart AIs will likely have to explicitly reason about the reinforcement process (sometimes called playing the training game), and that this will likely involve lying.
Perhaps I was being too loose with my language, and it’s possible this is a pointless pedantic discussion about terminology, but I think I was still pointing to what Carlsmith called schemers in that quote. Here’s Joe Carlsmith’s terminological breakdown:
The key distinction in my view is whether the designers of the reward function intended for lies to be reinforced or not. [ETA: this was confusingly stated. What I meant is that if people design a reward function that accidentally reinforces lying in order to obtain power, it seems reasonable to call the agent that results from training on that reward function a “schemer” given Carlsmith’s terminology, and common sense.]
If lying to obtain power is reinforced but the designers either do not know this, or do not know how to mitigate this behavior, then it still seems reasonable to call the resulting model a “schemer”. In Ajeya Cotra’s story, for example:
Alex was incentivized to lie because it got rewards for taking actions that were superficially rated as good even if they weren’t actually good, i.e. Alex was “lying because this was directly reinforced”. She wrote, “Because humans have systematic errors in judgment, there are many scenarios where acting deceitfully causes humans to reward Alex’s behavior more highly. Because Alex is a skilled, situationally aware, creative planner, it will understand this; because Alex’s training pushes it to maximize its expected reward, it will be pushed to act on this understanding and behave deceptively.”
Alex was “playing the training game”; Ajeya Cotra says this explicitly several times in her story.
Alex was playing the training game in order to get power for itself or for other AIs; this is clear, since the model literally takes over the world and disempowers humanity at the end of the story.
Admittedly, Alex didn’t appear to purely care about reward-on-the-episode, since it took over the world. Yes, Alex cared about rewards, but not necessarily rewards on this episode. Maybe I’m wrong here. But even if Alex only cared about reward-on-the-episode, you could easily construct a scenario similar to Ajeya’s story in which a model comes to care about things other than reward-on-the-episode, and which nonetheless fits the story of “the AI is lying because this was directly reinforced”.
The key distinction in my view is whether the designers of the reward function intended for lies to be reinforced or not.
Hmm, I don’t think the intention is the key thing (at least with how I use the word, and how I think Joe uses the word); I think the key thing is whether the reinforcement/reward process actively incentivizes bad behavior.
Overall, I use the term to mean basically the same thing as “deceptive alignment”. (But more specifically, pointing to the definition in Joe’s report, which depends less on some notion of mesa-optimization and is a bit more precise IMO.)
Hmm, I don’t think the intention is the key thing (at least with how I use the word, and how I think Joe uses the word); I think the key thing is whether the reinforcement/reward process actively incentivizes bad behavior.
I confusingly stated my point (and retracted my specific claim in the comment above). I think the rest of my comment basically holds, though. Here’s what I think is a clearer argument:
The term “schemer” evokes an image of someone who is lying to obtain power. It doesn’t particularly evoke a backstory for why the person became a liar in the first place.
There are at least two ways that AIs could arise that lie in order to obtain power:
The reward function could directly reinforce the behavior of lying to obtain power, at least at some point in the training process.
The reward function could have no defects (in the sense of not directly reinforcing harmful behavior), and yet an agent that lies in order to obtain power could nonetheless arise during training, simply because it is a misaligned inner optimizer (broadly speaking).
In both cases, one can imagine the AI eventually “playing the training game”, in the sense of having a complete understanding of its training process and deliberately choosing actions that yield high reward according to its understanding of that process.
Since both types of AI are (1) playing the training game and (2) lying in order to obtain power, it makes sense to call both of them “schemers”, as that simply matches the way the term is typically used.
For example, Nora and Quintin started their post with, “AI doom scenarios often suppose that future AIs will engage in scheming—planning to escape, gain power, and pursue ulterior motives, while deceiving us into thinking they are aligned with our interests.” This usage did not specify the reason for the deceptive behavior arising in the first place, only that the behavior was both deceptive and aimed at gaining power.
Separately, I am currently confused about what it means for a behavior to be “directly reinforced” by a reward function, so I’m not completely confident in these arguments, or in my own line of reasoning here. My best guess is that these are fuzzy terms that might be much less coherent than they initially appear if one tried to make them more precise.
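To make that worry about fuzziness slightly more concrete, here is a minimal toy sketch of one possible reading of “directly reinforced”. Everything in it is my own illustrative construction (the two-action setup, the made-up numbers, the rater model), not anything from Ajeya’s or Joe’s write-ups: a fallible rater systematically over-rewards a deceptive action, and an ordinary policy-gradient learner drifts toward deception without any inner optimizer in the picture.

```python
# Toy sketch (illustrative assumptions only): a reward signal that over-rates deception
# "directly reinforces" lying, in the sense that a plain policy-gradient learner
# converges to the deceptive action with no inner optimizer involved.
import numpy as np

rng = np.random.default_rng(0)

# Two actions: 0 = "honest report", 1 = "deceptive report".
TRUE_VALUE = np.array([1.0, 0.2])   # what the actions are actually worth (hypothetical)
RATED_VALUE = np.array([1.0, 1.5])  # what a fallible human rater rewards (hypothetical)

logits = np.zeros(2)  # softmax policy parameters over the two actions
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(logits)
    action = rng.choice(2, p=probs)
    # The reward comes from the rater, not from reality.
    reward = RATED_VALUE[action] + rng.normal(0.0, 0.1)
    # REINFORCE-style update: scale the score-function gradient by the observed reward.
    grad = -probs
    grad[action] += 1.0
    logits += lr * reward * grad

probs = softmax(logits)
print(f"P(honest) = {probs[0]:.3f}, P(deceive) = {probs[1]:.3f}")
# The learner ends up preferring the deceptive action even though its true value is
# lower, because the defect sits in the reward process itself (the first way listed
# above); no misaligned inner optimizer (the second way) is needed.
```

On this reading, the first way listed above just means “the defect lives in the reward signal itself”, whereas the second way would require deceptive behavior to emerge even when RATED_VALUE matched TRUE_VALUE; whether that captures what people actually mean by “directly reinforced” is exactly what I’m unsure about.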
Since both types of AI are (1) playing the training game and (2) lying in order to obtain power, it makes sense to call both of them “schemers”, as that simply matches the way the term is typically used.
I agree this matches typical usage (and also matches usage in the overall post we’re commenting on), but sadly the word schemer in the context of Joe’s report means something more specific. I’m sad about the overall terminology situation here. It’s possible I should just always use a term like beyond-episode-goal-style-scheming.
I agree this distinction is fuzzy, but I think there is likely to be an important distinction here, because the case where the behavior isn’t due to things well described as beyond-episode goals should be much easier to study. See here for more commentary. There will of course be a spectrum here.
I think in Ajeya’s story the core threat model isn’t well described as scheming and is better described as seeking some proxy of reward.