(I might write a longer response later, but I thought it would be worth writing a quick response now. Cross-posted from the EA forum, and I know you’ve replied there, but I’m posting anyway.)
I have a few points of agreement and a few points of disagreement:
Agreements:
The strict counting argument seems very weak as an argument for scheming, essentially for the reason you identified: it relies on a uniform prior over AI goals, which seems like a really bad model of the situation (see the toy sketch just after this list).
The hazy counting argument—while stronger than the strict counting argument—still seems like weak evidence for scheming. One way of seeing this is, as you pointed out, to show that essentially identical arguments can be applied to deep learning in different contexts that nonetheless contradict empirical evidence.
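To make the first agreement concrete, here is a minimal sketch of how I understand the strict counting argument, using toy symbols of my own rather than anything taken from the post: suppose there are $N$ goal-specifications compatible with good training performance, of which $N_{\text{aligned}}$ are aligned and $N_{\text{scheme}} = N - N_{\text{aligned}}$ would motivate scheming. Under a uniform prior over these specifications,

$$P(\text{scheming}) \;=\; \frac{N_{\text{scheme}}}{N} \;\approx\; 1 \quad \text{whenever } N_{\text{scheme}} \gg N_{\text{aligned}}.$$

The conclusion is carried entirely by the uniform prior rather than by any fact about how training selects goals. The same template, applied to parameter settings rather than goals, would seem to "predict" that trained networks memorize rather than generalize, since far more parameter settings fit the training data without generalizing; I take this to be the kind of empirical contradiction referred to in the second agreement above.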
Some points of disagreement:
I think the title overstates the strength of the conclusion. The hazy counting argument seems weak to me but I don’t think it’s literally “no evidence” for the claim here: that future AIs will scheme.
I disagree with the bottom-line conclusion: “we should assign very low credence to the spontaneous emergence of scheming in future AI systems—perhaps 0.1% or less”.
I think it’s too early to be very confident in sweeping claims about the behavior or inner workings of future AI systems, especially in the long run. I don’t think the evidence we have about these things is very strong right now.
One caveat: I think the claim here is vague. I don’t know what counts as “spontaneous emergence”, for example. And I don’t know how to operationalize AI scheming. I personally think scheming comes in degrees: some forms of scheming might be relatively benign and mild, and others could be more extreme and pervasive.
Ultimately I think you’ve only rebutted one argument for scheming—the counting argument. A more plausible argument for scheming, in my opinion, is simply that the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don’t scheme. Actors such as AI labs have strong incentives to be vigilant against these types of mistakes when training AIs, but I don’t expect people to come up with perfect solutions. So I’m not convinced that AIs won’t scheme at all.
If by “scheming” all you mean is that an agent deceives someone in order to get power, I’d argue that many humans scheme all the time. Politicians routinely scheme, for example, by pretending to have values that are more palatable to the general public, in order to receive votes. Society bears some costs from scheming, and pays costs to mitigate the effects of scheming. Combined, these costs are not crazy-high fractions of GDP; but nonetheless, scheming is a constant fact of life.
If future AIs are “as aligned as humans”, then AIs will probably scheme frequently. I think an important question is how intensely and how pervasively AIs will scheme; and thus, how much society will have to pay as a result of scheming. If AIs scheme way more than humans, then this could be catastrophic, but I haven’t yet seen any decent argument for that theory.
So ultimately I am skeptical that AI scheming will cause human extinction or disempowerment, but probably for different reasons than the ones in your essay: I think the negative effects of scheming can probably be adequately mitigated by paying some costs even if it arises.
I don’t think you need to believe in any strong version of goal realism in order to accept the claim that AIs will intuitively have “goals” that they robustly attempt to pursue. It seems pretty natural to me that people will purposely design AIs that have goals in an ordinary sense, and some of these goals will be “misaligned” in the sense that the designer did not intend for them. My relative optimism about AI scheming doesn’t come from thinking that AIs won’t robustly pursue goals, but instead comes largely from my beliefs that:
AIs, like all real-world agents, will be subject to constraints when pursuing their goals. These constraints include things like the fact that it’s extremely hard and risky to take over the whole world and then optimize the universe exactly according to what you want. As a result, AIs with goals that differ from what humans (and other AIs) want will probably end up compromising and trading with other agents instead of pursuing world takeover. This is a benign failure and doesn’t seem very bad.
The amount of investment we put into mitigating scheming is not an exogenous variable, but instead will respond to evidence about how pervasive scheming is in AI systems, and how big of a deal AI scheming is. And I think we’ll accumulate lots of evidence about the pervasiveness of AI scheming in deep learning over time (e.g., via experiments with model organisms of alignment), allowing us to set the level of investment in AI safety at a reasonable level as AI gets incrementally more advanced.
If we experimentally determine that scheming is very important and very difficult to mitigate in AI systems, we’ll probably respond by spending a lot more money on mitigating scheming, and vice versa. In effect, I don’t think we have good reasons to think that society will spend a suboptimal amount on mitigating scheming.
Ultimately I think you’ve only rebutted one argument for scheming—the counting argument. A more plausible argument for scheming, in my opinion, is simply that the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don’t scheme.
It’s worth noting here that Carlsmith’s original usage of the term “scheming” just refers to AIs that perform well on training and evaluations for instrumental reasons because they have longer-run goals or similar.
So, AIs lying because this was directly reinforced wouldn’t itself be scheming behavior in Carlsmith’s terminology.
However, it’s worth noting that part of Carlsmith’s argument involves arguing that smart AIs will likely have to explicitly reason about the reinforcement process (sometimes called playing the training game) and this will likely involve lying.
Perhaps I was being too loose with my language, and it’s possible this is a pointless pedantic discussion about terminology, but I think I was still pointing to what Carlsmith called schemers in that quote. Here’s Joe Carlsmith’s terminological breakdown:
The key distinction in my view is whether the designers of the reward function intended for lies to be reinforced or not. [ETA: this was confusingly stated. What I meant is that if people design a reward function that accidentally reinforces lying in order to obtain power, it seems reasonable to call the agent that results from training on that reward function a “schemer” given Carlsmith’s terminology, and common sense.]
If lying to obtain power is reinforced but the designers either do not know this, or do not know how to mitigate this behavior, then it still seems reasonable to call the resulting model a “schemer”. In Ajeya Cotra’s story, for example:
Alex was incentivized to lie because it got rewards for taking actions that were superficially rated as good even if they weren’t actually good, i.e. Alex was “lying because this was directly reinforced”. She wrote, “Because humans have systematic errors in judgment, there are many scenarios where acting deceitfully causes humans to reward Alex’s behavior more highly. Because Alex is a skilled, situationally aware, creative planner, it will understand this; because Alex’s training pushes it to maximize its expected reward, it will be pushed to act on this understanding and behave deceptively.”
Alex was “playing the training game”, as Ajeya Cotra explicitly states several times in her story.
Alex was playing the training game in order to get power for itself or for other AIs; this is clear, since the model literally takes over the world and disempowers humanity at the end.
Arguably, Alex didn’t purely care about reward-on-the-episode, since it took over the world. Yes, Alex cared about rewards, but not necessarily on this episode. Maybe I’m wrong here. But even if Alex only cared about reward-on-the-episode, you could easily construct a scenario similar to Ajeya’s story in which a model begins to care about things other than reward-on-the-episode, which nonetheless fits the story of “the AI is lying because this was directly reinforced”.
The key distinction in my view is whether the designers of the reward function intended for lies to be reinforced or not.
Hmm, I don’t think the intention is the key thing (at least with how I use the word and how I think Joe uses the word); I think the key thing is whether the reinforcement/reward process actively incentivizes bad behavior.
Overall, I use the term to mean basically the same thing as “deceptive alignment”. (But more specifically, pointing to the definition in Joe’s report, which depends less on some notion of mesa-optimization and is a bit more precise IMO.)
Hmm, I don’t think the intention is the key thing (at least with how I use the word and how I think Joe uses the word); I think the key thing is whether the reinforcement/reward process actively incentivizes bad behavior.
I confusingly stated my point (and retracted my specific claim in the comment above). I think the rest of my comment basically holds, though. Here’s what I think is a clearer argument:
The term “schemer” evokes an image of someone who is lying to obtain power. It doesn’t particularly evoke a backstory for why the person became a liar in the first place.
There are at least two ways that AIs could arise that lie in order to obtain power:
The reward function could directly reinforce the behavior of lying to obtain power, at least at some point in the training process.
The reward function could have no defects (in the sense of not directly reinforcing harmful behavior), and yet an agent could nonetheless arise during training that lies in order to obtain power, simply because it is a misaligned inner optimizer (broadly speaking).
In both cases, one can imagine the AI eventually “playing the training game”, in the sense of having a complete understanding of its training process and deliberately choosing actions that yield high reward, according to its understanding of the training process.
Since both types of AIs are: (1) playing the training game, (2) lying in order to obtain power, it makes sense to call both of them “schemers”, as that simply matches the way the term is typically used.
For example, Nora and Quintin started their post with, “AI doom scenarios often suppose that future AIs will engage in scheming— planning to escape, gain power, and pursue ulterior motives, while deceiving us into thinking they are aligned with our interests.” This usage did not specify the reason for the deceptive behavior arising in the first place, only that the behavior was both deceptive and aimed at gaining power.
Separately, I am currently confused about what it means for a behavior to be “directly reinforced” by a reward function, so I’m not completely confident in these arguments or in my own line of reasoning here. My best guess is that these are fuzzy terms that might be much less coherent than they initially appear if one tried to make these arguments more precise.
Since both types of AIs are: (1) playing the training game, (2) lying in order to obtain power, it makes sense to call both of them “schemers”, as that simply matches the way the term is typically used.
I agree this matches typical usage (and also matches usage in the overall post we’re commenting on), but sadly the word schemer in the context of Joe’s report means something more specific. I’m sad about the overall terminology situation here. It’s possible I should just always use a term like beyond-episode-goal-style-scheming.
I agree this distinction is fuzzy, but I think it is likely to be an important one, because the case where the behavior isn’t due to things well described as beyond-episode goals should be much easier to study. See here for more commentary. There will of course be a spectrum here.
I think the title overstates the strength of the conclusion. The hazy counting argument seems weak to me but I don’t think it’s literally “no evidence” for the claim here: that future AIs will scheme.
I agree, they’re wrong to claim it’s “no evidence.” I think that counting arguments are extremely slight evidence against scheming, because they’re weaker than the arguments I’d expect our community’s thinkers to find in worlds where scheming was real. (Although I agree that on the object-level and in isolation, the arguments are tiiiny positive evidence.)
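To spell out that Bayesian point with a toy calculation (the numbers are made up purely for illustration): let $H$ be “future AIs will scheme” and let $E$ be “the best arguments produced for scheming so far are weak counting arguments”. If strong arguments would more likely have been found by now in worlds where $H$ is true, then $P(E \mid H) < P(E \mid \neg H)$, say $0.6$ versus $0.8$, and observing $E$ shifts the odds modestly against $H$:

$$\frac{P(H \mid E)}{P(\neg H \mid E)} \;=\; \frac{P(E \mid H)}{P(E \mid \neg H)} \cdot \frac{P(H)}{P(\neg H)} \;=\; \frac{0.6}{0.8} \cdot \frac{P(H)}{P(\neg H)} \;=\; 0.75 \cdot \frac{P(H)}{P(\neg H)},$$

even though the counting argument considered on its own, in isolation, is a small update in the other direction.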
I think in Ajeya’s story the core threat model isn’t well described as scheming and is better described as the AI seeking some proxy of reward.
You can find my EA forum response here.