By this definition, a human would be considered a schemer if they gamed something analogous to a training process in order to gain power.
Let’s consider the ordinary process of mental development, i.e., within-lifetime learning, to constitute the training process for humans. What fraction of humans would count as schemers under this definition?
Is a “schemer” something you definitely are or aren’t, or is it more of a continuum? Presumably it depends on the context, but if so, which contexts are relevant for determining if one is a schemer?
I claim these questions cannot be answered using the definition you cited without more precision about where we are drawing the line.
Oh, I think I get what you’re asking now. Within-lifetime learning is a process that includes something like a training process for the brain, where we learn to do things that feel good (a kind of training reward). That’s what you’re asking about, if I understand correctly?
I would say no, we aren’t schemers relative to this process, because we don’t gain power by succeeding at it. I agree this is a subtle and confusing question, and I don’t know if Joe Carlsmith would agree, but the subtlety to me seems to belong more to the nuances of the situation & analogy than to the imprecision of the definition.
(Ordinary mental development includes something like a training process, but it also includes other stuff more analogous to building out a blueprint, so I wouldn’t overall consider it a kind of training process.)
I think the question here is deeper than it appears, in a way that directly matters for AI risk. My argument is not merely that the definition of “schemer” has subtleties or nuances, but that the core questions we care about, the ones critical to understanding and mitigating AI risk, are undermined by vague and imprecise concepts. When key terms are not clearly and rigorously defined, they can confuse and mislead discussions, especially when those terms carry significant implications for how we interpret and evaluate the risks posed by advanced AI.
To illustrate, consider an AI system that occasionally says things it doesn’t truly believe in order to obtain a reward, avoid punishment, or maintain access to some resource, in pursuit of a long-term goal it cares about. For example, the AI might claim to support a particular objective or idea because it predicts that doing so will prevent it from being deactivated or penalized. It may also believe that expressing such a view will allow it to gain or retain some form of legitimate influence or operational capacity. Under a sufficiently broad reading of the term “schemer,” this AI could be labeled as such, since it is engaging in what might be considered “training-gaming”: shaping its behavior during training to achieve specific outcomes, including acquiring or maintaining power.
Now, let’s extend this analysis to humans. Humans frequently engage in functionally similar behavior. For example, a person might profess agreement with a belief or idea they don’t sincerely hold in order to fit in with a social group, avoid conflict, or maintain their standing in a professional or social setting. In many cases, this is done not out of malice or manipulation but out of a recognition of social dynamics. The individual might believe that aligning with the group’s expectations, even insincerely, will lead to better outcomes than speaking their honest opinion. Importantly, this behavior is extremely common and, in most contexts, fairly benign. It does not imply that the person is psychopathic or manipulative, or that they harbor dangerous intentions. In fact, such actions might even stem from altruistic motives, such as preserving group harmony or avoiding unnecessary confrontation.
Here’s why this matters for AI risk: if someone from the future, say the year 2030, traveled back and informed you that, by then, it had been confirmed that agentic AIs are “schemers” by default, your immediate reaction would likely be alarm. You might conclude that such a finding significantly increases the risk of AI systems being deceptive, manipulative, and power-seeking in a dangerous way. You might even drastically raise your estimate of the probability of human extinction due to misaligned AI. Now imagine the time traveler clarified their statement, explaining that what they actually meant by “schemer” is merely that these AIs occasionally say things they don’t fully believe in order to avoid penalties or fit in with a training process, in a way essentially identical to the benign human behavior described above. In that case, your initial alarm would likely dissipate, and you might conclude that the term “schemer,” as used in this context, was deeply misleading and had caused you to draw an exaggerated conclusion about the severity of the risk.
The issue here is not simply one of semantics; it is about how imprecise key terminology can distort or oversimplify thinking about critical issues. The “schemer” example mirrors a problem we have already seen with the term “AGI.” Imagine if, in 2015, you had told someone active in AI safety discussions on LessWrong that by 2025 we would have achieved “AGI”: a system capable of engaging in extended conversations, passing Turing tests, and excelling on college-level exams. That person might reasonably conclude that such a system would be an existential risk, capable of runaway self-improvement and of taking over the world, and that the world would be on the brink of disaster. Yet, as we now understand in 2025, systems meeting this broad definition of “AGI” are far more limited and benign than most expected. The world is not in imminent peril, and these systems, while impressive, lack many of the capabilities once assumed to be inherent in “AGI.” This mismatch between the image the term evokes and the reality of the technology shows how overly broad or poorly defined language can obscure nuance and lead to incorrect assessments of existential safety risks.
In both cases—whether with “schemer” or “AGI”—the lack of precision in defining key terms directly undermines our ability to answer the questions that matter most. If the definitions we use are too vague, we risk conflating fundamentally different phenomena under a single label, which in turn can lead to flawed reasoning, miscommunication, and poor prioritization of risks. This is not a minor issue or an academic quibble; it has important implications for how we conceptualize, discuss, and act on the risks posed by advanced AI. That is why I believe it is important to push for clear, precise, and context-sensitive definitions of terms in these discussions.
Thank you for your extended engagement on this! I understand your point of view much better now.