I think one example of vague language undermining clarity can be found in Joseph Carlsmith’s report on AI scheming, which repeatedly uses the term “schemer” to refer to a type of AI that deceives others to seek power. While the report is both extensive and nuanced, and I am definitely not saying the whole report is bad, the document appears to lack a clear, explicit definition of what exactly constitutes a “schemer”. For example, using only the language in his report, I cannot determine whether he would consider most human beings schemers, if we consider within-lifetime learning to constitute training. (Humans sometimes lie or deceive others to get control over resources, in ways both big and small. What fraction of them are schemers?)
This lack of definition might not be an issue in some contexts, as certain words can function informally without requiring precise boundaries. However, in this specific report, the precise delineation of “schemer” is central to several key arguments. He assigns specific credences to propositions about AI schemers, such as the likelihood that stochastic gradient descent will select a schemer during training. Without a clear, concrete definition of the term “schemer,” it is unclear to me what exactly these arguments refer to, or what these credences are meant to represent.
If you’re talking about this report, it looks to me like it does contain a clear definition of “schemer” in section 1.1.3, pg. 25:
It’s easy to see why terminally valuing reward-on-the-episode would lead to training-gaming (since training-gaming just is: optimizing for reward-on-the-episode). But what about instrumental training-gaming? Why would reward-on-the-episode be a good instrumental goal?
In principle, this could happen in various ways. Maybe, for example, the AI wants the humans who designed it to get raises, and it knows that getting high reward on the episode will cause this, so it training-games for this reason.
The most common story, though, is that getting reward-on-the-episode is a good instrumental strategy for getting power—either for the AI itself, or for some other AIs (and power is useful for a very wide variety of goals). I’ll call AIs that are training-gaming for this reason “power-motivated instrumental training-gamers,” or “schemers” for short.
By this definition, a human would be considered a schemer if they gamed something analogous to a training process in order to gain power. For example, if a company tries to instill loyalty in its employees, an employee who professes loyalty insincerely as a means to a promotion would be considered a schemer (as I understand it).
By this definition, a human would be considered a schemer if they gamed something analogous to a training process in order to gain power.
Let’s consider the ordinary process of mental development, i.e., within-lifetime learning, to constitute the training process for humans. What fraction of humans are considered schemers under this definition?
Is a “schemer” something you definitely are or aren’t, or is it more of a continuum? Presumably it depends on the context, but if so, which contexts are relevant for determining if one is a schemer?
I claim these questions cannot be answered using the definition you cited, unless given more precision about how we are drawing the line.
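To make this concrete, here is a toy sketch (entirely my own illustration; none of these names or parameters come from the report) of what I mean. Any attempt to operationalize the definition and return a verdict has to be handed extra inputs, namely which contexts count as “training” and where on the continuum to draw the line, and the definition itself never supplies either:

```python
# Toy sketch, purely illustrative: an attempt to operationalize the cited
# definition of "schemer". Every parameter below is something the definition
# itself does not pin down, which is exactly my complaint.
from dataclasses import dataclass

@dataclass
class Agent:
    # Fraction of behavior that games a shaping process specifically to gain power.
    power_motivated_gaming_fraction: float
    # Processes that shape this agent (SGD, within-lifetime learning,
    # workplace incentives, ...).
    shaping_processes: list

def is_schemer(agent, processes_that_count, threshold):
    """A binary verdict only exists once we choose `processes_that_count`
    (which contexts are analogous to training) and `threshold` (where on
    the continuum we draw the line). Neither choice is in the definition."""
    if not any(p in processes_that_count for p in agent.shaping_processes):
        return False
    return agent.power_motivated_gaming_fraction >= threshold

# A fairly ordinary human, under the reading where within-lifetime learning
# and workplace incentives count as "training":
employee = Agent(0.02, ["within-lifetime learning", "workplace incentives"])

# The verdict flips depending on choices the definition never makes for us:
print(is_schemer(employee, {"workplace incentives"}, threshold=0.01))  # True
print(is_schemer(employee, {"SGD"}, threshold=0.01))                   # False
print(is_schemer(employee, {"workplace incentives"}, threshold=0.5))   # False
```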
Oh, I think I get what you’re asking now. Within-lifetime learning is a process that includes something like a training process for the brain, where we learn to do things that feel good (a kind of training reward). That’s what you’re asking about if I understand correctly?
I would say no, we aren’t schemers relative to this process, because we don’t gain power by succeeding at it. I agree this is a subtle and confusing question, and I don’t know whether Joe Carlsmith would agree, but the subtlety seems to me to belong to the nuances of the situation and the analogy rather than to any imprecision in the definition.
(Ordinary mental development includes something like a training process, but it also includes other stuff more analogous to building out a blueprint, so I wouldn’t overall consider it a kind of training process.)
I think the question here is deeper than it appears, in a way that directly matters for AI risk. My argument is not merely that there are subtleties or nuances in the definition of “schemer,” but that our ability to answer the core questions we care about, questions critical to understanding and mitigating AI risk, is undermined by vague and imprecise concepts. When key terms are not clearly and rigorously defined, they introduce confusion and mislead discussion, especially when those terms carry significant implications for how we interpret and evaluate the risks posed by advanced AI.
To illustrate, consider an AI system that occasionally says things it doesn’t truly believe in order to obtain a reward, avoid punishment, or maintain access to some resource, in pursuit of a long-term goal that it cares about. For example, this AI might claim to support a particular objective or idea because it predicts that doing so will prevent it from being deactivated or penalized. It may also believe that expressing such a view will allow it to gain or retain some form of legitimate influence or operational capacity. Under a sufficiently literal reading of the definition above, this AI could be labeled a schemer, since it is engaging in what might be considered “training-gaming”: adjusting its behavior during training to achieve specific outcomes, including acquiring or maintaining power.
Now, let’s extend this analysis to humans. Humans frequently engage in behavior that is functionally similar. For example, a person might profess agreement with a belief or idea they don’t sincerely hold in order to fit in with a social group, avoid conflict, or maintain their standing in a professional or social setting. In many cases, this is done not out of malice or manipulation but out of a recognition of social dynamics: the individual might believe that aligning with the group’s expectations, even insincerely, will lead to better outcomes than speaking their honest opinion. Importantly, this behavior is extremely common and, in most contexts, fairly benign. It does not imply that the person is psychopathic or manipulative, or that they harbor dangerous intentions. In fact, such actions might even stem from altruistic motives, such as preserving group harmony or avoiding unnecessary confrontation.
Here’s why this matters for AI risk: If someone from the future, say the year 2030, traveled back and informed you that, by then, it had been confirmed that agentic AIs are “schemers” by default, your immediate reaction would likely be alarm. You might conclude that such a finding significantly increases the risk of AI systems being deceptive, manipulative, and power-seeking in a dangerous way. You might even drastically increase your estimate of the probability of human extinction due to misaligned AI. However, imagine that this time traveler then clarified their statement, explaining that what they actually meant by “schemer” is merely that these AIs occasionally say things they don’t fully believe in order to avoid penalties or fit in with a training process, in a way that was essentially identical to the benign examples of human behavior described above. In this case, your initial alarm would likely dissipate, and you might conclude that the term “schemer,” as used in this context, was deeply misleading and had caused you to draw an incorrect and exaggerated conclusion about the severity of the risk posed.
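To put rough numbers on this (the figures below are entirely hypothetical and chosen only for illustration), the update you should make depends almost entirely on which phenomenon the word is picking out, not on the word itself:

```python
# Entirely hypothetical numbers, purely to illustrate how the same headline
# ("agentic AIs are schemers by default") licenses very different risk updates
# depending on which phenomenon the word actually refers to.

# Strong sense: systematically deceptive, power-seeking agents.
p_catastrophe_given_strong_sense = 0.5   # made up
# Weak sense: sometimes says things it doesn't believe to avoid penalties,
# much like the benign human behavior described above.
p_catastrophe_given_weak_sense = 0.02    # made up

ratio = p_catastrophe_given_strong_sense / p_catastrophe_given_weak_sense
print(f"Under these made-up numbers, the risk estimates differ by a factor of {ratio:.0f}.")
```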
The issue here is not simply one of semantics; it is about how the lack of precision in key terminology can lead to distorted or oversimplified thinking about critical issues. This example of “schemer” mirrors a similar issue we’ve already seen with the term “AGI.” Imagine if, in 2015, you had told someone active in AI safety discussions on LessWrong that by 2025 we would have achieved “AGI”—a system capable of engaging in extended conversations, passing Turing tests, and excelling on college-level exams. That person might reasonably conclude that such a system would be an existential risk, capable of runaway self-improvement and taking over the world. They might believe that the world would be on the brink of disaster. Yet, as we now understand in 2025, systems that meet this broad definition of “AGI” are far more limited and benign than most expected. The world is not in imminent peril, and these systems, while impressive, lack many of the capabilities once assumed to be inherent in “AGI.” This misalignment between the image the term evokes and the reality of the technology demonstrates how using overly broad or poorly defined language can obscure nuance and lead to incorrect assessments of existential safety risks.
In both cases—whether with “schemer” or “AGI”—the lack of precision in defining key terms directly undermines our ability to answer the questions that matter most. If the definitions we use are too vague, we risk conflating fundamentally different phenomena under a single label, which in turn can lead to flawed reasoning, miscommunication, and poor prioritization of risks. This is not a minor issue or an academic quibble; it has important implications for how we conceptualize, discuss, and act on the risks posed by advanced AI. That is why I believe it is important to push for clear, precise, and context-sensitive definitions of terms in these discussions.
Thank you for your extended engagement on this! I understand your point of view much better now.