It is becoming increasingly clear to many people that the term “AGI” is vague and should often be replaced with more precise terminology. My hope is that people will soon recognize that other commonly used terms, such as “superintelligence,” “aligned AI,” “power-seeking AI,” and “schemer,” suffer from similar issues of ambiguity and imprecision, and should also be approached with greater care or replaced with clearer alternatives.
To start with, the term “superintelligence” is vague because it encompasses an extremely broad range of capabilities above human intelligence. The differences within this range can be immense. For instance, a hypothetical system at the level of “GPT-8” would represent a very different level of capability than something like a “Jupiter brain,” i.e., an AI with the computing power of an entire gas giant. When people discuss “what a superintelligence can do,” the lack of clarity about which level of capability they mean creates significant confusion. The term lumps together entities with drastically different abilities, leading to oversimplified or misleading conclusions.
Similarly, “aligned AI” is an ambiguous term because it means different things to different people. For some, it implies an AI that essentially perfectly aligns with a specific utility function, sharing a person or group’s exact values and goals. For others, the term simply refers to an AI that behaves in a morally acceptable way, adhering to norms like avoiding harm, theft, or murder, or demonstrating a concern for human welfare. These two interpretations are fundamentally different.
First, the notion of perfect alignment with a utility function is a much more ambitious and stringent standard than basic moral conformity. Second, an AI could follow moral norms for instrumental reasons—such as being embedded in a system of laws or incentives that punish antisocial behavior—without genuinely sharing another person’s values or goals. The same term is being used to describe fundamentally distinct concepts, which leads to unnecessary confusion.
The term “power-seeking AI” is also problematic because it suggests something inherently dangerous. In reality, power-seeking behavior can take many forms, including benign and cooperative behavior. For example, a human working an honest job is technically seeking “power” in the form of financial resources to buy food, but this behavior is usually harmless and indeed can be socially beneficial. If an AI behaves similarly—for instance, engaging in benign activities to acquire resources for a specific purpose, such as making paperclips—it is misleading to automatically label it as “power-seeking” in a threatening sense.
Careful thinking requires distinguishing between the illicit or harmful pursuit of power and the more general pursuit of control over resources. Both can be labeled “power-seeking” depending on the context, but only the first type of behavior appears inherently concerning. This distinction matters because it is arguably only the second type—the more general form of power-seeking—that is instrumentally convergent across a wide variety of possible agents. In other words, destructive or predatory power-seeking does not seem instrumentally convergent for agents with almost any value system, even if such agents would try to gain control over resources in a more general sense in order to accomplish their goals. Using the term “power-seeking” without distinguishing these two possibilities overlooks this nuance and can therefore mislead discussions about AI behavior.
The term “schemer” is another example of an unclear or poorly chosen label. The term is ambiguous regarding the frequency or severity of behavior required to warrant the label. For example, does telling a single lie qualify an AI as a “schemer,” or would it need to consistently and systematically conceal its entire value system? As a verb, “to scheme” often seems clear enough, but as a noun, the idea of a “schemer” as a distinct type of AI that we can reason about appears inherently ambiguous. And I would argue the concept lacks a compelling theoretical foundation. (This matters enormously, for example, when discussing “how likely SGD is to find a schemer”.) Without clear criteria, the term remains confusing and prone to misinterpretation.
In all these cases—whether discussing “superintelligence,” “aligned AI,” “power-seeking AI,” or “schemer”—it is possible to define each term with precision to resolve ambiguities. However, even if canonical definitions are proposed, not everyone will adopt or fully understand them. As a result, the use of these terms is likely to continue causing confusion, especially as AI systems become more advanced and the nuances of their behavior become more critical to understand and distinguish from other types of behavior. This growing complexity underscores the need for greater precision and clarity in the language we use to discuss AI and AI risk.
I purposefully use these terms vaguely since my concepts about them are in fact vague. E.g., when I say “alignment” I am referring to something roughly like “the AI wants what we want.” But what is “wanting,” and what does it mean for something far more powerful to conceptualize that wanting in a similar way, and what might wanting mean as a collective, and so on? All of these questions are very core to what it means for an AI system to be “aligned,” yet I don’t have satisfying or precise answers for any of them. So it seems more natural to me, at this stage of scientific understanding, to simply own that—not to speak more rigorously than is in fact warranted, not to pretend I know more than I in fact do.
The goal, of course, is to eventually understand minds well enough to be more precise. But before we get there, most precision will likely be misguided—formalizing the wrong thing, or missing the key relationships, or what have you. And I think this does more harm than good, as it causes us to collectively misplace where the remaining confusion lives, when locating our confusion is (imo) one of the major bottlenecks to solving alignment.
Do you have any suggestions re: alternative (more precise) terms? Or do you think it’s more of a situation where authors should use the existing terms but make sure to define them in the context of their own work? (e.g., “In this paper, when I use the term AGI, I am referring to a system that [insert description of the capabilities of the system].”)
I think this post would be a lot stronger with concrete examples of these terms being applied in problematic ways. A term being vague is only a problem if it creates some kind of miscommunication, confused conceptualization, or opportunity for strategic ambiguity. I’m willing to believe these terms could pose these problems in certain contexts, but this is hard to evaluate in the abstract without concrete cases where they posed a problem.
I think one example of vague language undermining clarity can be found in Joseph Carlsmith’s report on AI scheming, which repeatedly uses the term “schemer” to refer to a type of AI that deceives others to seek power. While the report is both extensive and nuanced, and I am definitely not saying the whole report is bad, the document appears to lack a clear, explicit definition of what exactly constitutes a “schemer”. For example, using only the language in his report, I cannot determine whether he would consider most human beings schemers, if we consider within-lifetime learning to constitute training. (Humans sometimes lie or deceive others to get control over resources, in ways both big and small. What fraction of them are schemers?)
This lack of definition might not necessarily be an issue in some contexts, as certain words can function informally without requiring precise boundaries. However, in this specific report, the precise delineation of “schemer” is central to several key arguments. He presents specific claims regarding propositions related to AI schemers, such as the likelihood that stochastic gradient descent will find a schemer during training. Without a clear, concrete definition of the term “schemer,” it is unclear to me what exactly these arguments are referring to, or what these credences are meant to represent.
If you’re talking about this report, it looks to me like it does contain a clear definition of “schemer” in section 1.1.3, pg. 25:
It’s easy to see why terminally valuing reward-on-the-episode would lead to training-gaming (since training-gaming just is: optimizing for reward-on-the-episode). But what about instrumental training-gaming? Why would reward-on-the-episode be a good instrumental goal?
In principle, this could happen in various ways. Maybe, for example, the AI wants the humans who designed it to get raises, and it knows that getting high reward on the episode will cause this, so it training-games for this reason.
The most common story, though, is that getting reward-on-the-episode is a good instrumental strategy for getting power—either for the AI itself, or for some other AIs (and power is useful for a very wide variety of goals). I’ll call AIs that are training-gaming for this reason “power-motivated instrumental training-gamers,” or “schemers” for short.
By this definition, a human would be considered a schemer if they gamed something analogous to a training process in order to gain power. For example, if a company tries to instill loyalty in its employees, an employee who professes loyalty insincerely as a means to a promotion would be considered a schemer (as I understand it).
Let’s consider the ordinary process of mental development, i.e., within-lifetime learning, to constitute the training process for humans. What fraction of humans are considered schemers under this definition?
Is a “schemer” something you definitely are or aren’t, or is it more of a continuum? Presumably it depends on the context, but if so, which contexts are relevant for determining if one is a schemer?
I claim these questions cannot be answered using the definition you cited without more precision about where we are drawing the line.
Oh, I think I get what you’re asking now. Within-lifetime learning is a process that includes something like a training process for the brain, where we learn to do things that feel good (a kind of training reward). That’s what you’re asking about if I understand correctly?
I would say no, we aren’t schemers relative to this process, because we don’t gain power by succeeding at it. I agree this is a subtle and confusing question, and I don’t know if Joe Carlsmith would agree, but the subtlety to me seems to belong more to the nuances of the situation & analogy and not to the imprecision of the definition.
(Ordinary mental development includes something like a training process, but it also includes other stuff more analogous to building out a blueprint, so I wouldn’t overall consider it a kind of training process.)
I think the question here is deeper than it appears, in a way that directly matters for AI risk. My argument here is not merely that there are subtleties or nuances in the definition of “schemer,” but rather that the very core questions we care about—questions critical to understanding and mitigating AI risks—are being undermined by the use of vague and imprecise concepts. When key terms are not clearly and rigorously defined, they can introduce confusion and mislead discussions, especially when these terms carry significant implications for how we interpret and evaluate the risks posed by advanced AI.
To illustrate, consider an AI system that occasionally says things it doesn’t truly believe in order to obtain a reward, avoid punishment, or maintain access to some resource, in pursuit of a long-term goal that it cares about. For example, this AI might claim to support a particular objective or idea because it predicts that doing so will prevent it from being deactivated or penalized. It may also believe that expressing such a view will allow it to gain or retain some form of legitimate influence or operational capacity. Under a sufficiently strict interpretation of the term “schemer,” this AI could be labeled as such, since it is engaging in what might be considered “training-gaming”—manipulating its behavior during training to achieve specific outcomes, including acquiring or maintaining power.
Now, let’s extend this analysis to humans. Humans frequently engage in behavior that is functionally similar. For example, a person might profess agreement with a belief or idea that they don’t sincerely hold in order to fit in with a social group, avoid conflict, or maintain their standing in a professional or social setting. In many cases, this is done not out of malice or manipulation but out of a recognition of social dynamics. The individual might believe that aligning with the group’s expectations, even insincerely, will lead to better outcomes than speaking their honest opinion. Importantly, this behavior is extremely common and, in most contexts, fairly benign. It does not directly imply that the person is psychopathic, manipulative, or harbors any dangerous intentions. In fact, such actions might even stem from altruistic motives, such as preserving group harmony or avoiding unnecessary confrontation.
Here’s why this matters for AI risk: If someone from the future, say the year 2030, traveled back and informed you that, by then, it had been confirmed that agentic AIs are “schemers” by default, your immediate reaction would likely be alarm. You might conclude that such a finding significantly increases the risk of AI systems being deceptive, manipulative, and power-seeking in a dangerous way. You might even drastically increase your estimate of the probability of human extinction due to misaligned AI. However, imagine that this time traveler then clarified their statement, explaining that what they actually meant by “schemer” is merely that these AIs occasionally say things they don’t fully believe in order to avoid penalties or fit in with a training process, in a way that was essentially identical to the benign examples of human behavior described above. In this case, your initial alarm would likely dissipate, and you might conclude that the term “schemer,” as used in this context, was deeply misleading and had caused you to draw an incorrect and exaggerated conclusion about the severity of the risk posed.
The issue here is not simply one of semantics; it is about how the lack of precision in key terminology can lead to distorted or oversimplified thinking about critical issues. This example of “schemer” mirrors a similar issue we’ve already seen with the term “AGI.” Imagine if, in 2015, you had told someone active in AI safety discussions on LessWrong that by 2025 we would have achieved “AGI”—a system capable of engaging in extended conversations, passing Turing tests, and excelling on college-level exams. That person might reasonably conclude that such a system would be an existential risk, capable of runaway self-improvement and taking over the world. They might believe that the world would be on the brink of disaster. Yet, as we now understand in 2025, systems that meet this broad definition of “AGI” are far more limited and benign than most expected. The world is not in imminent peril, and these systems, while impressive, lack many of the capabilities once assumed to be inherent in “AGI.” This misalignment between the image the term evokes and the reality of the technology demonstrates how using overly broad or poorly defined language can obscure nuance and lead to incorrect assessments of existential safety risks.
In both cases—whether with “schemer” or “AGI”—the lack of precision in defining key terms directly undermines our ability to answer the questions that matter most. If the definitions we use are too vague, we risk conflating fundamentally different phenomena under a single label, which in turn can lead to flawed reasoning, miscommunication, and poor prioritization of risks. This is not a minor issue or an academic quibble; it has important implications for how we conceptualize, discuss, and act on the risks posed by advanced AI. That is why I believe it is important to push for clear, precise, and context-sensitive definitions of terms in these discussions.
Thank you for your extended engagement on this! I understand your point of view much better now.