Matthew Barnett comments on Instruction-following AGI is easier and more likely than value aligned AGI

Matthew Barnett 16 May 2024 0:58 UTC
LW: 61 AF: 22
35
AF
I think the main reason why we won’t align AGIs to some abstract conception of “human values” is because users won’t want to rent or purchase AI services that are aligned to such a broad, altruistic target. Imagine a version of GPT-4 that, instead of helping you, used its time and compute resources to do whatever was optimal for humanity as a whole. Even if that were a great thing for GPT-4 to do from a moral perspective, most users aren’t looking for charity when they sign up for ChatGPT, and they wouldn’t be interested in signing up for such a service. They’re just looking for an AI that helps them do whatever they personally want.
In the future I expect this fact will remain true. Broadly speaking, people will spend their resources on AI services to achieve their own goals, not the goals of humanity-as-a-whole. This will likely look a lot more like “an economy of AIs who (primarily) serve humans” rather than “a monolithic AGI that does stuff for the world (for good or ill)”. The first picture just seems like a default extrapolation of current trends. The second picture, by contrast, seems like a naive conception of the future that (perhaps uncharitably) the LessWrong community generally seems way too anchored on, for historical reasons.
- Seth Herd 16 May 2024 1:51 UTC
  16 points
  12
  Parent
  I very much agree. Part of why I wrote that post was that this is a common assumption, yet much of the discourse ignores it and addresses value alignment. Which would be better if we could get it, but it seems wildly unrealistic to expect us to try.
  
  The pragmatics of creating AGI for profit are a powerful reason to aim for instruction-following instead of value alignment; to the extent it will actually be safer and work better, that’s just one more reason that we should be thinking about that type of alignment. Not talking about it won’t keep it from taking that path.
  - RussellThor 16 May 2024 2:20 UTC
    4 points
    0
    Parent
    I think value alignment will be expected/enforced as a negative to some extent. E.g. don’t do something obviously bad (many such things are illegal anyway), and I expect that constraint to get tighter. That could also create a kind of status quo bias in what AI tools are allowed to do, since an unknown new thing could be bad or be seen as bad.
    Already the AI could “do what I mean and check” a lot better. For coding tasks etc., it will often do the wrong thing when it could clarify. I would like to see a confidence indicator that it knows what I want before it continues. I don’t want to guess how much to clarify, which is what I currently have to do; this wastes time and mental effort. You are right there will be commercial pressure to do something at least somewhat similar.
- Charlie Steiner 17 May 2024 18:45 UTC
  LW: 14 AF: 7
  10
  AF Parent
  Wow, that’s pessimistic. So in the future you imagine, we could build AIs that promote the good of all humanity, we just won’t because if a business built that AI it wouldn’t make as much money?
  - Matthew Barnett 17 May 2024 19:40 UTC
    LW: 3 AF: 2
    1
    AF Parent
    Yes, but I don’t consider this outcome very pessimistic because this is already what the current world looks like. How commonly do businesses work for the common good of all humanity, rather than for the sake of their shareholders? The world is not a utopia, but I guess that’s something I’ve already gotten used to.
    - Seth Herd 17 May 2024 20:49 UTC
      10 points
      8
      Parent
      That would be fine by me if it were a stable long-term situation, but I don’t think it is. It sounds like you’re thinking mostly of AI and not AGI that can self-improve at some point. My major point in this post is that the same logic about following human instructions applies to AGI, but that’s vastly more dangerous to have proliferate. There won’t have to be many RSI-capable AGIs before someone tells their AGI “figure out how to take over the world and turn it into my utopia, before some other AGI turns it into theirs”. It seems like the game theory will resemble the nuclear standoff, but without the mutually assured destruction aspect that prevents deployment. The incentives will be to be the first mover to prevent others from deploying AGIs in ways you don’t like.
      - Matthew Barnett 17 May 2024 21:34 UTC
        8 points
        2
        Parent
        “It sounds like you’re thinking mostly of AI and not AGI that can self-improve at some point”
        I think you can simply have an economy of arbitrarily powerful AGI services, some of which contribute to R&D in a way that feeds into the entire development process recursively. There’s nothing here about my picture that rejects general intelligence, or R&D feedback loops.
        My guess is that the actual disagreement here is that you think that at some point a unified AGI will foom and take over the world, becoming a centralized authority that is able to exert its will on everything else without constraint. I don’t think that’s likely to happen. Instead, I think we’ll see inter-agent competition and decentralization indefinitely (albeit with increasing economies of scale, prompting larger bureaucratic organizations, in the age of AGI).
        Here’s something I wrote that seems vaguely relevant, and might give you a sense as to what I’m imagining:
        Given that we are already seeing market forces shaping the values of existing commercialized AIs, it is confusing to me why an EA would assume this fact will at some point no longer be true. To explain this, my best guess is that many EAs have roughly the following model of AI development:
        1. There is “narrow AI”, which will be commercialized, and its values will be determined by market forces, regulation, and to a limited degree, the values of AI developers. In this category we find GPT-4 from OpenAI, Gemini from Google, and presumably at least a few future iterations of these products.
        2. Then there is “general AI”, which will at some point arrive, and is qualitatively different from narrow AI. Its values will be determined almost solely by the intentions of the first team to develop AGI, assuming they solve the technical problems of value alignment.
        My advice is that we should probably just drop the second step, and think of future AI as simply continuing from the first step indefinitely, albeit with AIs becoming incrementally more general and more capable over time.
        Seth Herd 17 May 2024 23:14 UTC
        4 points
        0
        Parent
        Thanks for engaging. I did read your linked post. I think you’re actually in the majority in your opinion on AI leading to a continuation and expansion of business as usual. I’ve long been curious about this line of thinking; while it makes a good bit of sense to me for the near future, I become confused at the “indefinite” part of your prediction.
        When you say that AI continues from the first step indefinitely, it seems to me that you must believe one or more of the following:
        - No one would ever tell their arbitrarily powerful AI to take over the world
          - Even if it might succeed
        - No arbitrarily powerful AI could succeed at taking over the world
          - Even if it was willing to do terrible damage in the process
        - We’ll have a limited number of humans controlling arbitrarily powerful AI
          - And an indefinitely stable balance-of-power agreement among them
        - By “indefinitely” you mean only until we create and proliferate really powerful AI
        If I believed in any of those, I’d agree with you.
        Or perhaps I’m missing some other belief we don’t share that leads to your conclusions.
        Care to share?
        Separately, in response to the post you linked: it was titled “AI values will be shaped by a variety of forces, not just the values of AI developers”. In my prediction here, AI and AGI will not have values in any important sense; it will merely carry out the values of its principals (its creators, or the government that shows up to take control). This might just be a terminological distinction, except for the following bit of implied logic: I don’t think AI needs to share clients’ values to be of immense economic and practical advantage to them. When (if) someone creates a highly capable AI system, they will instruct it to serve customers’ needs in certain ways, including following their requests within certain limits; this will not necessitate changing the A(G)I’s core values (if they exist) to use it to make enormous profits when licensed to clients. To the extent this is correct, we should go on assuming that AI will share or at least follow its creators’ values (or, IMO more likely, take orders/values from the government that takes control, citing security concerns).
        Matthew Barnett 18 May 2024 0:41 UTC
        4 points
        0
        Parent
        “No arbitrarily powerful AI could succeed at taking over the world”
        This is closest to what I am saying. The current world appears to be in a state of inter-agent competition. Even as technology has gotten more advanced, and as agents have gotten more powerful over time, no single unified agent has been able to obtain control over everything and win the entire pie, defeating all the other agents. I think we should expect this state of affairs to continue even as AGI gets invented and technology continues to get more powerful.
        (One plausible exception to the idea that “no single agent has ever won the competition over the world” is the human species itself, which dominates over other animal species. But I don’t think the human species is well-described as a unified agent, and I think our power comes mostly from accumulated technological abilities, rather than raw intelligence by itself. This distinction is important because the effects of technological innovation generally diffuse across society rather than giving highly concentrated powers to the people who invent stuff. This generally makes the situation with humans vs. animals disanalogous to a hypothetical AGI foom in several important ways.)
        Separately, I also think that even if an AGI agent could violently take over the world, it would likely not be rational for it to try, due to the fact that compromising with the rest of the world would be a less risky and more efficient way of achieving its goals. I’ve written about these ideas in a shortform thread here.
        Seth Herd 19 May 2024 1:50 UTC
        2 points
        0
        Parent
        I read your linked shortform thread. I agreed with most of your arguments against some common AGI takeover arguments. I agree that they won’t coordinate against us and won’t have “collective grudges” against us.
        
        But I don’t think the arguments for continued stability are very thorough, either. I think we just don’t know how it will play out. And I think there’s a reason to be concerned that takeover will be rational for AGIs, where it’s not for humans.
        
        The central difference in logic is the capacity for self-improvement. In your post, you addressed self-improvement by linking a Christiano piece on slow takeoff. But he noted at the start that he wasn’t arguing against self-improvement, only that the pace of self-improvement would be more modest. The potential implications for a balance of power in the world remain.
        
        Humans are all locked to a similar level of cognitive and physical capabilities. That has implications for game theory where all of the competitors are humans. Cooperation often makes more sense for humans. But the same isn’t necessarily true of AGI. Their cognitive and physical capacities can potentially be expanded on. So it’s (very loosely) like the difference between game theory in chess, and chess where one of the moves is to add new capabilities to your pieces. We can’t learn much about the new game from theory of the old, particularly if we don’t even know all of the capabilities that a player might add to their pieces.
        
        More concretely: it may be quite rational for a human controlling an AGI to tell it to try to self-improve and develop new capacities, strategies and technologies to potentially take over the world. With a first-mover advantage, such a takeover might be entirely possible. Its capacities might remain ahead of the rest of the world’s AI/AGIs if they hadn’t started to aggressively self-improve and develop the capacities to win conflicts. This would be particularly true if the aggressor AGI was willing to cause global catastrophe (e.g., EMPs, bringing down power grids).
        
        The assumption of a stable balance of power in the face of competitors that can improve their capacities in dramatic ways seems unlikely to be true by default, and at the least, worthy of close inspection. Yet I’m afraid it’s the default assumption for many.
        
        Your shortform post is more on-topic for this part of the discussion, so I’m copying this comment there and will continue there if you want. It’s worth more posts; I hope to write one myself if time allows.
        
        Edit: It looks like there’s an extensive discussion there, including my points here, so I won’t bother copying this over. It looked like the point about self-improvement destabilizing the situation had been raised but not really addressed. So I continue to think it needs more thought before we accept a future that includes proliferation of AGI capable of RSI.
- agazi 18 May 2024 16:53 UTC
  LW: 4 AF: 1
  0
  AF Parent
  I think we can already see the early innings of this with large API providers figuring out how to calibrate post-training techniques (RLHF, constitutional AI) between economic usefulness and the “mean” of western morals. Tough to go against economic incentives.
  - Seth Herd 19 May 2024 1:02 UTC
    3 points
    0
    Parent
    Yes, we do see such “values” now, but that’s a separate issue IMO.
    
    There’s an interesting thing happening in which we’re mixing discussions of AI safety and AGI x-risk. There’s no sharp line, but I think they are two importantly different things. This post was intended to be about AGI, as distinct from AI. Most of the economic and other concerns relative to the “alignment” of AI are not relevant to the alignment of AGI.
    
    This thesis could be right or wrong, but let’s keep it distinct from theories about AI in the present and near future. My thesis here (and a common thesis) is that we should be most concerned about AGI that is an entity with agency and goals, like humans have. AI as a tool is a separate thing. It’s very real and we should be concerned with it, but not let it blur into categorically distinct, goal-directed, self-aware AGI.
    
    Whether or not we actually get such AGI is an open question that should be debated, not assumed. I think the answer is very clearly that we will, and soon; as soon as tool AI is smart enough, someone will make it agentic, because agents can do useful work, and they’re interesting. So I think we’ll get AGI with real goals, distinct from the pseudo-goals implicit in current LLMs’ behavior.
    
    The post addresses such “real” AGI that is self-aware and agentic but has the sole goal of doing what people want. That is pretty much a third thing, and a somewhat counterintuitive one.
- wassname 18 May 2024 9:09 UTC
  2 points
  1
  Parent
  When you rephrase this to be about search engines
  
  I think the main reason why we won’t censor search to some abstract conception of “community values” is because users won’t want to rent or purchase search services that are censored to such a broad target
  
  It doesn’t describe reality. Most of us consume search and recommendations that have been censored (e.g. removing porn, piracy, toxicity, racism, taboo politics) in a way that puts cultural values over our preferences or interests.
  
  So perhaps it won’t be true for AI either. At least in the near term, the line between AI and search is a blurred line, and the same pressures exist on consumers and providers.
  - Seth Herd 19 May 2024 1:09 UTC
    2 points
    0
    Parent
    In the near term AI and search are blurred, but that’s a separate topic. This post was about AGI as distinct from AI. There’s no sharp line between them, but there are important distinctions, and I’m afraid we’re confused as a group because of that blurring. More above, and it’s worth its own post and some sort of new clarifying terminology. The term AGI has been watered down to include LLMs that are fairly general, rather than the original and important meaning of AI that can think about anything, implying the ability to learn, and therefore almost necessarily to have explicit goals and agency. This was about that type of “real” AGI, which is still hypothetical even though increasingly plausible in the near term.
    - wassname 19 May 2024 3:57 UTC
      1 point
      0
      Parent
      That’s true, they are different. But search still provides the closest historical analogue (maybe employees/suppliers provide another). Historical analogues have the benefit of being empirical and grounded, so I prefer them over (or with) pure reasoning or judgement.
  - Matthew Barnett 18 May 2024 9:20 UTC
    2 points
    1
    Parent
    I also expect AIs to be constrained by social norms, laws, and societal values. But I think there’s a distinction between how AIs will be constrained and how AIs will try to help humans. Although it often censors certain topics, Google still usually delivers the results the user wants, rather than serving some broader social agenda upon each query. Likewise, ChatGPT is constrained by social mores, but it’s still better described as a user assistant, not as an engine for social change or as a benevolent agent that acts on behalf of humanity.