I’m confused about the clarifications in this post. Generally speaking, I think the terms “alignment”, “takeover”, and “disempowered” are vague and can mean dramatically different things to different people. My hope when I started reading this post was to see you define these terms precisely and unambiguously. Unfortunately, I am still confused about how you are using these terms, although it could very easily be my fault for not reading carefully enough.
Here is a scenario I’d like you to imagine, which I think might help clarify where I’m confused:
Suppose we grant AIs legal rights and they become integrated into our society. Humans continue to survive and thrive, but AIs eventually and gradually accumulate the vast majority of the wealth, political power, and social status in society through lawful means. These AIs are sentient, extremely competent, mostly have strange and alien-like goals, and yet are considered “people” by most humans, according to an expansive definition of that word. Importantly, they are equal in the eyes of the law, and have no limitations on their ability to hold office, write new laws, and hold other positions of power. The AIs are agentic, autonomous, plan over long time horizons, and are not enslaved to the humans in any way. Moreover, many humans also upload themselves onto computers and become AIs themselves. These humans expand their own cognition and often choose to drop the “human” label from their personal identity after they are uploaded.
Here are my questions:
Does this scenario count as “AI takeover” according to you? Was it a “bad takeover”?
Are the AIs “aligned” in this scenario?
Are the humans “disempowered” in this scenario?
Was this a good or bad outcome for humanity?
Hi Matthew—I agree it would be good to get a bit more clarity here. Here’s a first pass at more specific definitions.
AI takeover: any scenario in which AIs that aren’t directly descended from human minds (e.g. human brain emulations don’t count) end up with most of the power/resources.
If humans end up with small amounts of power, this can still be a takeover, even if it’s pretty great by various standard human lights.
Bad AI takeover: any AI takeover in which (a) the AIs take over via a method that strongly violates current human cooperative norms (e.g., breaking laws, violence), and/or (b) the future ends up very low in value.
In principle we could talk separately about cases where (a) is true but (b) is false, and vice versa (see, e.g., my post here). E.g., we could use “uncooperative takeovers” for (a) and “bad-future takeovers” for (b). But given that we want to avoid both (a) and (b), I think it’s OK to lump them together. That said, I’m open to changing my mind on this, and your comments push me a bit in that direction.
Alignment: this term does indeed get used in tons of ways, and it’s probably best defined relative to some specific goal for the AI’s motivations—e.g., an AI is aligned to a principal, to a model spec, etc. That said, I think I mostly use it to mean “the AI in fact does not seek power in problematic ways, given the options available to it”—what I’ve elsewhere called “practically PS-aligned.” E.g., the AI does not choose a “problematic power-seeking” option in the sort of framework I described here, where I’m generally thinking of a paradigm problematic power-seeking option as one aimed at bad takeover.
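Since the “and/or” structure here can get confusing, here’s a minimal Python sketch of the taxonomy above (the Scenario fields and function names are my own illustrative shorthand, not terminology from the post):

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    # Whether AIs not directly descended from human minds (so ems/brain
    # emulations don't count) end up with most of the power/resources.
    nonhuman_ais_hold_most_power: bool
    # Whether that power was acquired via strong violations of current human
    # cooperative norms (e.g., breaking laws, violence).
    norm_violating_path: bool
    # Whether the resulting future ends up very low in value.
    very_low_value_future: bool

def is_takeover(s: Scenario) -> bool:
    """AI takeover: non-human-descended AIs end up with most of the power."""
    return s.nonhuman_ais_hold_most_power

def is_uncooperative_takeover(s: Scenario) -> bool:
    """Condition (a): the takeover strongly violates cooperative norms."""
    return is_takeover(s) and s.norm_violating_path

def is_bad_future_takeover(s: Scenario) -> bool:
    """Condition (b): the takeover leads to a future very low in value."""
    return is_takeover(s) and s.very_low_value_future

def is_bad_takeover(s: Scenario) -> bool:
    """Bad AI takeover: (a) and/or (b)."""
    return is_uncooperative_takeover(s) or is_bad_future_takeover(s)
```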
On these definitions, the scenario you’ve given is underspecified in a few respects. In particular, I’d want to know:
1. How much power do the human-descended AIs—i.e., the ems—end up with?
2. Are the strange, alien goals the AIs are pursuing such that, when achieved, they yield outcomes I would ultimately consider very low in value?
If we assume the answer to (1) is that the non-human-descended AIs end up with most of the power (it sounds like this is basically what you had in mind—see also my “people-who-like paperclips” scenario here), then yes, I’d want to call this a takeover, and I’d want to say that humans have been disempowered. Whether it was a “bad takeover”, and whether this was a good or bad outcome for humanity, I think depends partly on (2).

If this scenario in fact results in a future that is extremely low in value, in virtue of the alien-ness of the goals the AIs are pursuing, then I’d want to call it a bad takeover despite the cooperativeness of the path that got there. I think this would also imply that the AIs are practically PS-misaligned, and I think I endorse this implication, despite the fact that they are broadly cooperative and law-abiding—though I do see a case for reserving “PS-misalignment” specifically for uncooperative power-seeking. If the resulting future is high in value, then I’d say that it was not a bad takeover and that the AIs are aligned.
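To spell out how the verdict turns on (2), continuing the toy sketch above (the assignments encode my reading of your scenario, with (1) assumed to resolve to “yes”):

```python
# Matthew's scenario: non-human-descended AIs hold most of the power, acquired
# entirely through lawful means; the value of the resulting future is the open
# question (2), so we check both possible answers.
for low_value in (False, True):
    s = Scenario(nonhuman_ais_hold_most_power=True,
                 norm_violating_path=False,
                 very_low_value_future=low_value)
    print(f"low-value future: {low_value} -> takeover: {is_takeover(s)}, "
          f"bad takeover: {is_bad_takeover(s)}")
# It counts as a takeover either way; it is a *bad* takeover only in the
# low-value case, and an uncooperative takeover in neither.
```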
Does that help? As I say, I think your comments here are pushing me a bit towards focusing specifically on uncooperative takeovers, and on defining PS-misalignment specifically in terms of AIs with a tendency to engage in uncooperative forms of power-seeking. If we went that route, then we wouldn’t need to answer my question (2) above, and we could just say that this is a non-bad takeover and that the AIs are PS-aligned.