Hi Matthew—I agree it would be good to get a bit more clarity here. Here’s a first pass at more specific definitions.
AI takeover: any scenario in which AIs that aren’t directly descended from human minds (e.g. human brain emulations don’t count) end up with most of the power/resources.
If humans end up with only small amounts of power, this still counts as a takeover, even if the outcome is pretty great by various standard human lights.
Bad AI takeover: any AI takeover in which (a) the AIs take over via a method that strongly violates current human cooperative norms (e.g. breaking laws, violence), and/or (b) the future ends up very low in value.
In principle we could talk separately about cases where (a) is true but (b) is false, and vice versa (see e.g. my post here). E.g. we could use "uncooperative takeovers" for (a), and "bad-future takeovers" for (b). But given that we want to avoid both (a) and (b), I think it's OK to lump them together. That said, I'm open to changing my mind on this, and I think your comments push me a bit in that direction.
Alignment: this term does indeed get used in tons of ways, and it’s probably best defined relative to some specific goal for the AI’s motivations—e.g., an AI is aligned to a principal, to a model spec, etc. That said, I think I mostly use it to mean “the AI in fact does not seek power in problematic ways, given the options available to it”—what I’ve elsewhere called “practically PS-aligned.” E.g., the AI does not choose a “problematic power-seeking” option in the sort of framework I described here, where I’m generally thinking of a paradigm problematic power-seeking option as one aimed at bad takeover.
On these definitions, the scenario you’ve given is underspecified in a few respects. In particular, I’d want to know:
How much power do the human-descended AIs (i.e., the ems) end up with?
Are the strange alien goals the AIs are pursuing such that I would ultimately think they yield outcomes very low in value when achieved, or not?
If we assume the answer to (1) is that the non-human-descended AIs end up with most of the power (it sounds like this is basically what you had in mind; see also my "people-who-like paperclips" scenario here), then yes, I'd want to call this a takeover, and I'd want to say that humans have been disempowered. Whether it was a "bad takeover", and whether this was a good or bad outcome for humanity, depends partly, I think, on (2). If this scenario in fact results in a future that is extremely low in value, in virtue of the alien-ness of the goals the AIs are pursuing, then I'd want to call it a bad takeover despite the cooperativeness of the path to getting there. I think this would also imply that the AIs are practically PS-misaligned, and I think I endorse this implication, despite the fact that they are broadly cooperative and law-abiding, though I do see a case for reserving "PS-misalignment" specifically for uncooperative power-seeking. If the resulting future is high in value, then I'd say that it was not a bad takeover and that the AIs are aligned.
Does that help? As I say, I think your comments here are pushing me a bit towards focusing specifically on uncooperative takeovers, and on defining PS-misalignment specifically in terms of AIs with a tendency to engage in uncooperative forms of power-seeking. If we went that route, then we wouldn’t need to answer my question (2) above, and we could just say that this is a non-bad takeover and that the AIs are PS-aligned.