Mau comments on My Overview of the AI Alignment Landscape: Threat Models

Mau 26 Dec 2021 6:52 UTC
11 points
AF
I’m still pretty confused by “You get what you measure” being framed as a distinct threat model from power-seeking AI (rather than as another sub-threat model). I’ll try to address two defenses of framing them as distinct threat models, both of which I interpret this post as suggesting (in the context of this earlier comment on the overview post). Broadly, I’ll argue that power-seeking AI is necessary for “you get what you measure” issues to pose existential threats, so “you get what you measure” concerns are best thought of as a sub-threat model of power-seeking AI.

(Edit: An aspect of “you get what you measure” concerns—the emphasis on something like “sufficiently strong optimization for some goal is very bad for different goals”—is a tweaked framing of power-seeking AI risk in general, rather than a subset.)

> Lock-in: Once we’ve noticed problems, how difficult will they be to fix, and how much resistance will there be? For example, despite the clear harms of CO2 emissions, fossil fuels are such an indispensable part of the economy that it’s incredibly hard to get rid of them. A similar thing could happen if AI systems become an indispensable part of the economy, which seems pretty plausible given how incredibly useful human-level AI would be. As another example, imagine how hard it would be to ban social media, if we as a society decided that this was net bad for the world.

Unless I’m missing something, this is just an argument for why AI might get locked in—not an argument for why misaligned AI might get locked in. AI becoming an indispensable part of the economy isn’t a long-term problem if people remain capable of identifying and fixing problems with the AI. So we still need an additional lock-in mechanism (e.g. the initially deployed, misaligned AI being power-seeking) to have trouble. (If we’re wondering how hard it will be to fix/improve non-power-seeking AI after it’s been deployed, the difficulty of banning social media doesn’t seem like a great analogy; a more relevant analogy would be the difficulty of fixing/improving social media after it’s been deployed. Empirically, this doesn’t seem that hard. For example, YouTube’s recommendation algorithm started as a click-maximizer, and YouTube has already modified it to learn from human feedback.)

> See Sam Clarke’s excellent post for more discussion of examples of lock-in.

I don’t think Sam Clarke’s post (which I’m also a fan of) proposes any lock-in mechanisms that (a) would plausibly cause existential catastrophe from misaligned AI and (b) do not depend on AI being power-seeking. Clarke proposes five mechanisms by which Part 1 of “What Failure Looks Like” could get locked in. Addressing each of these in turn (in the context of his original post):
- (1) short-term incentives and collective action—arguably fails condition (a) or fails condition (b); if we don’t assume AI will be power-seeking, then I see no reason why these difficulties would get much worse in hundreds of years than they are now, i.e. no reason why this on its own is a lock-in mechanism.
- (2) regulatory capture—the worry here is that the companies controlling AI might have and permanently act on bad values; this arguably fails condition (a), because if we’re mainly worried about AI developers being bad, then focusing on intent alignment doesn’t make that much sense.
- (3) genuine ambiguity—arguably fails condition (a) or fails condition (b); if we don’t assume AI will be power-seeking, then I see no reason why these difficulties would get much worse in hundreds of years than they are now, i.e. no reason why this on its own is a lock-in mechanism.
- (4) dependency and deskilling—addressed above
- (5) [AI] opposition to [humanity] taking back influence—clearly fails condition (b)
So I think there remains no plausible alignment-relevant threat model for “You get what you measure” that doesn’t fall under “power-seeking AI.”
- paulfchristiano 26 Dec 2021 7:08 UTC
  LW: 12 AF: 7
  > I’m still pretty confused by “You get what you measure” being framed as a distinct threat model from power-seeking AI (rather than as another sub-threat model)
  I also consider catastrophic versions of “you get what you measure” to be a subset/framing/whatever of “misaligned power-seeking.” I think misaligned power-seeking is the main way the problem is locked in.
  To a lesser extent, “you get what you measure” may also be an obstacle to using AI systems to help us navigate complex challenges without quick feedback, like improving governance. But I don’t think that’s an x-risk in itself, more like a missed opportunity to do better. This is in the same category as e.g. failures of the education system, though it’s plausibly better-leveraged if you have EA attitudes about AI being extremely important/leveraged. (ETA: I also view AI coordination, and differential capability progress, in a similar way.)