I think there’s a harder version of the value alignment problem, where the question looks like, “what’s the right goal/task spec to put inside a sovereign AI that will take over the universe?” You probably don’t want this sovereign AI to adopt the values of any particular human, or even of modern humanity as a whole, so you need to do some Ambitious Value Learning/moral philosophy and not just intent alignment. In this scenario, the distinction between humane and human values does matter. (In fact, you can find people like Stuart Russell emphasizing this point a bunch.) Unfortunately, it seems that ambitious value learning is really hard, the AIs are coming really fast, and it doesn’t seem necessary for preventing x-risk anyway, so...
Most people in AIS are trying to solve a significantly less ambitious version of this problem: just try to get an AI that will reliably try to do what a human wants it to do (i.e. intent alignment). In this case, we’re explicitly punting the ambitious value learning problem down the line. Here, we’re basically not talking about the problem of having an AI learn humane values, but instead the problem of having it “do what its user wants” (i.e. “human values” or “the technical alignment problem” in Nicky’s dichotomy). So it’s actually pretty accurate to say that a lot of alignment work is trying to align AIs wrt “human values”, even if a lot of the motivation is to eventually make AIs that have “humane values”.[1] (And it’s worth noting that making an AI that’s robustly intent aligned sure seems to require tackling a lot of the ‘intuition’-derived problems you bring up already!)
uh, that being said, I’m not sure your framing isn’t just … better anyways? Like, Stuart seems to have lots of success talking to people about assistance games, even if that doesn’t faithfully represent what a majority of the field thinks is the highest priority thing to work on. So I’m not sure if me pointing this out actually helps anyone here?
Of course, you need an argument that “making AIs aligned with user intent” eventually leads to “AIs with humane values”, but I think the straightforward argument goes through: a lot of the immediate risk comes from AIs that aren’t doing what their users intended, and having AIs that are aligned with user intent seems really helpful for then tackling the tricky ambitious value learning problem.