I don’t think this really qualifies for year’s best. It’s interesting if you think, or have to explain to someone who thinks, “just raise an RL agent in a human environment and it would come out aligned, right?” I’m surprised anyone thinks that, but this is a pretty good writeup of why you shouldn’t expect it to work.
The biggest portion is about why we shouldn’t expect an AGI to become aligned by exposing an RL system to a human-like environment. A child, Alexander says, might be punished for stealing a cookie, and it could internalize either the rule “don’t get caught stealing” or the rule “don’t steal.” If it internalizes the latter, it’s aligned.
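As a toy illustration of Alexander’s point (my own sketch, not from the post): if the only negative signal is punishment-on-detection, a reward-maximizing learner has no reason to prefer “don’t steal” over “steal when you won’t get caught,” because the reward function can only distinguish the two through detection. The numbers below are made up purely to show the ranking.

```python
# Toy sketch (my own, not from the post): expected reward for three policies
# when the only penalty is being *caught* stealing. All values are made up.

COOKIE_VALUE = 1.0      # reward for eating the cookie
PUNISHMENT   = -3.0     # reward received when caught stealing

def expected_reward(steal_prob, catch_prob):
    """Expected reward of a policy that steals with probability steal_prob
    in an environment that detects theft with probability catch_prob."""
    return steal_prob * (COOKIE_VALUE + catch_prob * PUNISHMENT)

policies = {
    "never steal":               expected_reward(steal_prob=0.0, catch_prob=0.5),
    "always steal":              expected_reward(steal_prob=1.0, catch_prob=0.5),
    "steal only when unwatched": expected_reward(steal_prob=1.0, catch_prob=0.05),
}

for name, value in policies.items():
    print(f"{name:28s} expected reward = {value:+.2f}")

# The punishment-based signal ranks "steal only when unwatched" highest:
# it teaches "don't get caught", not "don't steal".
```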
Yudkowsky says that’s not how humans get aligned; we have social instincts designed in by evolution, not just plain RL learning. Asked why he thinks so, he says: “Actual answer: Because the entire field of experimental psychology that’s why.”
Having worked in/adjacent to experimental psychology for 20 years or so, I think this is absolutely correct. It’s very clear at this point that we have complex instincts that guide and shape our general learning. I would not expect an RL learner to come away with anything like a human value system if it were exposed to human society. I think this should be taken as a given. We are not blank slates, and there is voluminous direct and indirect evidence to that effect.
EY: “So the unfortunate answer to “How do you get humans again?” is “Rerun something a lot like Earth” which I think we both have moral objections about as something to do to sentients.”
Also right, but the relevant question is how you get something close enough to humans to roughly share our ethical systems. I’m not even sure that’s adequate; humans look a lot more aligned when they don’t possess immense power. But maybe we could get something close enough, and perhaps a bit better, by leaving out some of the dangerous instincts like anger. This is Steve Byrnes’ research agenda. I don’t think we have long enough to pull it off, since my timeline estimates are mostly short. But it could work.
There’s some interesting stuff about the outlines of a mechanism for social instincts. I also endorse EY’s summary of the neuroscience as almost certainly correct.
At 1600, EY refers to (I think) human values as something like trapped priors; I think this is right. An AGI would not be likely to get the same priors trapped in the same way, even if it did have similar instincts for prosocial behavior. I’m not sure a human would retain similar values if we lived even 300 years in varied environments and just kept thinking hard about our values.
The remainder of the post is on acausal trades as a route to human survival (e.g., an AGI saving us in case it meets another AGI that was aligned and will retaliate against a murderous AGI on principle). This is not something I’ve wrapped my head around.
EY says: “Frankly, I mostly consider this to be a “leave it to MIRI, kids” question”
And I’m happy to do that if he says so. But I don’t care to bet the future on a theory that only an insular group even thinks it understands, so I’m hoping we get way more legible and promising alignment plans (and I think we do, which is mostly what I write about).