Edit after rereading: The overall take on alignment here may be closer to my own view than I initially thought. The framework for thinking about what we tend to mean by alignment, and about the different routes to success, strikes me as largely true and useful. Some of the paths suggested here seem highly unlikely to work, while others are quite reasonable. I'm out of time to comment in more depth on each of the many takes here, and since Joe doesn't seem to respond to comments, I assume this won't be of use to him; it may be for other readers.
I have read this and your other recent work with interest. It is very well written, even erudite. It is likely to sway some young minds. And it does give me new perspectives, which I value.
I think it’s great that you’re considering the whole problem space here. We don’t do that enough.
Edit, on rereading more carefully: this post is vast. The following is only the beginning of a response.
Having said that, I do think your reconsideration doesn’t adequately build on previous thought. I’m afraid it seems to me that you’re not meeting the traditional alignment view at its strong points. If that’s correct, your erudition creates a risk of confusing a very important issue.
There is a good reason that most existing alignment work treats handing over the future to an aligned ASI as success: we do not trust humans. It is this point that you don't seem to take seriously here.
It’s easy to look at the world and say that humans are doing rather well all in all, thank you very much.
I think you’re technically correct that coexisting with autonomous AGI that’s not fully aligned is possible, and that coexisting with servant AI long-term is also possible.
The arguments have always been that both of those scenarios are highly unlikely to be long-term stable. My recent post If we solve alignment, do we die anyway? tries to spell out why humans remaining in control of AGI is untenable in the long or even medium term. Similar arguments apply to semi-aligned AGI. In both cases the problem is this: when players can amplify their own intelligence and production capacity, and conceal their actions, the most vicious player wins. Changing that scenario requires drastic measures you don’t discuss. Keep playing long enough without draconian safeguards, and you’re guaranteed to get a very vicious player. They’ll attack, win, and control the future, at which point we’d better hope they’re merely selfish and not sadistic.
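To make that dynamic concrete, here is a minimal toy model. Everything in it is my own illustrative assumption rather than anything from the post: the lead in self-amplification changes hands some number of times, concealment prevents anyone from preempting the current leader, and a vicious leader attacks and locks in control permanently.

```python
import random

def simulate(turnovers=200, p_vicious=0.05, seed=0):
    """One run of an illustrative takeover model (all parameters assumed).

    Each time the self-amplification lead changes hands, the new leader
    is vicious with probability p_vicious. Concealment means no coalition
    can preempt a leader, so a vicious leader attacks and wins outright.
    """
    rng = random.Random(seed)
    for _ in range(turnovers):
        if rng.random() < p_vicious:
            return True  # a vicious player takes over permanently
    return False  # no takeover within this horizon

runs = 10_000
takeovers = sum(simulate(seed=s) for s in range(runs))
print(f"takeover in {takeovers / runs:.1%} of runs")
# Analytically: P(no takeover) = 0.95 ** 200 ≈ 0.0035%, so a vicious
# winner is nearly certain if play continues long enough.
```

The specific numbers don't matter; the point is that the probability of avoiding a vicious winner decays geometrically with the number of chances one gets.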
I apologize for stating it so bluntly, but it looks to me like you’re anthropomorphizing AGI through a very optimistic lens, and encouraging others to do the same. And this is coming from someone who co-authored a paper titled Anthropomorphic reasoning about neuromorphic AGI safety. I respect you as a thinker on AGI; it’s an extremely complicated topic.
Speaking as a psychologist and neuroscientist, I think it’s important to recognize that we can’t rely on anthropomorphic reasoning about alignment, in part because many humans aren’t aligned or safe. Sociopaths (at least some subset) are more concerned with an injury to their own little finger than with millions of deaths that won’t affect them directly.
AGI will be sociopathic by default. Evolution has created very specific mechanisms that make most humans tend toward empathy, and therefore toward valuable teamwork.
Those mechanisms seem to be turned down in sociopaths, and AGIs will lack them by default. (It’s possible I have this backward: empathy may be the default, and sociopaths may have extra mechanisms that turn it down or off. Even so, empathy would be the product of specific brain computational schemes, and an AGI may well have none of those, or may choose to disable them.) If we try to make AGI that is pro-social, getting that right is not trivial, though you seem to assume here that it is. Technical alignment is arguably the most important piece of the problem, and inarguably an important one.
Or you might assume that we all more or less get along by default. That is roughly true of humans, who are stuck with limited minds and bodies that roughly match everyone else’s. The logic changes drastically when each being can enhance or duplicate itself without limit: if I need no allies, the smart move is to rely on no one but myself.
And humans have done very well so far, but that does not indicate that we are a good choice to control the future. There is a nonzero chance of nuclear annihilation every year, perhaps as high as 1%. The fact that we’re doing the best we ever have is not a good enough reason to think we’ll continue to do great into the far future.
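The same compounding logic as in the sketch above applies to that annual risk figure. A quick calculation, using the ~1%/year number mentioned above plus a more optimistic 0.1%/year for comparison (both illustrative):

```python
# Survival over N years at constant annual risk p is (1 - p) ** N.
for p_annual in (0.001, 0.01):
    for years in (100, 1000):
        p_doom = 1 - (1 - p_annual) ** years
        print(f"{p_annual:.1%}/yr over {years:>4} yrs -> "
              f"{p_doom:.0%} cumulative chance of catastrophe")
```

Even at 1% per year, the odds of getting through a single century are only about one in three.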
That’s why building a being better than us and giving it control sounds like the least-bad option.
The post I linked above, along with other work, lays out a route to get there, past the long List of Lethalities. We first build personal-intent-aligned AGI in the hands of a non-sociopathic human, who wisely leverages it to limit AGI proliferation. Then we enjoy a long reflection and decide how to align the sovereign AGI we build. The future is finally safe from sociopathic or otherwise malign humans.
Edit:
I have more responses to your other points. I agree with many, and disagree with many. There are a lot of claims and implications here.
I agree that a corrigible, loyal-servant AI is the likely path to useful, safe ASI. I disagree that avoiding takeover is a workable long-term solution. I don’t think an ASI with an “aversion” to power-seeking or murder is a reasonable goal, for the classic reasons: humans may be motivated by arbitrary aversions, but that’s because we’re deeply incoherent. We can’t expect a superintelligence to behave the same way unless it’s carefully engineered to do so, and unless we’re quite sure those aversions will remain stable as it advances to ASI.