Bootstrapped Alignment
NB: I doubt any of this is very original. In fact, it’s probably right there in the original Friendly AI writings and I’ve just forgotten where. Nonetheless, I think this is something worth exploring lest we lose sight of it.
Consider the following argument:
Optimization unavoidably leads to Goodharting (as I like to say, Goodhart is robust).
This happens so long as we optimize (make choices) based on an observation, which we must do because that’s just how physics works.
We can at best make Goodhart effects happen slower, say by quantilization or satisficing (a toy illustration of this follows the argument below).
Attempts to build aligned AI that rely on optimizing for alignment will eventually fail: under sufficient optimization pressure, Goodhart effects mean the AI fails to become or to remain aligned.
Thus the only way to build AI that becomes and stays aligned is to not rely on optimization to achieve alignment.
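To make the first premise concrete, here is a minimal toy simulation. It is purely my own illustration, not part of any proposal discussed here: I assume actions whose true value is normally distributed, a proxy score corrupted by heavy-tailed (Cauchy) error with an arbitrary scale of 0.1, and a hypothetical helper selected_true_value that compares picking the proxy-argmax against quantilizing over the top 5% by proxy.

```python
# A toy sketch (my illustration only): actions are scored by a proxy equal to
# the true value we care about plus heavy-tailed measurement error. Under
# light optimization pressure (few candidates) picking the proxy-argmax works
# fine; under heavy pressure the argmax is almost always the action with the
# luckiest error, so its true value regresses toward zero. A quantilizer
# (random choice among the top few percent by proxy) holds up better, though
# it too partly selects on error rather than on true value.
import numpy as np

rng = np.random.default_rng(0)

def selected_true_value(n_actions, quantile=None, trials=200):
    """Average true value of the action chosen from n_actions candidates,
    either by argmax of the proxy (quantile=None) or by quantilizing."""
    results = []
    for _ in range(trials):
        true_value = rng.normal(size=n_actions)                    # what we actually care about
        proxy = true_value + 0.1 * rng.standard_cauchy(n_actions)  # what the optimizer sees
        if quantile is None:
            idx = np.argmax(proxy)                                 # full optimization pressure
        else:
            cutoff = np.quantile(proxy, 1 - quantile)
            idx = rng.choice(np.flatnonzero(proxy >= cutoff))      # random action from top quantile
        results.append(true_value[idx])
    return np.mean(results)

for n in (10, 1_000, 100_000):
    print(f"candidates={n:>7}  argmax of proxy: {selected_true_value(n):+.2f}"
          f"   quantilized (top 5%): {selected_true_value(n, quantile=0.05):+.2f}")
```

The expected pattern is that the argmax’s true value falls back toward the base rate as the candidate pool grows, while the quantilized choice is much less affected, which is the sense in which quantilization slows Goodhart without eliminating it.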
This means that, if you buy this argument, huge swaths of the AI design space are off limits for building aligned AI, and many proposals are doomed to fail. Some examples of such doomed approaches:
HCH
debate
IRL/CIRL
So what options are left?
Don’t build AI
The AI you don’t build is vacuously aligned.
Friendly AI
AI that is aligned with humans right from the start because it was programmed to work that way.
(Yes I know “Friendly AI” is an antiquated term, but I don’t know a better one to distinguish the idea of building AI that’s aligned because it’s programmed that way from other ways we might build aligned AI.)
Bootstrapped alignment
Build AI that is aligned via optimization but that is not powerful enough, or not optimized (Goodharted) hard enough, to cause an existential catastrophe. Then use this “weakly” aligned AI to build Friendly AI.
Not building AI is probably not a realistic option unless industrial civilization collapses. And so far we don’t seem to be making progress on creating Friendly AI. That just leaves bootstrapping to alignment.
If I’m honest, I don’t like it. I’d much rather have the guarantee of Friendly AI. Alas, if we don’t know how to build it, and if we’re in a race against folks who will build unaligned superintelligent AI if aligned AI is not created first, bootstrapping seems the only realistic option we have.
This puts me in a strange place with regard to how I think about things like HCH, debate, IRL, and CIRL. On the one hand, they might be ways to bootstrap to something that’s aligned enough to use to build Friendly AI. On the other, they might overshoot in capabilities, we probably wouldn’t even realize we’d overshot, and then we’d suffer an existential catastrophe.
One way we might avoid this is by being more careful about how we frame attempts to build aligned AI and being clear about whether they are targeting “strong”, perfect alignment like Friendly AI or “weak”, optimization-based alignment like HCH. I think this would help us avoid confusion in a few places:
thinking that work on weak alignment is actually work on strong alignment
forgetting that work on weak alignment, which we meant to use to bootstrap to strong alignment, is not itself a mechanism for strong alignment
thinking we’re not making progress towards strong alignment because we’re only making progress on weak alignment
It also seems like it would clear up some of the debates we fall into around various alignment techniques. Plenty of digital ink has been spilled trying to suss out whether, say, debate would really give us alignment or whether it’s too dangerous to even attempt, and I think a lot of this could have been avoided if we thought of debate as a weak alignment technique we might use to bootstrap strong alignment.
Hopefully this framing is useful. As I say, I don’t think it’s very original, and I think I’ve read a lot of this framing expressed in comments and buried in articles and posts, so hopefully it’s boring rather than controversial. Despite this, I can’t recall it being crisply laid out like above, and I think there’s value in that.
Let me know what you think.