Bootstrapped Alignment
NB: I doubt any of this is very original. In fact, it’s probably right there in the original Friendly AI writings and I’ve just forgotten where. Nonetheless, I think this is something worth exploring lest we lose sight of it.
Consider the following argument:
Optimization unavoidably leads to Goodharting (as I like to say, Goodhart is robust)
This happens so long as we optimize (make choices) based on an observation, which we must do because that's just how physics works.
We can at best make Goodhart effects happen more slowly, say by quantilization or satisficing.
Attempts to build aligned AI that rely on optimizing for alignment will, under sufficient optimization pressure, eventually fail to become or remain aligned due to Goodhart effects.
Thus the only way to build AI that becomes and stays aligned is to not rely on optimization to achieve alignment.
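As a toy illustration of why hard optimization of a proxy Goodharts while quantilization only slows the effect, here is a hypothetical sketch. The utility functions, the "loophole" region, and all the numbers are invented for intuition; nothing here is a claim about real systems:

```python
import random

random.seed(0)

# Toy model: a true utility, and a proxy that agrees with it everywhere
# except on a narrow "loophole" region where the measurement breaks down.

def true_utility(x):
    return -abs(x)  # what we actually want: x near 0

def proxy(x):
    if 9.0 <= x <= 9.1:
        return 100.0  # scores wonderfully on the proxy, terribly in reality
    return true_utility(x)

actions = [random.uniform(-10, 10) for _ in range(10_000)]

# Hard optimization: argmax the proxy. This reliably finds the loophole,
# i.e. it Goodharts.
hard = max(actions, key=proxy)

# Quantilization: pick uniformly at random from the top 10% of actions by
# proxy score. Most of that quantile is still genuinely good, so the
# Goodhart effect is slowed rather than eliminated: the loophole actions
# are still in there and get sampled occasionally.
top = sorted(actions, key=proxy, reverse=True)[:1000]
quantilized = random.choice(top)

print(true_utility(hard))                      # around -9: Goodharted
print(sum(map(true_utility, top)) / len(top))  # around -1: much better
```

Note the quantilizer still fails some fraction of the time; it buys slowness, not safety.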
This means that, if you buy this argument, huge swaths of AI design space are off limits for building aligned AI, and many proposals are, by this argument, doomed to fail. Some examples of such doomed approaches:
HCH
debate
IRL/CIRL
So what options are left?
Don’t build AI
The AI you don’t build is vacuously aligned.
AI that is aligned with humans right from the start because it was programmed to work that way.
(Yes I know “Friendly AI” is an antiquated term, but I don’t know a better one to distinguish the idea of building AI that’s aligned because it’s programmed that way from other ways we might build aligned AI.)
Bootstrapped alignment
Build AI that is aligned via optimization that is not powerful enough or optimized (Goodharted) hard enough to cause existential catastrophe. Use this “weakly” aligned AI to build Friendly AI.
Not building AI is probably not a realistic option unless industrial civilization collapses. And so far we don’t seem to be making progress on creating Friendly AI. That just leaves bootstrapping to alignment.
If I’m honest, I don’t like it. I’d much rather have the guarantee of Friendly AI. Alas, if we don’t know how to build it, and if we’re in a race against folks who will build unaligned superintelligent AI if aligned AI is not created first, bootstrapping seems the only realistic option we have.
This puts me in a strange place with regard to how I think about things like HCH, debate, IRL, and CIRL. On the one hand, they might be ways to bootstrap to something that's aligned enough to use to build Friendly AI. On the other, they might overshoot in terms of capabilities, we probably wouldn't even realize we'd overshot, and then we'd suffer an existential catastrophe.
One way we might avoid this is by being more careful about how we frame attempts to build aligned AI and being clear if they are targeting “strong”, perfect alignment like Friendly AI or “weak”, optimization-based alignment like HCH. I think this would help us avoid confusion in a few places:
thinking work on weak alignment is actually work on strong alignment
forgetting that weak alignment work meant to bootstrap to strong alignment is not itself a mechanism for strong alignment
thinking we’re not making progress towards strong alignment because we’re only making progress on weak alignment
It also seems like it would clear up some of the debates we fall into around various alignment techniques. Plenty of digital ink has been spilled trying to suss out whether, say, debate would really give us alignment or whether it's too dangerous to even attempt, and I think a lot of this could have been avoided if we thought of debate as a weak alignment technique we might use to bootstrap strong alignment.
Hopefully this framing is useful. As I say, I don’t think it’s very original, and I think I’ve read a lot of this framing expressed in comments and buried in articles and posts, so hopefully it’s boring rather than controversial. Despite this, I can’t recall it being crisply laid out like above, and I think there’s value in that.
Let me know what you think.
Reminds me of a quote from this Paul Christiano post: “It’s a solution built to last (at most) until all contemporary thinking about AI has been thoroughly obsoleted...I don’t think there is a strong case for thinking much further ahead than that.”
Yes, it also reminded me of Christiano's approach of amplification and distillation.
Thanks both! I definitely had the idea that Paul had mentioned something similar somewhere but hadn't made it a top-level concept. I think there are similar echoes in how Eliezer talked about seed AI in the early Friendly AI work.
I’m confused, I don’t know what you mean by ‘Friendly AI’. If I take my best guess for that term, I fail to see how it does not rely on optimization to stay aligned.
I take ‘Friendly AI’ to be either:
An AI that has the right utility function from the start. (In my understanding, that used to be its usage.) As you point out, because of Goodhart’s Law, such an AI is an impossible object.
A mostly-aligned AI, that is designed to be corrigible. Humans can intervene to change its utility function or shut it down as needed to prevent it from taking bad actions. Ideally, it would consult human supervisors before taking a potentially bad action.
In the second case, humans are continuously optimizing the “utility function” to be closer to the true one. Or, modifying the utility function to make “shut down” the preferred action, whenever the explicit utility function presents a ‘misaligned’ preferred outcome. Thus, it also represents an optimization-based weak alignment method.
Would you argue that my second definition is also an impossible object, because it also relies on optimization?
I think part of my confusion comes from the very fuzzy definition of “optimization”. How close, and how fast, do you have to get to the maximum possible value of some function U(s) to be said to optimize it? Or is this the entirely wrong framework altogether? There’s no need to answer these now, I’m mostly curious about a clarification for ‘Friendly AI’.
“Friendly AI” is a technical term from the past that has mostly been replaced by “aligned AI” today. However, I’m using it here to refer to aligned AI conforming to an aspect of the original proposal for Friendly AI, which is that it be designed to be aligned, say in a mathematically provable way, rather than as an engineered process that approaches alignment by approximation.
It’s still the case that humans are choosing what criteria make a Friendly AI aligned and thus there is some risk of missing the objective of aligned AI, but this avoids Goodharting because there’s no optimization being applied. Of course, it could always slip back in depending on the process used to come up with the criteria a Friendly AI would be built to provably have, thus making the challenge of building one quite hard!
As to your second set of questions that seem to hinge on what I mean by optimization, I just mean choosing one thing over another to try to make the world look one way rather than another. If that still seems vague, it's because optimization is a very common process that basically just requires a feedback loop and a signal (reward functions are a very complex type of signal).
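To make that concrete, here is a minimal sketch of optimization in exactly this sense: a feedback loop that perturbs a state, observes a signal, and keeps whatever scores higher. The particular signal (closeness to 5) is an invented example:

```python
import random

random.seed(1)

def optimize(signal, state, steps=1000, step_size=0.1):
    """A bare-bones feedback loop: perturb, observe the signal, keep
    whatever scores higher. Nothing else about the world is needed."""
    for _ in range(steps):
        candidate = state + random.uniform(-step_size, step_size)
        if signal(candidate) > signal(state):  # choose based on an observation
            state = candidate
    return state

# The signal here is "closeness to 5"; the loop knows only the signal,
# yet it steadily pushes the world toward states the signal favors.
result = optimize(lambda x: -abs(x - 5.0), 0.0)
```

Anything with this loop-plus-signal shape counts as optimization in the sense above, however simple.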
I think I understand that now, thank you!
I’m confused again here. Is this implying that a Friendly AI, per the definition above, is not an optimizer?
I am very pessimistic about being able to align an AI without any sort of feedback loop on the reward (thus, without optimization). The world's overall transition dynamics are likely to be chaotic, so the "initial state" of an AI that is provably aligned without feedback would need to be exactly the right one to obtain the outcome we want. It could be that the chaos does not affect what we care about, but I'm unsure about that; even very simple systems can be chaotic.
It is not an endeavour as clearly impossible as “build an open-loop controller for this dynamical system”, but I think it’s similar.
No. It's saying the process by which a Friendly AI is designed is not an optimizer (although see my caveats in the previous reply about choosing alignment criteria; it's still technically optimization, but constrained as much as possible to eliminate the normal Goodharting mechanism). The AI itself pretty much has to be an optimizer to do anything useful.
I'm similarly pessimistic, as it seems quite a hard problem and after 20 years we still don't really know how to start (or so I think; maybe MIRI folks feel differently and think we have made some real progress here). Hence bootstrapping to alignment may be the best alternative, given that I think totally abandoning the Friendly AI strategy is also a bad choice.
What happens if we revise or optimize our metrics?
Sufficient optimization pressure from the AI? Or are there risks associated from a) our mitigation efforts, like reducing optimization decreases friendliness ‘because of Goodhart’s Law’, or b) the more we try to make an AI friendly/not optimize/etc. the more risks there are from that optimization process?
Planned summary for the Alignment Newsletter:
Looks good to me! Thanks for planning to include this in the AN!
I’m still holding out hope for jumping straight to FAI :P Honestly I’d probably feel safer switching on a “big human” than a general CIRL agent that models humans as Boltzmann-rational.
Though on the other hand, does modern ML research already count as trying to use UFAI to learn how to build FAI?
Seems like it probably does, but only incidentally.
I instead tend to view ML research as the background over which alignment work is now progressing. That is, we’re in a race against capabilities research that we have little power to stop, so our best bets are either that it turns out capabilities are about to hit the upper inflection point of an S-curve, buying us some time, or that the capabilities can be safely turned to helping us solve alignment.
I do think there’s something interesting about a direction not considered in this post related to intelligence enhancement of humans and human emulations (ems) as a means to working on alignment, but I think realistically current projections of AI capability timelines suggest they’re unlikely to have much opportunity for impact.