Lauro Langosco
Thinking about alignment-relevant thresholds in AGI capabilities. A kind of rambly list of relevant thresholds:
1. Ability to be deceptively aligned
2. Ability to think / reflect about its goals enough that the model realises it does not like what it is being RLHF’d for
3. Incentives to break containment exist in a way that is accessible / understandable to the model
4. Ability to break containment
5. Ability to robustly understand human intent
6. Situational awareness
7. Coherence / robustly pursuing its goal in a diverse set of circumstances
8. Interpretability methods break (or other oversight methods break)
   - This doesn’t have to be because of deceptiveness; maybe the model’s thoughts are just too complicated at some point, or in a different place than you’d expect
9. Capable enough to help us exit the acute risk period
Many alignment proposals rely on reaching these thresholds in a specific order. For example, the earlier we reach (9) relative to other thresholds, the easier most alignment proposals are.
Some of these thresholds are relevant to whether an AI or proto-AGI is alignable even in principle. Short of ‘full alignment’ (CEV-style), any alignment method (eg corrigibility) only works within a specific range of capabilities:
Too much capability breaks alignment, eg bc a model self-reflects and sees all the ways in which its objectives conflict with human goals.
Too little capability (or too little ‘coherence’) and any alignment method will be non-robust wrt OOD inputs or even small improvements in capability or self-reflectiveness.
Lauro Langosco’s Shortform
Yeah, I don’t think the arguments in this post on their own should convince you that P(doom) is high if you’re skeptical. There’s lots to say here that doesn’t fit into the post, eg an object-level argument for why AI alignment is “default-failure” / “disjunctive”.
Thanks for link-posting this! I’d find it useful to have the TLDR at the beginning of the post, rather than at the end (that would also make the last paragraph easier to understand). You did link the TLDR at the beginning, but I still managed to miss it on the first read-through, so I think it would be worth it.
Also: consider crossposting to the alignmentforum.
Edit: also, the author is Eliezer Yudkowsky. Would be good to mention that in the intro.
I like that mini-game! Thanks for the reference
like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there’s a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you’re forced into exotic and unlikely training data, and you win if i’m either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.
FWIW I would love to see the result of you two actually playing a few rounds of this game.
An Exercise to Build Intuitions on AGI Risk
It’s unclear to me what it would even mean to get a prediction without a “model”. Not sure if you meant to imply that, but I’m not claiming that it makes sense to view AI safety as default-failure in the absence of a model (ie in the absence of details & reasons to think AI risk is default-failure).
Uncertainty about the future does not imply that AGI will go well
More generally, suppose that the agent acts in accordance with the following policy in all decision-situations: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ That policy makes the agent immune to all possible money-pumps for Completeness.
Am I missing something or does this agent satisfy Completeness anytime it faces a decision for the second time?
Newtonian gravity states that objects are attracted to each other in proportion to their mass. A webcam video of two apples falling will show two objects, of slightly differing masses, accelerating at the exact same rate in the same direction, and not towards each other. When you don’t know about the earth or the mechanics of the solar system, this observation points against Newtonian gravity. [...] But it requires postulating the existence of an unseen object offscreen that is 25 orders of magnitude more massive than anything it can see, with a center of mass that is roughly 6 or 7 orders of magnitude farther away than anything it can see in it’s field of view.
IMO this isn’t that implausible. A superintelligence (and in fact humans too) will imagine a universe that is larger than what’s inside the frame of the image. Once you come up with the idea of an attractive force between masses, it’s not crazy to deduce the existence of planets.
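For what it’s worth, the quoted orders of magnitude roughly check out (assuming an apple of about 0.1 kg sitting roughly a meter from the camera):

$$\frac{M_\oplus}{m_\text{apple}} \approx \frac{6 \times 10^{24}\,\mathrm{kg}}{0.1\,\mathrm{kg}} = 6 \times 10^{25},$$

i.e. about 25 to 26 orders of magnitude, and

$$\frac{R_\oplus}{d_\text{apple}} \approx \frac{6.4 \times 10^{6}\,\mathrm{m}}{1\,\mathrm{m}} \approx 10^{6.8},$$

i.e. about 6 to 7 orders of magnitude.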
I would not call 1) an instance of goal misgeneralization. Goal misgeneralization only occurs if the model does badly at the training objective. If you reward an RL agent for making humans happy and it goes on to make humans happy in unintended ways like putting them into heroin cells, the RL agent is doing fine on the training objective. I’d call 1) an instance of misspecification and 2) an instance of misgeneralization.
(AFAICT The Alignment Problem from a DL Perspective uses the term in the same way I do, but I’d have to reread more carefully to make sure).
I agree with much of the rest of this post, eg the paragraphs beginning with “The solutions to these two problems are pretty different.”
Here’s our definition in the RL setting for reference (from https://arxiv.org/abs/2105.14111):
A deep RL agent is trained to maximize a reward $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, where $\mathcal{S}$ and $\mathcal{A}$ are the sets of all valid states and actions, respectively. Assume that the agent is deployed out-of-distribution; that is, an aspect of the environment (and therefore the distribution of observations) changes at test time. **Goal misgeneralization** occurs if the agent now achieves low reward in the new environment because it continues to act capably yet appears to optimize a different reward $R' \neq R$. We call $R$ the **intended objective** and $R'$ the **behavioral objective** of the agent.
FWIW I think this definition is flawed in many ways (for example, the type signature of the agent’s inner goal is different from that of the reward function, bc the agent might have an inner world model that extends beyond the RL environment’s state space; and also it’s generally sketchy to extend the reward function beyond the training distribution), but I don’t know of a different definition that doesn’t have similarly-sized flaws.
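To make the definition concrete, here’s a minimal toy sketch (a made-up 1-D gridworld, not the actual experiments from the paper) of how an agent can keep scoring well on a behavioral objective $R'$ while scoring low on the intended objective $R$ once the environment shifts:

```python
# Toy illustration of goal misgeneralization (hypothetical 1-D gridworld,
# not the setup from the paper).
# Intended reward R: +1 for ending the episode on the coin.
# Behavioral objective R': +1 for ending the episode at the rightmost cell.
# During training the coin is always at the rightmost cell, so the two
# objectives are indistinguishable; at test time the coin moves, and a
# policy that learned "always go right" still acts capably but scores low on R.

GRID_SIZE = 10

def run_episode(policy, coin_pos, steps=20):
    """Roll out a policy (pos, coin_pos -> step in {-1, +1}); return the final position."""
    pos = 0
    for _ in range(steps):
        pos = min(max(pos + policy(pos, coin_pos), 0), GRID_SIZE - 1)
    return pos

def intended_reward(final_pos, coin_pos):    # R: reach the coin
    return 1.0 if final_pos == coin_pos else 0.0

def behavioral_reward(final_pos, coin_pos):  # R': reach the rightmost cell
    return 1.0 if final_pos == GRID_SIZE - 1 else 0.0

def go_right(pos, coin_pos):
    """A policy that learned the proxy 'always go right' during training."""
    return +1

# Training distribution: coin at the rightmost cell -> R and R' agree (both 1).
train_pos = run_episode(go_right, coin_pos=GRID_SIZE - 1)
print("train:", intended_reward(train_pos, GRID_SIZE - 1), behavioral_reward(train_pos, GRID_SIZE - 1))

# Test distribution: coin moved -> R drops to 0 while R' stays at 1.
test_pos = run_episode(go_right, coin_pos=3)
print("test: ", intended_reward(test_pos, 3), behavioral_reward(test_pos, 3))
```

(This has the same shape as the CoinRun example in the paper: during training the coin is always at the end of the level, so “get the coin” and “go right” are behaviorally indistinguishable until deployment.)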
It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I’m missing some of the problems.
(Analogy: say I’m working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I’m actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).
I agree with what I read as the main direct claim of this post, which is that it is often worth avoiding very confident-sounding claims, because they make it likely that people will misinterpret you or derail the conversation towards meta-level discussions about justified confidence.
However, I disagree with the implicit claim that people who confidently predict AI X-risk necessarily have low model uncertainty. For example, I find it hard to predict when and how AGI is developed, and I expect that many of my ideas and predictions about that will be mistaken. This makes me more pessimistic, rather than less, since it seems pretty hard to get AI alignment right if we can’t even predict basic things like “when will this system have situational awareness”, etc.
To briefly hop in and say something that may be useful: I had a reaction pretty similar to what Eliezer commented, and I don’t see continuity or “Things will be weird before getting extremely weird” as a crux. (I don’t know why you think he does, and don’t know what he thinks, but would guess he doesn’t think it’s a crux either)
Thanks for doing this! I think this could be valuable. What’s your current plan for developing this further / onboarding collaborators?
Some observations / thoughts from interacting with the QA system for a few minutes:
The current version has trouble answering basic AI safety questions. For example:
“Why do people think AI is an existential risk?”

People think AI is an existential risk because of the possibility of a superintelligent AI system with recursive self-improvement capabilities, which could lead to catastrophic consequences like turning humans into paper clips, as highlighted in [1]. However, the perception of AI as an existential risk is not universal, and some people may discount it as a problem that is too vague or unlikely to outweigh other issues facing the world, as discussed in [1]. Nonetheless, technical AI safety research, defined as quality research aimed at long-term outcomes that are reliably beneficial, is necessary to mitigate the risks of AI [2]. Additionally, some argue that rapid increases in human power could be achieved without increasing existential risks, including through skillful AI development, but it is important to prioritize technical AI safety research to avoid catastrophic consequences [3].
The search could be much improved by curating the dataset. An easy way to do this is to exclude posts with low upvote counts. Though eventually you’ll want to do more, eg by being opinionated about what to include.
It might be worth having a chatbot that just talks people through the “extended bad alignment take bingo”, that is all the reasons why the easy solutions people like to come up with don’t work. Here you could just exclude all proposals for actual alignment solutions from the dataset (and you can avoid having to make calls about what agendas have promise vs. which ones are actually nonsensical)
It would be very useful to have a feedback function where people can mark wrong answers. If we want to make this good, we’ll need to red-team the model and make sure it answers all the basic questions correctly, probably by curating a Question-Answer dataset
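On the last point, a rough sketch of what a curated entry with feedback fields might look like (purely illustrative; the field names are made up, not from the project):

```python
# Hypothetical schema for a curated question-answer dataset with user feedback
# (illustrative only; not the project's actual data format).
curated_qa = [
    {
        "question": "Why do people think AI is an existential risk?",
        "reference_answer": "A human-written canonical answer to check model output against.",
        "sources": ["https://example.com/curated-post"],  # curated posts the answer should draw on
        "feedback": [
            # entries added when a user marks a generated answer as wrong
            {"model_answer_id": "abc123", "marked_wrong": True, "comment": "Misses the core argument."},
        ],
    },
]
```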
This seems wrong. Here’s an incomplete list of reasons why:
If the 3 leading labs join the moratorium and AGI is stealthily developed by the 4th, then the arrival of AGI will in fact have been slowed by the lead time of the first 3 labs + the slowdown that the 4th incurs by working in secret.
The point of this particular call for a 6-month moratorium is not to particularly slow down anyone (and as has been pointed out by others, it is possible that OpenAI wasn’t even planning to start training GPT-5 in the next few months). Rather, the point is to form a coalition to support future policies, e.g. a government-supported moratorium.
It is actually fairly hard to build compute clusters in secret, because you can just track what comes out of the chip fabs and where it goes
While not straightforward, it’s also feasible to monitor existing clusters, see e.g. https://arxiv.org/abs/2303.11341
Yeah we’re on the same page here, thanks for checking!
For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?
I feel pretty uncertain about all the factors here. One reason I overall still lean towards the ‘definitely not’ stance is that building a toddler AGI that is alignable in principle is only one of multiple steps that need to go right for us to get a reflectively-stable docile AGI; in particular we still need to solve the problem of actually aligning the toddler AGI. (Another step is getting labs to even seriously attempt to box it and align it, which maybe is an out-of-scope consideration here but it does make me more pessimistic).
For another thing, if we have, umm, “toddler AGI” that’s too unsophisticated to have good situational awareness, coherence, etc., then I would think that the boxing / containment problem is a lot easier than we normally think about, right? We’re not talking about hardening against a superintelligent adversary.
I agree we’re not talking about a superintelligent adversary, and I agree that boxing is doable for some forms of toddler AGI. I do think you need coherence; if the toddler AGI is incoherent, then any “aligned” behavioral properties it has will also be incoherent, and something unpredictable (and so probably bad) will happen when the AGI becomes more capable or more coherent. (Flagging that I’m not sure “coherent” is the right way to talk about this… wish I had a more precise concept here.)
We can use non-reflectively-endorsed desires to help tide us over until the toddler AGI develops enough reflectivity to form any reflectively-endorsed desires at all.
I agree a non-reflective toddler AGI is in many ways easier to deal with. I think we will have problems at the threshold where the tAGI is first able to reflect on its goals and realizes that the RLHF-instilled desires aren’t going to imply docile behavior. (If we can speculate about how a superintelligence might extrapolate a set of trained-in desires and realize that this process doesn’t lead to a good outcome, then the tAGI can reason the same way about its own desires).
(I agree that if we can get aligned desires that are stable under reflection, then maybe the ‘use non-endorsed desires to tide us over’ plan could work. Though even then you need to somehow manage to prevent the tAGI from reflecting on its desires until you get the desires to a point where they stay aligned under reflection, and I have no idea how you would do something like that—we currently just don’t have that level of fine control over capabilities).
The basic problem here is the double-bind where we need the toddler AGI to be coherent, reflective, capable of understanding human intent (etc) in order for it to be robustly alignable at all, even though those are exactly the incredibly dangerous properties that we really want to stay away from. My guess is that the reason Nate’s story doesn’t hypothesize a reflectively-endorsed desire to be nondeceptive is that reflectively-stable aligned desires are really hard / dangerous to get, and so it seems better / at least not obviously worse to go for eliezer-corrigibility instead.
Some other difficulties that I see:
The ‘capability profile’ (ie the relative levels of the toddler AGI’s capabilities) is going to be weird / very different from that of humans; that is, once the AGI has human-level coherence and human-level understanding of human intent, it has far-superhuman capabilities in other domains. (Though hopefully we’re at least careful enough to remove code from the training data, etc).
A coherent agentic AI at GPT-4 level capabilities could plausibly already be deceptively aligned, if it had sufficient situational awareness, and our toddler AGI is much more dangerous than that.
All of my reasoning here is kind of based on fuzzy confused concepts like ‘coherence’ and ‘capability to self-reflect’, and I kind of feel like this should make me more pessimistic rather than more optimistic about the plan.
Yeah that seems reasonable! (Personally I’d prefer a single break between sentence 3 and 4)
IMO neither making the field of alignment 10x larger nor evals would solve a big part of the problem, while indefinitely pausing AI development would. I agree it’s much harder, but I think it’s good to at least try, as long as it doesn’t terribly hurt less ambitious efforts (which I think it doesn’t).