I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.
It is impossible to verify a model’s safety—even given arbitrarily good transparency tools—without access to that model’s training process. For example, you could get a deceptive model that gradient hacks itself in such a way that cryptographically obfuscates its deception.
It is impossible in general to use interpretability tools to select models to have a particular behavioral property. I think this is clear if you just stare at Rice’s theorem enough: checking non-trivial behavioral properties, even with mechanistic access, is in general undecidable. Note, however, that this doesn’t rule out checking a mechanistic property that implies a behavioral property.
That requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author. It’s guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction. The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly—such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn’t write, so didn’t try. I’m not particularly hopeful of this turning out to be true in real life, but I suppose it’s one possible place for a “positive model violation” (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that. I knew I did not actually have the physical stamina to be a star researcher, I tried really really hard to replace myself before my health deteriorated further, and yet here I am writing this. That’s not what surviving worlds look like.
To say that somebody else should have written up this list before is such a ridiculously unfair criticism. This is an assorted list of some thoughts which are relevant to AI alignment—just by the combinatorics of how many such thoughts there are and how many you chose to include in this list, of course nobody will have written up something like it before. Every time anybody writes up any overview of AI safety, they have to make tradeoffs between what they want to include and what they don’t want to include that will inevitably leave some things off and include some things depending on what the author personally believes is most important/relevant to say—ensuring that all such introductions will always inevitably cover somewhat different material. Furthermore, many of these are responses to particular bad alignment plans, of which there are far too many to expect anyone to have previously written up specific responses to.
Nevertheless, I am confident that every core technical idea in this post has been written about before by either me, Paul Christiano, Richard Ngo, or Scott Garrabrant. Certainly, they have been written up in different ways than how Eliezer describes them, but all of the core ideas are there. Let’s go through the list:
(3) This is a common concept, see e.g. Homogeneity vs. heterogeneity in AI takeoff scenarios (“Homogeneity makes the alignment of the first advanced AI system absolutely critical (in a similar way to fast/discontinuous takeoff without the takeoff actually needing to be fast/discontinuous), since whether the first AI is aligned or not is highly likely tano determine/be highly correlated with whether all future AIs built after that point are aligned as well.”).
To spot check the above list, I generated the following three random numbers from 1 − 36 after I wrote the list: 32, 34, 15. Since 34 corresponds to a particular bad plan, I then generated another to replace it: 14. Let’s spot check those three—14, 15, 32—more carefully.
(14) Eliezer claims that “Some problems, like ‘the AGI has an option that (looks to it like) it could successfully kill and replace the programmers to fully optimize over its environment’, seem like their natural order of appearance could be that they first appear only in fully dangerous domains.” In Risks from Learned Optimization in Advanced Machine Learning Systems, we say very directly:
In current AI systems, a small amount of distributional shift between training and deployment need not be problematic: so long as the difference is small enough in the task-relevant areas, the training distribution does not need to perfectly reflect the deployment distribution. However, this may not be the case for a deceptively aligned mesa-optimizer. If a deceptively aligned mesa-optimizer is sufficiently advanced, it may detect very subtle distributional shifts for the purpose of inferring when the threat of modification has ceased.
[...] Some examples of differences that a mesa-optimizer might be able to detect include:
[...]
The presence or absence of good opportunities for the mesa-optimizer to defect against its programmers.
(15) Eliezer says “Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously.” Richard says:
If AI development proceeds very quickly, then our ability to react appropriately will be much lower. In particular, we should be interested in how long it will take for AGIs to proceed from human-level intelligence to superintelligence, which we’ll call the takeoff period. The history of systems like AlphaStar, AlphaGo and OpenAI Five provides some evidence that this takeoff period will be short: after a long development period, each of them was able to improve rapidly from top amateur level to superhuman performance. A similar phenomenon occurred during human evolution, where it only took us a few million years to become much more intelligent than chimpanzees. In our case one of the key factors was scaling up our brain hardware—which, as I have already discussed, will be much easier for AGIs than it was for humans.
While the question of what returns we will get from scaling up hardware and training time is an important one, in the long term the most important question is what returns we should expect from scaling up the intelligence of scientific researchers—because eventually AGIs themselves will be doing the vast majority of research in AI and related fields (in a process I’ve been calling recursive improvement). In particular, within the range of intelligence we’re interested in, will a given increase δ in the intelligence of an AGI increase the intelligence of the best successor that AGI can develop by more than or less than δ? If more, then recursive improvement will eventually speed up the rate of progress in AI research dramatically.
Note: for this one, I originally had the link above point to AGI safety from first principles: Superintelligence specifically, but changed it to point to the whole sequence after I realized during the spot-checking that Richard mostly talks about this in the Control section.
(32) Eliezer says “This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.” In An overview of 11 proposals for building safe advanced AI, I say:
if RL is necessary to do anything powerful and simple language modeling is insufficient, then whether or not language modeling is easier is a moot point. Whether RL is really necessary seems likely to depend on the extent to which it is necessary to explicitly train agents—which is very much an open question. Furthermore, even if agency is required, it could potentially be obtained just by imitating an actor such as a human that already has it rather than training it directly via RL.
and
the training competitiveness of imitative amplification is likely to depend on whether pure imitation can be turned into a rich enough reward signal to facilitate highly sample-efficient learning. In my opinion, it seems likely that human language imitation (where language includes embedded images, videos, etc.) combined with techniques to improve sample efficiency will be competitive at some tasks—namely highly-cognitive tasks such as general-purpose decision-making—but not at others, such as fine motor control. If that’s true, then as long as the primary economic use cases for AGI fall into the highly-cognitive category, imitative amplification should be training competitive. For a more detailed analysis of this question, see “Outer alignment and imitative amplification.”
Sure—that’s easy enough. Just off the top of my head, here’s five safety concerns that I think are important that I don’t think you included:
The fact that there exist functions that are easier to verify than satisfy ensures that adversarial training can never guarantee the absence of deception.
It is impossible to verify a model’s safety—even given arbitrarily good transparency tools—without access to that model’s training process. For example, you could get a deceptive model that gradient hacks itself in such a way that cryptographically obfuscates its deception.
It is impossible in general to use interpretability tools to select models to have a particular behavioral property. I think this is clear if you just stare at Rice’s theorem enough: checking non-trivial behavioral properties, even with mechanistic access, is in general undecidable. Note, however, that this doesn’t rule out checking a mechanistic property that implies a behavioral property.
Any prior you use to incentivize models to behave in a particular way doesn’t necessarily translate to situations where that model itself runs another search over algorithms. For example, the fastest way to search for algorithms isn’t to search for the fastest algorithm.
Even if a model is trained in a myopic way—or even if a model is in fact myopic in the sense that it only optimizes some single-step objective—such a model can still end up deceiving you, e.g. if it cooperates with other versions of itself.