Getting from an unaligned AGI to an aligned AGI?
Summary / Preamble

In AGI Ruin: A List of Lethalities, Eliezer writes “A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure.”
From the AGI-system we may (directly or indirectly) obtain programs that are interpretable and verifiable. These specialized programs could give us new capabilities, and we may trust these capabilities to be aligned and safe (even if we don’t trust the AGI to be so). We may use these capabilities to help us with verification, widening the scope of programs we are able to verify (and maybe helping us to make the AGI-system safer to interact with). This could perhaps be a positive feedback-loop of sorts, where we get more and more aligned capabilities, and the AGI-system becomes safer and safer to interact with. The reasons for exploring these kinds of strategies are two-fold:
The strategy as a whole involves many iterative and contingency-dependent steps working together. I don’t claim to have a 100% watertight and crystallized plan that would get us from A to B. Maybe some readers could be inspired to build upon some of the ideas or analyze them more comprehensively. Are any of the ideas in this series new? See here for a discussion of that.
Me: I have some ideas about how to make use of an unaligned AGI-system to make an aligned AGI-system.
Imaginary friend: My system 1 is predicting that a lot of confused and misguided ideas are about to come out of your mouth.
Me: I guess we’ll see. Maybe I’m missing the mark somehow. But do hear me out.
Imaginary friend: Ok.
Me: First off, do we agree that a superintelligence would be able to understand what you want when you ask it for something, presuming that it is given enough information?
Imaginary friend: Well, kind of. Often there isn’t really a clear answer to what you want.
Me: Sure. But it would probably be good at predicting what looks to me like good answers. Even if it isn’t properly aligned, it would probably be extremely good at pretending to give me what I want. Right?
Imaginary friend: Agreed.
Me: So if I said to it “show me the best source code you can come up with for an aligned AGI-system, and write the code in such a way that it’s as easy as possible to verify that it works as it should”, then what it gave me would look really helpful—with no easy way for me to see a difference between what I’m provided and what I would be provided if it was aligned. Right?
Imaginary friend: I guess I sort of agree. Like, if it answered your request it would probably look really convincing. But maybe it doesn’t answer your question. It could find a security vulnerability in the OS, and hack itself onto the internet somehow—that would be game over before you even got to ask it any questions. Or maybe you didn’t even try to box it in the first place, since you didn’t realize how capable your AI-system was getting, and it was hiding its capabilities from you.
Or maybe it socially manipulated you in some really clever way, or “hacked” your neural circuitry somehow through sensory input, or figured out some way it could affect the physical world from within the digital realm (e.g. generating radio waves by “thinking”, or something we don’t even know is physically possible).
When we are dealing with a system that may prefer to destroy us (for instrumentally convergent reasons), and that system may be orders of magnitude smarter than ourselves—well, it’s better to be too careful than not paranoid enough.
Me: I agree with all that. But it’s hard to cover all the branches of things that should be considered in one conversation-path. So for the time being, let’s assume a hypothetical situation where the AI is “boxed” in. And let’s assume that we know it’s extremely capable, and that it can’t “hack” itself out of the box in some direct way (like exploiting a security flaw in the operating system). Ok?
Imaginary friend: Ok.
Me: I presume you agree that there are more and less safe ways to use a superintelligent AGI-system? To give an exaggerated example: There is a big difference between “letting it onto the internet” and “having it boxed in, only giving it multiplication questions, and only letting it answer yes or no”.
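To make that exaggerated example concrete, here is roughly what such a maximally restricted channel could look like. This is only an illustrative sketch: `oracle` stands in for some hypothetical interface to the boxed system, and is not a real API.

```python
def constrained_query(oracle, a: int, b: int, claimed_product: int) -> bool:
    """Ask only 'is a*b == claimed_product?' and accept only 'yes'/'no'.

    `oracle` is a hypothetical callable representing the boxed system.
    The channel enforces a two-symbol alphabet, and every claim on it is
    independently checkable, so a false answer is caught immediately.
    """
    reply = oracle(f"Is {a} * {b} = {claimed_product}? Answer yes or no.")
    if reply not in ("yes", "no"):
        raise ValueError("Channel violation: reply outside allowed alphabet")
    answer = (reply == "yes")
    # Independent verification: multiplication is cheap for us to check.
    if answer != (a * b == claimed_product):
        raise ValueError("Oracle answered incorrectly on a verifiable claim")
    return answer
```

The point of the exaggeration is that the safety here comes from how little the channel can express, and from every permitted message being verifiable by us.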
Imaginary friend: Obviously. But even if you only give it multiplication-questions, some other team will sooner or later develop AGI and be less scrupulous.
Me: Sure. But still, we agree that there are more and less safe ways to try to use an AGI? There is a “scale” of sorts?
Imaginary friend: Of course.
Me: Would you also agree that there is a “scale” for how hard it is for an oracle/genie to “trick” you into falsely believing that it has provided you with what you want? For example, if I ask it to prove a mathematical conjecture, it is much harder (compared to most tasks) for it to “pretend” to have done what I want without actually doing it?
Imaginary friend: Sure.
Me: What I want to talk about are ways of asking an AGI genie/oracle for things in ways where it’s hard for it to “pretend” that it’s giving us what we want without doing it. And ways we might leverage that to eventually end up with an aligned AGI-system, while trying to keep the total risk (of all the steps we take) low.
Imaginary friend: My system 1 suspects I am about to hear some half-baked ideas.
Me: And your system 1 may have a point. I don’t claim to have detailed and watertight arguments showing exactly how we get from A to B. What I have is an outline of how we might get there, while minimizing risk along the way (not to 0%, but to way less than 50% if suggestions are executed faithfully).
Believe me, I don’t have full knowledge and understanding of everything that has been explored in the AI alignment community…
Imaginary friend: I believe you.
Me: …but it seems to me that the kinds of techniques and strategies I’ll be outlining are under-discussed.
Imaginary friend: It sounds like you are looking for ways to avoid having to solve the alignment problem, so that you can focus on other things instead of solving it beforehand.
Me: I guess it can be interpreted that way. I don’t want it to come across that way though! If we find reliable methods to align superintelligent AGIs before we get superintelligent AGIs, then that would improve our odds of success!
But maybe we don’t. If we don’t, then it’s better to have some plans ready, so that less improvisation is necessary. AFAIK there is no fire alarm for AGI, and it certainly doesn’t seem too early to start analyzing these kinds of scenarios in a more comprehensive way.
And also, even if we think we have solved alignment in a reliable way prior to superintelligence, some of the strategies and principles I’ll discuss can be used as additional layers of alignment-assurance. Hence me thinking that it may be a good idea to develop these ideas further, and to encourage them as best practice.
Restrictions in expressivity of AI
Me: The first part of this discussion will focus on things that we can do while the AGI stays in the digital realm. We will get to the question of how to do complicated things in the non-digital realm (and on the internet), but that’s for later.
Imaginary friend: You understand that any use of an AGI enables it to affect the non-digital realm, right? If you let it write or modify code, that is a way for the AGI to communicate with humans, and communicating with humans is a way to affect the non-digital realm.
Me: Yes, I agree with that. But I would point out that it is up to us how much “expressivity” we give the AI-system. For example, if we get help from it to point out potential mistakes in some piece of code, maybe we don’t let it write text to explain itself (limiting instead to things such as pointing out areas of code to take an extra look at, highlighting examples of input that may produce unintended behavior, etc). And if we do let it write text, then it is up to us how we limit the expressivity of what it can write in text.
Imaginary friend: Ok. By the way, did you notice that you spoke about not letting it write text—but letting it write example input? Well, example input can be text! Those kinds of things can be easy to miss, and often you may not be aware of what you’re missing.
Me: I may miss such things sometimes, but often I’ll be aware and leave them out so as not to get into tangents. I did notice the one you pointed out now (as well as several more in the example I just gave). Partly thanks to you helping me to get better at noticing those sorts of things.
Imaginary friend: That’s what imaginary friends are for.
Keep in mind that any way to point out bugs or potential problems is a way for the AI to affect how you think and what you think about. Take this test of selective attention for example. Or look at how magicians focus and misdirect people’s attention to trick them. There are lots of things like this.
Me: I’m certainly not claiming that the techniques I’ll describe are risk-free. At the same time, it’s important to consider how different techniques can come together to reduce risk. For example, one thing that will be touched upon again and again is testing the convergence of the results from different methods and architectures and pieces of code. Preferably, if an AI-system writes the code for some module, then the overall system should ensure that we also have that module be written by other AI-systems, and also by humans if possible.
In regards to e.g. AIs pointing out potential issues, the system as a whole (a system of systems of systems..) would preferably keep track of the changes made due to such suggestions, and also try to keep track of what the output of the system would be if those changes hadn’t been made.
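To sketch the convergence-testing idea in minimal form (the names here are illustrative, not a real proposal): independently produced implementations of the same module can be run against each other on sampled inputs, and any disagreement flags a candidate for human inspection.

```python
import random

def differential_test(implementations, gen_input, trials=1000, seed=0):
    """Run all implementations on random inputs; report any divergence.

    `implementations` is a list of callables that are supposed to compute
    the same function (e.g. written by different AI-systems and humans).
    Returns None if they converged on every sampled input, or a
    (counterexample_input, outputs) pair at the first disagreement.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    for _ in range(trials):
        x = gen_input(rng)
        outputs = [f(x) for f in implementations]
        if len(set(outputs)) > 1:
            return x, outputs  # implementations disagree here
    return None
```

This only gives probabilistic evidence, of course: a hostile implementation could misbehave on inputs the sampler never reaches, which is one reason the text treats convergence-testing as one layer among many rather than a sufficient check.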
Imaginary friend: Ok. But just because something makes you safer, and you can handwave at a lot of techniques where each thing helps a bit with safety—this does not show that the process as a whole will be safe. And I am a bit worried about you proposing ideas that are hard to criticize because they’re complicated and vague, instead of getting to the “core” of the problem. Also, adding on lots of complications to increase security can in some cases be counter-productive.
Me: Some good points there. Although I do feel uncertain about the degree to which there is a “core” of the problem. I want this series to outline a path towards a more “complete” solution, but I want to start by focusing more on techniques and principles that are “building-blocks”.
The best scenario is if the first superintelligent AGI is robustly aligned from the start. I think we should try hard to make that be the case, and I hope this series isn’t much of a distraction from that.
I am reminded of how I am in support of trying to reduce CO2 emissions today, but also support research on geoengineering.
Topics I’ll cover in this series
Imaginary friend: So where do you see this discussion going?
Me: I guess we’ll see. But I do have some plans.
One thing I want to discuss is the concept of “solution-generators” (and maybe also “solution-generators-generators”, etc). When asking an oracle/genie for something, it may often make sense to not ask the question directly, but to instead ask for a “solution-generator”—that is to say, a function that generates solutions within the domain in question. These “generators” should be optimized for being narrow/specialized/modular/verifiable (techniques for trying to ensure that the system faithfully optimizes the “generator” for these properties are a topic by themselves).
For example, if we want to do a physics-simulation, then we might want to use a “simulation-generator” or “simulation-generator-generator”. One merit of this is that the code can be easier to inspect and understand than the code of a superintelligent AGI. Another merit is that the “generator” can be tested against existing data.
But of course, you can also ask it to answer the question directly (in addition to asking for the “solution-generator”). And then you can see if the output is the same when you obtain it indirectly vs when you obtain it directly.
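A minimal sketch of that workflow, with hypothetical function names (nothing here is a real interface), might look like this: first validate the requested generator against data we already know the answers to, then cross-check it against answers obtained directly.

```python
def validate_generator(generator, known_cases, tolerance=0.0):
    """Check the generator reproduces known (input, expected) pairs.

    For a physics 'simulation-generator', `known_cases` would be existing
    experimental data; a generator that fails here is rejected outright.
    """
    for x, expected in known_cases:
        if abs(generator(x) - expected) > tolerance:
            return False
    return True

def cross_check(generator, direct_answers):
    """Compare generator output with answers obtained directly.

    `direct_answers` is a list of (input, answer) pairs obtained from a
    separate query path; divergence flags the generator for inspection.
    """
    return all(generator(x) == y for x, y in direct_answers)
```

Both checks are necessarily partial: agreement on known data and on spot-checked questions is evidence, not proof, that the generator behaves as intended on new inputs.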
Imaginary friend: Isn’t it better not to ask it the specific question, so that it’s harder for it to guess what you’re after?
Me: I will mostly not be mentioning this sort of thing explicitly, but yes I agree, you shouldn’t ask both questions of the same instance of the AGI-system. The instance that is asked for A should not know that you also have asked it for B. Though of course, you could have additional instances that are asked both questions, and check if results/output converge.
Another topic I want to speculate a bit about is the topic of brain emulations, and also other “generators” that try to predict the behavior and thinking of humans. I’m interested in the question of “Can we bootstrap from AGI to either brain emulations and/or something else that can predict human behavior, while more or less staying within the digital realm?”. Obviously, there are lots of ethical considerations here, and it’s very important to avoid suffering subroutines insofar as possible!
Me: I also want to discuss techniques involving computable argument-trees/”proofs”, where every inference-step is made explicit. As well as outlining possible techniques to have the concepts/propositions of such proofs represent more or less any thought that is sufficiently “clear/crisp/explicit” (blurring the distinction between “mathematical” proofs and any other argument about anything). Included in the discussion will be outlines of ideas for how to deal with “vagueness” and cluster-like concepts within such argument-trees/”proofs”.
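As a toy illustration of what “every inference-step made explicit” could mean: a checker that accepts a step only if it names an explicit rule whose premises have already been established, and rejects everything else. A real system would need far richer rules than the two shown here; this only shows the shape.

```python
def check_proof(axioms, steps):
    """Mechanically check an argument where every step is explicit.

    Each step is a (conclusion, rule, premises) triple. Only two rules
    exist in this toy: 'axiom' (conclusion must be a given axiom) and
    'modus_ponens' (premises are (p, impl) where impl is ('->', p, q)
    and both must already be proved). Prose arguments are never trusted.
    """
    proved = set()
    for conclusion, rule, premises in steps:
        if rule == "axiom":
            if conclusion not in axioms:
                return False
        elif rule == "modus_ponens":
            p, impl = premises
            if p not in proved or impl not in proved:
                return False
            if impl != ("->", p, conclusion):
                return False
        else:
            return False  # unknown rule: reject
        proved.add(conclusion)
    return True
```

The appeal of this shape is that checking is a simple mechanical procedure even when finding the proof required capabilities far beyond ours, which is exactly the verification-easier-than-generation asymmetry discussed later in the series.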
And I’ll be outlining thoughts about capabilities that I think will help with verifying that instructions for doing things in the real world will work as intended. Such as for example copying a strawberry at the molecular level without unintended consequences.
The more smart and thoughtful people who see things differently from you, the more reason for self-doubt about your own judgment. And this is for me a significant source of uncertainty about my ideas in regards to alignment (and AI more generally). But it seems best to me to try to describe my perspective as well as I can, and then people can do with that what seems best to them.
Imaginary friend: Talk to you later then.
Any feedback or comments (be that positive or negative or neither) would be received with interest.
I skimmed the post, here is some feedback (context):
I’m probably not the best person to red team this since some of my own alignment ideas are along similar lines. I’m also a bit on the optimistic side about alignment more generally—it might be better to talk to a pessimist.
This sounds a bit like the idea of a “low-bandwidth oracle”.
I think the biggest difficulty is the one you explicitly acknowledged—boxing is hard.
But there are also problems around ensuring that bandwidth is actually limited. If you have a human check to see that the AGI’s output conforms to the spec, the AGI could put an infohazard in the output, or write code which looks like it does one thing but actually does another. If you have a computer program or AI check the AI’s output, the AGI could exploit a flaw in that program/AI. I think this sort of thing basically gets you a probabilistic safety measure, because there’s always a risk that there’s a flaw that the superintelligent AGI sees (or can infer) that you don’t see. (I like this intuition pump for seeing why these sort of problems are plausible.) I think probabilistic safety measures can be good if we stack a lot of them together in the right way.
The idea of emitting machine-checkable proofs is interesting. I’m not sure such proofs are very useful though. “Finding the right spec is one of the biggest challenges in formal methods.”—source. And finding the right spec seems more difficult to outsource to an unfriendly AI. In general, I think using AI to improve software reliability seems good, and tractable.
I think you’ll find it easier to get feedback if you keep your writing brief. Assume the reader’s time is valuable. Sentences like “I will mention some stuff later that maybe will make it more clear how I’d think about such a question.” should simply be deleted—make huge cuts. I think I might have been able to generate the bullet points above based on a 2-paragraph executive summary of your post. Maybe post a summary at the top, and say people are welcome to give feedback after just having read the summary.
Similarly, I think it is worth investing in clarity. If a sentence is unclear, I have a tendency to just keep reading and not mention it unless I have a prior that the author knows what they’re talking about. (The older I get, the more I assume that unclear writing means the author is confused and ignorable.) I like writing advice from Paul Graham and Scott Adams.
Personally I’m more willing to give feedback on prepublication drafts because that gives me more influence on what people end up reading. I don’t have much time to do feedback right now unfortunately.
Thanks, that’s interesting. Hadn’t seen that (insofar as I can remember). Definitely overlap there.
Same, and that’s a good/crisp way to put it.
Will edit at some point so as to follow the first part of that suggestion. Thanks!
Some things in that bullet-list address stuff I left out to cut length, and stuff I thought I would address in future parts of the series. I found those parts of the bullet-list helpful too, but this still exemplifies the dilemmas/tradeoffs regarding length. I will try to make more of an effort to look for things to shorten based on your advice. And I should have read through this one more time before publishing.
I have a first draft ready for part 2 now: https://docs.google.com/document/d/1RGyvhALY5i98_ypJkFvtSFau3EmH8huO0v5QxZ5v_zQ/edit
Will read it over more, but plan to post within maybe a few days.
I have also made a few changes to part 1, and will probably make additional changes to part 1 over time.
As you can see if you open the Google Doc, part 2 is not any shorter than part 1. You may or may not interpret that as an indication that I don’t make effective use of feedback.
Part 3, which I have not finished, is the part that will focus more on proofs. (Edit: It does not. But maybe there will be a future post that focuses on proofs as planned. It is, however, quite relevant to the topic of proofs the way I think of things.)
Any help from anyone in reading over would be appreciated, but at the same time it is not expected :)
A draft for part 3 is now ready: https://docs.google.com/document/d/1S0HsaImfJHtyz2wo59S9n2doefyKQfP9Yp6A1XxXEwM/edit#.
Length-wise it is a bit of a monstrosity, despite your warnings.
I do intend to later write a post that is quicker and easier to read (but gets less into detail).
Feedback would be welcomed, but it’s definitely not expected (so I’m sharing it here just in case) 🙂
Is this really true? I would guess that we might be so far from solving alignment that nothing would look particularly helpful? Or, even worse, the only thing that would look helpful is something completely wrong?
Thanks for commenting :)
> I would guess that we might be so far from solving alignment that nothing would look particularly helpful?
My thinking is that using reinforcement-learning-like methods will select for systems that look like they are aligned / optimized for what we are trying to optimize them for. If the system gives answers/solutions/etc where humans can see that it doesn’t really optimize well for what we want it to optimize for, then I presume it would be tweaked further until that no longer was the case. For example, suppose that we get it to write code for us, and we select how easy the code is to read/understand for humans as an optimization-criterion, and then it returns code that has clear examples of things that could be improved—well, then it would presumably be tweaked further (and it would presumably have been a bad strategy for it if it was trying to trick us into thinking it was aligned).
That being said, it is a possibility that a system could be pretending that it’s less capable than it really is. That way it could return answers that didn’t look particularly helpful. Which I guess sort of makes the answer to your question “yes”. As in, there could be scenarios with AGI/superintelligence where it is “pretending” to not be an AGI/superintelligence. And then there could be other scenarios where it’s not hiding that it’s really capable, but where it does pretend to be less capable than it really is at certain types of things. But there would presumably be incentives for it to not do that to such a degree that it became easy for humans to notice that it isn’t doing its best. (Btw, I am consciously leaving out a few types of conceivable scenarios so as to make this comment less jumbled.)
Most of this series will focus on scenarios where we know that the system in question is superintelligent, and where we had the foresight to “box” it in (making it hard to hack itself out through security vulnerabilities in operating system and that sort of stuff) well before it acquired anything resembling superintelligence (or something that plausibly might self-improve in a self-reinforcing feedback-loop towards anything resembling superintelligence).
> Or, even worse, the only thing that would look helpful is something completely wrong?
Yes, that is a major worry. And that is one of the main concerns/problems/difficulties that I’m hoping to address in this series (not the only one, but one of the major ones). That is to say, I want to outline strategies/techniques that are intended to make it as hard and as unlikely as possible for the AI-system to be able to trick us into thinking it is being helpful (giving us what we want) while it really isn’t.
Are any of the ideas in this series new?
TLDR: Many ideas in this series have been written about before by others. But it also seems to me that some of the ideas are new and important, or if not new then at least under-discussed. But I could be wrong.
There are many big and small ideas in this series. Many of them have been described or alluded to elsewhere by others. But there are also ideas in this series that (1) seem to me like they probably are important and (2) I cannot remember having seen described elsewhere.
I’ve had a hobby interest in the topic of superintelligence since 2009, and in the topic of AI Alignment since 2014. So I’ve read a lot of what has been written, but there is also a lot that I haven’t read. I could be presenting ideas in this series that seem new/interesting to me, but that actually are old.
Here are some writings I am aware of that I think are relevant to ideas in this series (some of them have influenced me, and some of them I know overlap quite a bit):
Eric Drexler has written extensively about principles and techniques for designing AGI-systems that don’t have agent-like behavior. In Comprehensive AI Services as General Intelligence he—well, I’m not going to do his 200-page document justice, but one of the things he writes about is having AGI-systems consist of more narrowly intelligent sub-systems (that are constrained and limited in what they do).
Paul Christiano and others have written about concepts such as Iterated Distillation and Amplification and AI safety techniques revolving around debate.
Several people have talked about using AIs to help with work on AI safety. For example, Paul Christiano writes: “I think Eliezer is probably wrong about how useful AI systems will become, including for tasks like AI alignment, before it is catastrophically dangerous. I believe we are relatively quickly approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems for those ideas and, proposing modifications to proposals, etc. and that all of those things will become possible in a small way well before AI systems that can double the pace of AI research. By the time AI systems can double the pace of AI research, it seems like they can greatly accelerate the pace of alignment research.”
To me both Eliezer’s and Paul’s intuitions about FOOM seem like they plausibly could be correct. However, I think my object-level intuitions are more similar to Eliezer’s than Paul’s. Which partly explains my interest in how we might go about getting help from an AGI-system with alignment after it has become catastrophically dangerous. If AI-systems help us to more or less solve alignment before they become catastrophically dangerous, then that would of course be preferable—and, I dunno, perhaps they will, but I prefer to contemplate scenarios where they don’t.
Several people have pointed out that verification often is easier than generation, and that this is a principle we can make heavy use of in AI Alignment. Paul Christiano writes: “Eliezer seems to argue that humans couldn’t verify pivotal acts proposed by AI systems (e.g. contributions to alignment research), and that this further makes it difficult to safely perform pivotal acts. (...) I think that this claim is probably wrong and clearly overconfident. I think it doesn’t match well with pragmatic experience in R&D in almost any domain, where verification is much, much easier than generation in virtually every domain.”
It seems to me as well that Eliezer is greatly underestimating the power of verifiability. At the same time, I know Eliezer is really smart and has thought deeply about AI Alignment. Eliezer seems to think quite differently from me on this topic, but the reasons why are not clear to me, and this irks me.
Steve Omohundro (a fellow enthusiast of proofs, verifiability, and of combining symbolic and connectionist systems) has written about Safe-AI Scaffolding. He writes: “Ancient builders created the idea of first building a wood form on top of which the stone arch could be built. Once the arch was completed and stable, the wood form could be removed. (...) We can safely develop autonomous technologies in a similar way. We build a sequence of provably-safe autonomous systems which are used in the construction of more powerful and less limited successor systems. The early systems are used to model human values and governance structures. They are also used to construct proofs of safety and other desired characteristics for more complex and less limited successor systems.”
As John Maxwell pointed out to me, Stuart Armstrong has written about Low-Bandwidth Oracles. In the same comment he wrote “I’m probably not the best person to red team this since some of my own alignment ideas are along similar lines”.
In his book Superintelligence, Nick Bostrom writes: “We could ask such an oracle questions of a type for which it is difficult to find the answer but easy to verify whether a given answer is correct. Many mathematical problems are of this kind. If we are wondering whether a mathematical proposition is true, we could ask the oracle to produce a proof or disproof of the proposition. Finding the proof may require insight and creativity beyond our ken, but checking a purported proof’s validity can be done by a simple mechanical procedure. (...) For instance, one might ask for the solution to various technical or philosophical problems that may arise in the course of trying to develop more advanced motivation selection methods. If we had a proposed AI design alleged to be safe, we could ask an oracle whether it could identify any significant flaw in the design, and whether it could explain any such flaw to us in twenty words or less. (...) Suffice it here to note that the protocol determining which questions are asked, in which sequence, and how the answers are reported and disseminated could be of great significance. One might also consider whether to try to build the oracle in such a way that it would refuse to answer any question in cases where it predicts that its answering would have consequences classified as catastrophic according to some rough-and-ready criteria.”
There are at least a couple of posts on LessWrong with titles such as Bootstrapped Alignment.
I am reminded a bit of experiments where people are told to hum a melody, and how it’s often quite hard for others to guess what song you are humming. AI Alignment discussion feels to me a bit like that sometimes—it’s hard to convey exactly what I have in mind, and hard to guess exactly what others have in mind. Often there is a lot of inferential distance, and we are forced to convey great networks of thoughts and concepts over the low bandwidth medium of text/speech.
I am also reminded a bit about Holden Karnofsky’s article Thoughts on the Singularity Institute (SI), where he wrote:
In his response to this, Eliezer wrote the following:
In conclusion, I wouldn’t write this series if I didn’t think it could be useful. But as a fellow hairless monkey trying to do his best, it’s hard for me to confidently distinguish ideas that are new/helpful from ideas that are old/obvious/misguided. I appreciate any help in disabusing me of false beliefs.