Useful Does Not Mean Secure
Brief summary of what I’m trying to do with this post:
Contrast a “Usefulness” focused approach to building AI with a “Security” focused approach, and try to give an account of where security problems come from in AI.
Show how marginal transparency improvements don’t necessarily improve things from the perspective of security.
Describe what research is happening that I think is making progress from the perspective of security.
In this post I will be attempting to taboo the term ‘alignment’, and just talk about properties of systems. The below is not very original; I’m often just saying things in my own words that Paul and Eliezer have written, in large part just to try to think through the considerations myself. My thanks to Abram Demski and Rob Bensinger for comments on a draft of this post, though this doesn’t mean they endorse the content or anything.
Useful Does Not Mean Secure
This post grew out of a comment thread elsewhere. In that thread, Ray Arnold was worried that there was an uncanny valley in how good we are at understanding and building AI, where we can build AGI but not a safe AGI. Rohin Shah replied, and I’ll quote from his reply:
Consider instead this worldview:
The way you build things that are useful and do what you want is to understand how things work and put them together in a deliberate way. If you put things together randomly, they either won’t work, or will have unintended side effects.
(This worldview can apply to far more than AI; e.g. it seems right in basically every STEM field. You might argue that putting things together randomly seems to work surprisingly well in AI, to which I say that it really doesn’t, you just don’t see all of the effort where you put things together randomly and it simply flat-out fails.)
The argument “it’s good for people to understand AI techniques better even if it accelerates AGI” is a very straightforward non-clever consequence of this worldview.
[...]
Under the worldview I mentioned, the first-order effect of better understanding of AI systems, is that you are more likely to build AI systems that are useful and do what you want.
A lot of things Rohin says in that thread make sense. But in this post, let me point to a different perspective on AI that I might consider, if I were to focus entirely on Paul Christiano’s model of greedy algorithms in part II of his post on what failure looks like. That perspective sounds something like this:
The way you build things that are useful and do what you want, when you’re in an environment with much more powerful optimisers than you, is to spend a lot of extra time making them secure against adversaries, over and above simply making them useful. This is so that the other optimisers cannot exploit your system to achieve their own goals.
If you build things that are useful, predictable, and don’t have bad side-effects, but are subject to far more powerful optimisation pressures than you, then by default the things you build will be taken over by other forces and end up not being very useful at all.
An important way in which artificial intelligence research is distinct is that you’re not simply competing against other humans, where you have to worry about hackers, governments and political groups; the core goal of artificial intelligence research is the creation of much more powerful general optimisers than currently exist within humanity. This is a difference in kind from all other STEM fields.
Whereas normal programming systems that aren’t built quite right are more likely to do dumb things or just break, when you make an AI system that isn’t exactly what you wanted, the system might be powerfully optimising for other targets in a way that has the potential to be highly adversarial. In discussions of AI alignment, Stuart Russell often likes to use the analogy that “building bridges that stay up” is an entirely integrated part of bridge building, not a distinct field. To extend the analogy a little, you might say the field of AI is unusual in that if you don’t quite build the bridge well enough, the bridge itself may actively seek out security vulnerabilities that could bring it down, hide them from your attention until it has the freedom to take the bridge down in one go, and then take out all the other bridges in the world.
Now, talk of AI necessarily blurs the line between ‘external optimisation pressures’ and ‘the system is useful and does what you want’ because the system itself is creating the new, powerful optimisation pressure that needs securing against. Paul’s post on what failure looks like talks about this, so I’ll quote it here:
Modern ML instantiates massive numbers of cognitive policies, and then further refines (and ultimately deploys) whatever policies perform well according to some training objective. If progress continues, eventually machine learning will probably produce systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals.
Once we start searching over policies that understand the world well enough, we run into a problem: any influence-seeking policies we stumble across would also score well according to our training objective, because performing well on the training objective is a good strategy for obtaining influence.
How frequently will we run into influence-seeking policies, vs. policies that just straightforwardly pursue the goals we wanted them to? I don’t know.
You could take the position that, even though security work is not normally central to a field, this new kind of security work is already central to the field of AI: increasing the ability to build ‘useful’ things will naturally require solving these novel security problems, so the field will get them right by default.
This is my understanding of Paul’s mainline expectation (based on his estimates here and that his work is based around making useful / well motivated AI described here, here and in Rohin’s comment on that post) and also my understanding of Rohin’s mainline expectation (based on his estimates here). My understanding is this still means there’s a lot of value on the table from marginal work, so both of them work on the problem, but by default they expect the field to engage with this problem and do it well.
Restatement: In normal tech companies, there’s a difference between “making useful systems” and “making secure systems”. In the field of AI, “making useful systems” includes potentially building powerful adversaries, which involves novel security problems, so you might expect that executing the standard “make useful systems” playbook will end up solving the novel security problems.
For example, in a debate on instrumental convergence between various major AI researchers, this was also the position that Francesca Rossi took:
Stuart, I agree that it would be easy to build a coffee fetching machine that is not aligned to our values, but why would we do this? Of course value alignment is not easy, and still a research challenge, but I would make it part of the picture when we envision future intelligent machines.
However, Yann LeCun said something subtly different:
One would have to be rather incompetent not to have a mechanism by which new terms in the objective could be added to prevent previously-unforeseen bad behavior.
Yann is implicitly taking the stance that there will not be powerful adversarial pressures exploiting such unforeseen differences between the objective function and humanity’s values. His responses are of the kind “We wouldn’t do that” and “We would change it quickly when those problems arose”, but not “Here’s how you build a machine learning system that cannot be flawed in this way”. It seems to me that he does not expect there to be any further security concerns of the type discussed above. If I point out a way that your system could malfunction, it is sometimes okay to say “Oh, if anyone accidentally gives that input to the system, then we’ll see and fix any problems that occur”, but if your government computer system is not secure, then by the time you’ve noticed what’s happening, a powerful adversary is already inside your system and taking actions against you.
(Though I should mention that I don’t think this is the crux of the matter for Yann. I think his key disagreement is that he thinks we cannot talk usefully about safe AGI design before we know how to build an AGI—he doesn’t think that prosaic AI alignment is in principle feasible or worth thinking about.)
In general, it seems to me that if you show me how an AI system is flawed, and my response is simply to patch that particular problem and then go back to relaxing, then I am implicitly disbelieving that optimisation processes more powerful than human civilization will look for similar flaws and exploit them; otherwise my threat level would go up drastically.
To clarify what this worry looks like: advances in AGI are hopefully building systems that can scale to being as useful and intelligent as is physically feasible in our universe—optimisation power way above that of human civilization. As the systems get smarter, you need to build more into them to make sure the smart parts can’t exploit the system for their own goals. This assumes an epistemic advantage, as Paul says in the Failure post:
Attempts to suppress influence-seeking behavior (call them “immune systems”) rest on the suppressor having some kind of epistemic advantage over the influence-seeker. Once the influence-seekers can outthink an immune system, they can avoid detection and potentially even compromise the immune system to further expand their influence. If ML systems are more sophisticated than humans, immune systems must themselves be automated. And if ML plays a large role in that automation, then the immune system is subject to the same pressure towards influence-seeking.
There’s a notion whereby if you take a useful machine learning system, and you just make it more powerful, what you’re essentially doing is increasing the intelligence of the optimisation forces passing through it, including the adversarial optimisation forces. As you take the system and make it vastly superintelligent, your primary focus needs to be on security from adversarial forces, rather than primarily on making something that’s useful. You’ve become an AI security expert, not an AI usefulness expert. The important idea is that AI systems can break at higher levels of intelligence, even if they’re currently quite useful.
As I understand it, this sort of thing happened at Google, who were first experts in computer networks, and then became security experts, because for a while the main changes they made to Google Search were to increase security and make it harder for people to game the PageRank system. The adversarial pressures on them have since hit terminal velocity and there probably won’t be any further increases, unless and until we build superintelligent AI (be it general or the relevant kind of narrow).
Marginal Transparency Does Not Mean Marginal Security
A key question in figuring out whether to solve this security problem via technical research (as opposed to global coordination) is whether a line of work differentially makes this sort of security against powerful optimisation easier to work on, or whether it simply makes the system more useful in ways that don’t account for the novel adversaries.
In the quote from Rohin above, and in things said by other people in other places, it sounds to me like the case being made is that the system is generally becoming more understandable, so it’s both easier to build and easier to secure. All improvements in understanding are improvements in being able to ‘do useful things’. And indeed my general prior is that increasing understanding is always better. But one of the core arguments Nick Bostrom makes is that not all scientific insights are positive for a civilisation at its current level of wisdom, and that we need to be able to influence the order in which we get them. There are many types of ‘increased understanding’, and it’s not the case that all of them are positive on the current margin.
Let me give a concrete example where more transparency increases usefulness but not security.
It’s sometimes the case that, in prisons, the prison staff will attempt to get the prisoners to help out with tasks, errands and maintenance around the prison. However, if you send them into offices and other parts of the prison but don’t watch what they do, then they may conspire against the prison staff—build things that can be used as weapons against the guards, build tunnels, use computers to contact the outside world, etc.
Now, if you start looking at what the prisoners are doing when they’re in office block #17, and you have basic rewards and punishments available (e.g. giving or removing access to special foods and entertainment), then you can reward them for doing good work there and punish them for doing bad work. You can also punish them for conspiring against you there.
However, this hasn’t really helped. If they were planning to build a tunnel in office block #17, they will now just build a tunnel in office block #18 instead. If I can start to see what they do all the time in some locations, then I can punish the ones that aren’t doing broadly the right work, and generally make sure they’re doing useful work. But this doesn’t mean that they’re not conspiring against me in the places I can’t look. It seems plausible to me that many improvements in transparency, while making it easier to get the people to do the prison’s work, do not make it harder to conspire against me, and just move around where the conspiring happens.
If you’re trying to delegate and incentivise people to do labour for you, you might just think that you should bring in a good manager to get people to do useful work. But in the prison situation, you primarily need to become a security expert, over and above being an expert in how to manage well. There are many improvements in transparency of what the prisoners are doing that help force them to do useful work, but that don’t secure you against them conspiring with each other to break out of the system.
With machine learning systems, we already have all the weights of the neural network to look at, so the system is maximally transparent. We can see everything, for certain values of ‘see’. I think the relevant question is on what level you can understand what’s going on. As we get higher-level understanding, we can maybe start to figure out whether it’s doing certain bad things, or certain good things, and punish or reward those. But making sure that the process will do something useful (e.g. invest money, run a hospital, classify images) doesn’t mean I know how to tell whether this will lead to the type of full understanding that prevents adversarial work from simply being moved to areas that are too hard or too costly for me to understand.
Restatement: Marginal improvements in understandability and transparency can make it much easier to build useful systems, but they don’t necessarily make a meaningful difference to our ability to build secure systems. At increasingly higher levels of understanding, they let us change the type of work needed to exploit the system; this is not the same as a design that is safe no matter how much optimisation power is directed against us.
I wrote this in response to Ray trying to figure out how to tell whether any given type of machine learning research is making differential progress. The specific type of research discussed in that thread has a more detailed story which I won’t go into here, and mostly seems very helpful from my layman perspective, but I think that research “either being of zero impact, or else making the whole field more transparent/understandable” does not mean that the research makes differential progress on making the system secure. Marginal transparency can increase usefulness without increasing security.
In one sense, a machine learning system is maximally transparent—I can see every part of what it is doing. But so long as I don’t understand its reasoning, so long as there are levels on which I don’t know what it’s thinking, by default I’m not confident that adversarial thought hasn’t just moved there instead.
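To make “transparent, for certain values of ‘see’” concrete, here’s a minimal toy sketch (my own illustration with made-up numbers, nothing deep): every parameter of even a tiny network can be printed and inspected, yet the raw numbers by themselves don’t tell you what the function is doing, let alone whether a much larger system is doing something adversarial.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer network. The random weights stand in for a trained model's
# parameters; the point is only that we can inspect all of them.
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)

def net(x):
    """Forward pass: ReLU hidden layer, then a linear output."""
    hidden = np.maximum(0.0, W1 @ x + b1)
    return W2 @ hidden + b2

# "Maximal transparency": every single parameter is right there to look at.
print("parameters we can inspect:", sum(p.size for p in (W1, b1, W2, b2)))
print("first row of W1:", W1[0])

# ...but staring at the numbers doesn't tell us what the network is computing,
# or on what level a much bigger network might be hiding something.
print("net([1.0, -2.0]) =", net(np.array([1.0, -2.0])))
```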
Current Technical Work on Security
From this perspective, let me talk about the research that seems like it’s aiming to help on the security front. This is not all the work being done, just the work that I feel I understand well enough to summarise from this perspective.
My understanding is that the main work attempting to pinpoint where optimisation enters the system in surprising ways is Hubinger, van Merwijk, Mikulik, Skalse and Garrabrant’s work on risks from learned optimisation (paper, sequence). This gives names to lots of concepts describing how optimisers work, and asks questions like:
Under what conditions will my learned algorithm itself do optimisation?
When the learned algorithm does optimisation, what will its objective be, and what will the relationship be between its objective and the loss function of the neural net that produced it?
If the learned optimiser has successfully built a model of the objective function that was used to build it, what conditions predict whether it will be working around my objective as opposed to toward it?
When should I expect the optimiser in the learned algorithm to try to deceive me?
The paper also asks whether it’s possible to prevent influence-seeking algorithms from entering your systems by imposing complexity measures on the system, such as time and space penalties. On this topic, Paul Christiano has asked whether requiring that systems be maximally efficient according to circuit description length removes all adversarial behaviour; and Evan has offered an answer in the negative.
It’s also the case that the Agent Foundations team at MIRI is trying to think about the problem of inner alignment more broadly, and poke at various concepts around here, such as in their write-ups on Robust Delegation and Subsystem Alignment. These explore many simple background questions to which we don’t have principled answers, and for which we cannot yet draw toy models of intelligent agents that reliably get the problems right.
Is there a principled way to figure out whether I should trust that something more intelligent than me shares my values, given that I can’t figure out exactly what it’s going to do? If I am a child, sometimes adults will do something that is the opposite of what I want—is there a way of figuring out whether they’re doing this in accordance with my goals?
How should I tell a more intelligent agent than me what I want it to do, given that I don’t know everything about what I want? This is especially hard given that optimisation amplifies the differences between what I say I want and what I actually want (aka Goodhart’s Law); see the toy sketch just after this list.
How do I make sure the different parts of a mind are in a good balance, rather than some parts overpowering other parts? When it comes to my own mind, sometimes different parts get out of whack and I become too self-critical, or overconfident, or depressed, or manic. Is there a principled way of thinking about this?
How do I give another agent a good description of what to do in a domain, without teaching them everything I know about the domain? This is a problem in companies, where sometimes people who don’t understand the whole vision can make bad tradeoffs.
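Here’s the toy sketch of that Goodhart amplification (a minimal illustration under made-up assumptions, not anything from the Robust Delegation write-up): sample candidate actions, score them with a noisy proxy for the true utility, and pick the best by the proxy. The harder you optimise the proxy, the larger the gap between what the proxy says you got and what you actually wanted.

```python
import numpy as np

rng = np.random.default_rng(0)

def goodhart_demo(n_candidates, trials=2000):
    """Pick the best of n candidates by a noisy proxy; return the average
    proxy score and average true utility of the winner."""
    proxy_scores, true_scores = [], []
    for _ in range(trials):
        true_u = rng.normal(size=n_candidates)           # what I actually want
        proxy = true_u + rng.normal(size=n_candidates)   # what I said / can measure
        winner = np.argmax(proxy)                        # optimise the proxy
        proxy_scores.append(proxy[winner])
        true_scores.append(true_u[winner])
    return np.mean(proxy_scores), np.mean(true_scores)

# More optimisation pressure (more candidates searched) widens the gap
# between the proxy score and the true utility of the selected action.
for n in (1, 10, 100, 10_000):
    proxy, true_u = goodhart_demo(n)
    print(f"candidates={n:>6}  proxy={proxy:5.2f}  "
          f"true={true_u:5.2f}  gap={proxy - true_u:5.2f}")
```

With these made-up Gaussian numbers the true utility of the winner still goes up, just much more slowly than the proxy score; with heavier-tailed noise the true utility can stop improving altogether while the proxy keeps climbing.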
That’s the work I feel I have a basic understanding of. I’m curious about explanations of how other work fits into this framework.
It’s a little more disjunctive:
Maybe the problem is too difficult. Then we could coordinate to avoid the problem. This could mean not building powerful AI systems at all, limiting the types of AI systems we build, etc.
Maybe it’s not actually a problem. Humans seem kinda sorta goal-directed, and frequently face existential angst over what the meaning of life is. Maybe powerful AI systems will similarly be very capable but not have explicit goals.
Maybe there isn’t differentially powerful optimization. We build somewhat-smarter-than-human AI systems, and these AI systems enable us to become more capable ourselves (just as the Internet has made us more capable), and our capabilities increase alongside the AI’s capabilities, and there’s never any optimization that’s way more powerful than us.
Maybe there isn’t powerful adversarial optimization. We figure out how to solve the alignment problem for small increases in intelligence; we use this to design the first smarter-than-human systems, they use it to design their successors, etc. to arbitrary levels of capabilities.
Maybe hacks are enough. Every time we notice a problem, we just apply a “band-aid” fix, nothing that seems principled. But this turns out to be enough. (I don’t like this argument, because it’s unclear what a “band-aid” fix is—for some definitions of “band-aid” I’d feel confident that “band-aids” would not be enough. But there’s something along these lines.)
Maybe, I’m not sure. Regardless of the underlying explanation, if everyone sounded like Yann I’d be more worried. (ETA: Well, really I’d spend a bunch of time evaluating the argument more deeply, and form a new opinion, but assuming I did that and found the argument unconvincing, then I’d be more worried.)
I agree if you assume a discrete action that simply causes the system to become vastly superintelligent. But we can try not to get to powerful adversarial optimization in the first place; if that never happens then you never need the security. (As a recent example, relaxed adversarial training takes advantage of this fact.) In the previous list, bullet points 3 and 4 are explicitly about avoiding powerful adversarial optimization, and bullet points 1 and 2 are about noticing whether or not we have to worry about powerful adversarial optimization and dealing with it if so. (Meta: Can we get good numbered lists? If we have them, how do I make them?)
Given how difficult security is, it seems better to aim for one of those scenarios. In practice, I do think any plan that involves building powerful AI systems will require some amount of security-like thought—for example, if you’re hoping to detect adversarial optimization to stop it from arising, you need a lot of confidence in the detector. But there isn’t literally strong adversarial optimization working against the detector—it’s more that if there’s a “random” failure, that turns into adversarial optimization, and so it becomes hard to correct the failure. So it seems more accurate to say that we need very low rates of failure—but in the absence of adversarial optimization.
(Btw, this entire comment is predicated on continuous takeoff; if I were convinced of discontinuous takeoff I’d expect my beliefs to change radically, and to be much less optimistic.)
Re:
And:
The latter summary in particular sounds superficially like Eliezer’s proposed approach, except that he doesn’t think it’s easy in the AGI regime to “just not build powerful adversarial optimizers” (and if he suspected this was easy, he wouldn’t want to build in the assumption that it’s easy as a prerequisite for a safety approach working; he would want a safety approach that’s robust to the scenario where it’s easy to accidentally end up with vastly more quality-adjusted optimization than intended).
The “do alignment in a way that doesn’t break if capability gain suddenly speeds up” approach, or at least Eliezer’s version of that approach, similarly emphasizes “you’re screwed (in the AGI regime) if you build powerful adversarial optimizers, and it’s a silly idea to do that in the first place, so just don’t do it, ever, in any context”. From AI Safety Mindset:
Omnipotence Test for AI Safety:
Non-Adversarial Principle:
Cf. the “X-and-only-X” problem.
Note that on my model, the kind of paranoia Eliezer is pointing to with “AI safety mindset” or security mindset is something he believes you need in order to prevent adversarialness and the other bad byproducts of “your system devotes large amounts of thought to things and thinks in really weird ways”. It’s not just (or even primarily) a fallback measure to keep you safe on the off chance your system does generate a powerful adversary. Quoting Nate:
Scott Garrabrant mentioned to me at one point that he thought Optimization Amplifies distills a (maybe the?) core idea in Security Mindset and Ordinary Paranoia. The problem comes from “lots of weird, extreme-state-instantiating, loophole-finding optimization”, not from “lots of adversarial optimization” (even though the latter is a likely consequence of getting things badly wrong with the former).
Eliezer models most of the difficulty (and most of the security-relatedness) of the alignment problem as lying in ‘get ourselves to a place where in fact our systems don’t end up as powerful adversarial optimizers’, rather than (a) treating this as a gimme and focusing on what we should do absent such optimizers, or (b) treating the presence of adversarial optimization as inevitable and asking how to manage it.
I think this idea (“avoiding generating powerful adversarial optimizers is an enormous constraint and requires balancing on a knife’s edge between disaster and irrelevance”) is also behind the view that system safety largely comes from things like “the system can’t think about any topics, or try to solve any cognitive problems, other than the ones we specifically want it to”, vs. Rohin’s “the system is trying to do what we want”.
Tbc, I wasn’t modeling Eliezer / Nate / MIRI as saying “there will be powerful adversarial optimization, and so we need security”—it is in fact quite clear that we’re all aiming for “no powerful adversarial optimization in the first place”. I was responding to the arguments in this post.
I agree that’s desirable all else equal, but such an approach would likely require more time and effort (a lot more time and effort on my model). It’s an empirical question whether we have that time / effort to spare (and also perhaps it’s better to get AGI earlier to e.g. reduce other x-risks, in which case the marginal safety from not relying on the assumption may not be worth it).
(I mentioned coordination on not building AGI above—I think that might be feasible if the “global epistemic state” was that building AGI is likely to kill us all, but seems quite infeasible if our epistemic state is “everything we know suggests this will work, but it could fail if we somehow end up with more optimization”.)
Thanks, this is helpful for understanding MIRI’s position better. (I probably should have figured it out from Nate and Scott’s posts, but I don’t think I actually did.)
Broadly, my hope is that we actually see non-existentially-catastrophic failures caused by AIs going down “the paths you weren’t visualizing”, and this causes you to start visualizing the path. Obviously all else equal it’s better if we visualize it in the first place.
I think I also have a different picture of what powerful optimization will look like—the paperclip maximizer doesn’t seem like a good model for the sort of thing we’re likely to build. An approach based on some explicitly represented goal is going to be dead in the water well before it becomes even human-level intelligent, because it will ignore “common sense rules” again and again (c.f. most specification gaming examples). Instead, our AI systems are going to need to understand common sense rules somehow, and the resulting system is not going to look like it’s ruthlessly pursuing some simple goal.
For example, the resulting system may be more accurately modeled as having uncertainty about the goal (whether explicitly represented or internally learned). Weird + extreme states tend to only be good for a few goals, and so would not be good choices if you’re uncertain about the goal. In addition, if our AI systems are learning from our conventions, then they will likely pick up our risk aversion, which also tends to prevent weird + extreme states.
Finally, it seems like there’s a broad basin of corrigibility, that prevents weird + extreme states that humans would rate as bad. It’s not hard to figure out that humans don’t want to die, so any weird + extreme state you create has to respect that constraint. And there are many other such easy-to-learn constraints.
Oh this is very interesting. I’m updating from
to
I mean, I guess it’s obvious if you’ve read the security mindset dialogue, but I hadn’t realised that was a central element to the capabilities gain debate.
Added: To clarify further: Eliezer has said that explicitly a few times, but only now did I realise it was potentially a deep crux of the broader disagreement between approaches. I thought it was just a helpful but not especially key example of not making assumptions about AI systems.
Eliezer also strongly believes that discrete jumps will happen. But the crux for him AFAIK is absolute capability and absolute speed of capability gain in AGI systems, not discontinuity per se (and not particular methods for improving capability, like recursive self-improvement). Hence in So Far: Unfriendly AI Edition Eliezer lists his key claims as:
(1) “Orthogonality thesis”,
(2) “Instrumental convergence”,
(3) “Rapid capability gain and large capability differences”,
(A) superhuman intelligence makes things break that don’t break at infrahuman levels,
(B) “you have to get [important parts of] the design right the first time”,
(C) “if something goes wrong at any level of abstraction, there may be powerful cognitive processes seeking out flaws and loopholes in your safety measures”, and the meta-level
(D) “these problems don’t show up in qualitatively the same way when people are pursuing their immediate incentives to get today’s machine learning systems working today”.
From Sam Harris’ interview of Eliezer (emphasis added):
See also:
Currently we’re ~1.5 months of work into moving to a new editor framework, which has them (amongst a bunch of other new things).
The GreaterWrong viewer for LessWrong allows you to write posts and comments using Markdown, allowing you to write numbered lists in the usual Markdown way.
You can also write markdown comments on LW; just enable the “use markdown editor” option in your user settings.
MIRI’s Undisclosed Work and Security
My understanding of MIRI’s undisclosed work is that they expect there probably aren’t findable ways to stop sufficiently advanced machine learning systems from bringing in powerful optimisers that exploit your system, and that gradient descent is not a feasible way to create optimisation power that can be secured in this way, so they’re working on an alternative basis for doing optimisation. It sounds like they’re making some progress, but I can’t know if it’s fast enough or going to work out.
As to whether the work is likely to make AI systems more useful, MIRI’s write-up explains their thinking, under the header “It is difficult to predict whether successful deconfusion work could spark capability advances”. They give two examples of foundational work that helps you reason about a system, where one (interval arithmetic, which allows you to place hard bounds on the possible error of your calculations) has no ability to speed up a system, but the other (probability theory, which is central to understanding modern image classifiers) does. They’re not sure which kind of research they’ll end up producing, but I can see why one would be helpful only for security whereas the other is also helpful for capabilities research.
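As a minimal sketch of what the interval arithmetic example means (my own toy illustration, not MIRI’s code): every quantity is carried around as a [lo, hi] pair, and each operation widens the pair so that the true value is guaranteed to stay inside it. This gives you hard bounds on your error, but it doesn’t make anything run faster.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    """A value known only to lie somewhere in [lo, hi]."""
    lo: float
    hi: float

    def __add__(self, other):
        # The sum of two uncertain values lies between the sums of the endpoints.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # The product's extremes must occur at some pair of endpoints.
        corners = [self.lo * other.lo, self.lo * other.hi,
                   self.hi * other.lo, self.hi * other.hi]
        return Interval(min(corners), max(corners))

# x and y are each known only to within plus or minus 0.01.
x = Interval(0.99, 1.01)
y = Interval(2.99, 3.01)

# The true value of x*y + x is guaranteed to lie inside the printed interval,
# roughly [3.95, 4.05]: a hard error bound, but no capability gain.
print(x * y + x)
```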
Planned newsletter summary:
Planned opinion:
(If you want to comment on my opinion, please do so as a reply to the other comment I made.)
ETA: Added a sentence about MIRI’s beliefs to the opinion.
Oh my, I never expected to be in the newsletter for writing an object level post about alignment. How exciting.