It seems to me that many people believe something like “We need proof-level guarantees, or something close to it, before we build powerful AI”. I could interpret this in two different ways:
Normative claim: “Given how bad extinction is, and the plausibility of AI x-risk, it would be irresponsible of us to build powerful AI before having proof-level guarantees that it will be beneficial”.
Empirical claim: “If we run a powerful AI system without having something like a proof of the statement ‘running this AI system will be beneficial’, then catastrophe is nearly inevitable”.
I am uncertain about the normative claim (there might be great benefits to building powerful AI sooner, including the reduction of other x-risks), and I disagree with the empirical claim.
If I had to argue briefly for the empirical claim, it would go something like this: “Since powerful AI will be world-changing, it will either be really good, or really bad—neutral impact is too implausible. But due to fragility of value, the really bad outcomes are far more likely. The only way to get enough evidence to rule out all of the bad outcomes is to have a proof that the AI system is beneficial”. I’d probably agree with this if we had to create a utility function and give it to a perfect expected utility maximizer (and we couldn’t just give it something trivial like the zero utility function), but that seems to be drastically cutting down our options.
So I’m curious: a) are there any people who believe the empirical claim? b) If so, what are your arguments for it? c) How tractable do you think it is to get proof-level guarantees about AI?
My thoughts: we can’t really expect to prove something like “this AI will be beneficial”. However, relying on empiricism to test our algorithms is very likely to fail, because it’s very plausible that there’s a discontinuity in behavior around the region of human-level generality of intelligence (specifically as we move to the upper end, where the system can understand things like the whole training regime and its goal systems). So I don’t know how to make good guesses about the behavior of very capable systems except through mathematical analysis.
There are two overlapping traditions in machine learning. There’s a heavy empirical tradition, in which experimental methodology is used to judge the effectiveness of algorithms along various metrics. Then, there’s machine learning theory (computational learning theory), in which algorithms are analyzed mathematically and properties are proven. This second tradition seems far more applicable to questions of safety.
(But we should not act as if we only have one historical example of a successful scientific field to try and generalize from. We can also look at how other fields accomplish difficult things, especially in the face of significant risks.)
I don’t think you need to posit a discontinuity to expect tests to occasionally fail.
I suspect the crux is more about how bad a single failure of a sufficiently advanced AI is likely to be.
I’ll admit I don’t feel like I really understand the perspective of people who seem to think we’ll be able to learn how to do alignment via trial-and-error (i.e. tolerating multiple failures). Here are some guesses why people might hold that sort of view:
We’ll develop AI in a well-designed box, so we can do a lot of debugging and stress testing.
counter-argument: but the concern is about what happens at deployment time
We’ll deploy AI in a box too, then.
counter: seems like that entails a massive performance hit (but it’s not clear if that’s actually the case)
We’ll have other “AI police” to stop any “evil AIs” that “go rogue” (just like we have for people).
counter: where did the AI police come from, and why can’t they go rogue as well?
The “AI police” can just be the rest of the AIs in the world ganging up on anyone who goes rogue.
counter: this seems to be assuming the “corrigibility as basin of attraction” argument (which has no real basis beyond intuition ATM, AFAIK) at the level of the population of agents.
A single failure isn’t likely to be that bad; it would take a series of unlikely failures to turn a safe (e.g. “satiable”) AI into an insatiable “open-ended optimizer AI”.
counter: we can’t assume that we can detect and correct failures, especially in real-world deployment scenarios where subagents might be created. So the failures may have time to compound. It also seems possible that a single failure is all that’s needed; this seems like an open question
OK I could go on, but I’d rather actually hear from anyone who has this view! :)
I hold this view; none of those are reasons for my view. The reason is much simpler: before x-risk-level failures, we’ll see less catastrophic (but still potentially very bad) failures for the same underlying reason. We’ll notice this, understand it, and fix the issue.
(A crux I expect people to have is whether we’ll actually fix the issue or “apply a bandaid” that is only a superficial fix.)
Yeah, this is why I think some kind of discontinuity is important to my case. I expect different kinds of problems to arise with very very capable systems. So I don’t see why it makes sense to expect smaller problems to arise first which indicate the potential larger problems and allow people to avert them before they occur.
If a case could be made that all potential problems with very very capable systems could be expected to first arise in survivable forms in moderately capable systems, then I would see how the more empirical style of development could give rise to safe systems.
Can you elaborate on what kinds of problems you expect to arise pre vs. post discontinuity?
E.g. will we see “sinister stumbles” (IIRC this was Adam Gleave’s name for half-baked treacherous turns)? I think we will, FWIW.
Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)
How about mesa-optimization? (I think we already see qualitatively similar phenomena, but my idea of this doesn’t emphasize the “optimization” part.)
Jessica’s posts about MIRI vs. Paul’s views made it seem like MIRI might be quite concerned about the first AGI arising via mesa-optimization. This seems likely to me, and would also be a case where I’d expect, unless ML becomes “woke” to mesa-optimization (which seems likely to happen, and not too hard to make happen, to me), we’d see something that *looks* like a discontinuity, but is *actually* more like “the same reason”.
Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)
This in particular doesn’t match my model. Quoting some relevant bits from Embedded Agency:
So I’m not talking about agents who know their own actions because I think there’s going to be a big problem with intelligent machines inferring their own actions in the future. Rather, the possibility of knowing your own actions illustrates something confusing about determining the consequences of your actions—a confusion which shows up even in the very simple case where everything about the world is known and you just need to choose the larger pile of money.
[...]
But it’s not that I’m imagining real-world embedded systems being “too Bayesian” and this somehow causing problems, if we don’t figure out what’s wrong with current models of rational agency. It’s certainly not that I’m imagining future AI systems being written in second-order logic! In most cases, I’m not trying at all to draw direct lines between research problems and specific AI failure modes.
What I’m instead thinking about is this: We sure do seem to be working with the wrong basic concepts today when we try to think about what agency is, as seen by the fact that these concepts don’t transfer well to the more realistic embedded framework.
I also predict that there will be types of failure we will not notice, or will misinterpret. It seems fairly likely to me that proto-AGI (i.e. AI that could autonomously learn to become AGI within <~10yrs of acting in the real world) is deployed and creates proto-AGI subagents, some of which we don’t become aware of (e.g. because of accidental/incidental/deliberate steganography) and/or are unable to keep track of. And then those continue to survive and reproduce, etc… I guess this only seems plausible if the proto-AGI has a hospitable environment (like the internet, or human brains/memes) and/or means of reproduction in the real world.
A very similar problem would be a form of longer-term “seeding”, where an AI (at any stage) with a sufficiently advanced model of the world and long horizons discovers strategies for increasing the chances (“at the margin”) that its values dominate in the long-term future. With my limited knowledge of physics, I imagine there might be ways of doing this just by beaming signals into space in a way calculated to influence/spur the development of life/culture in other parts of the galaxy.
I notice a lot of what I said above makes less sense if you think of AIs as having a similar skill profile to humans, but I think we agree that AIs might be much more advanced than people in some respects while still falling short of AGI because of weaknesses in other areas.
That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AI’s (e.g. vastly superhuman) skill in some areas, if it still fails at some things we think are easy. To pull an example (not meant to be realistic) out of a hat: we might have AIs that can’t carry on a conversation, but can implement a very sophisticated covert world domination strategy.
It seems fairly likely to me that proto-AGI (i.e. AI that could autonomously learn to become AGI within <~10yrs of acting in the real world) is deployed and creates proto-AGI subagents, some of which we don’t become aware of (e.g. because of accidental/incidental/deliberate steganography) and/or are unable to keep track of. And then those continue to survive and reproduce, etc…
Now I’m wondering if it makes sense to model past or present cognitive-cultural information processes in a similar fashion. Memetic and cultural evolution are a thing, and any agent-like processes that spawn could piggyback on our existing general intelligence architecture.
Yeah, I think it totally does! (and that’s a very interesting / “trippy” line of thought :D)
However, it does seem to me somewhat unlikely, since it does require fairly advanced intelligence, and I don’t think evolution is likely to have produced such advanced intelligence with us being totally unaware, whereas I think something about the way we train AI is more strongly selecting for “savant-like” intelligence, which is sort of what I’m imagining here. I can’t think of why I have that intuition OTTMH.
That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AI’s (e.g. vastly superhuman) skill in some areas, if it still fails at some things we think are easy.
Nobody denies that AI is really good at extracting patterns out of statistical data (e.g. image classification, speech-to-text, and so on), even though AI is absolutely terrible at many “easy” things. This, and the linked comment from Eliezer, seem to be drastically underselling the competence of AI researchers. (I could imagine it happening with strong enough competitive pressures though.)
I also predict that there will be types of failure we will not notice, or will misinterpret. [...]
All of this assumes some very good long-term planning capabilities. I expect long-term planning to be one of the last capabilities that AI systems get. If I thought they would get them early, I’d be more worried about scenarios like these.
So I don’t take EY’s post as about AI researchers’ competence, as much as their incentives and levels of rationality and paranoia. It does include significant competitive pressures, which seems realistic to me.
I don’t think I’m underestimating AI researchers, either, but for a different reason… let me elaborate a bit: I think there are waaaaaay too many skills for us to hope to have a reasonable sense of what an AI is actually good at. By skills I’m imagining something more like options, or having accurate generalized value functions (GVFs), than tasks.
Regarding long-term planning, I’d factor this into 2 components:
1) having a good planning algorithm
2) having a good world model
I think the way long-term planning works is that you do short-term planning in a good hierarchical world model. I think AIs will have vastly superhuman planning algorithms (arguably, they already do), so the real bottleneck is the world-model.
I don’t think it’s necessary to have a very “complete” world-model (i.e. enough knowledge to look smart to a person) in order to find “steganographic” long-term strategies like the ones I’m imagining.
I also don’t think it’s even necessary to have anything that looks very much like a world-model. The AI can just have a few good GVFs.… (i.e. be some sort of savant).
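To gesture at what I mean by a GVF, here is a minimal sketch (purely illustrative; the class and setup are my own toy construction, not a claim about how any actual system works): a GVF is just a learned prediction of some discounted cumulant signal, and an agent could carry many of these without anything that looks like a unified world-model.

```python
import numpy as np

class GVF:
    """A generalized value function: a prediction of a discounted cumulant
    signal, learned here with linear TD(0). A 'skill' in this sense is an
    accurate prediction/control signal, not competence at a named task."""

    def __init__(self, n_features, alpha=0.1):
        self.w = np.zeros(n_features)  # linear prediction weights
        self.alpha = alpha             # step size

    def predict(self, x):
        # x: NumPy feature vector for the current observation/state
        return float(self.w @ x)

    def update(self, x, cumulant, gamma_next, x_next):
        # TD(0): nudge the prediction toward cumulant + gamma * next prediction.
        td_error = cumulant + gamma_next * self.predict(x_next) - self.predict(x)
        self.w += self.alpha * td_error * x
```

A “savant” in this picture is a system with a handful of very accurate GVFs about strategically important signals, and not much else.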
I don’t think the only alternative to proof is empiricism. Lots of people reason about evolutionary biology/psychology with neither proof nor empiricism. The mesa optimizers paper involves neither proof nor empiricism.
it’s very plausible that there’s a discontinuity in behavior around the region of human-level generality of intelligence (specifically as we move to the upper end, where the system can understand things like the whole training regime and its goal systems)
You can also be empirical at that point though? I suppose you couldn’t be empirical if you expect either an extremely fast takeoff (i.e. order one day or less) or an inability on our part to tell when the AI reaches human-level, but this seems overly pessimistic to me.
The mesa-optimizer paper, along with some other examples of important intellectual contributions to AI alignment, has two important properties:
They are part of a research program, not an end result. Rough intuitions can absolutely be a useful guide which (hopefully eventually) helps us figure out what mathematical results are possible and useful.
They primarily point at problems rather than solutions. Because (it seems to me) existential risk seems asymmetrically bad in comparison to potential technology upsides (large as upsides may be), I just have different standards of evidence for “significant risk” vs “significant good”. IE, an argument that there is a risk can be fairly rough and nonetheless be sufficient for me to “not push the button” (in a hypothetical where I could choose to turn on a system today). On the other hand, an argument that pushing the button is net positive has to be actually quite strong. I want there to be a small set of assumptions, each of which individually seem very likely to be true, which taken together would be a guarantee against catastrophic failure.
[This is an “or” condition—either one of those two conditions suffices for me to take vague arguments seriously.]
On the other hand, I agree with you that I set up a false dichotomy between proof and empiricism. Perhaps a better model would be a spectrum between “theory” and empiricism. Mathematical arguments are an extreme point of rigorous theory. Empiricism realistically comes with some amount of theory no matter what. And you could also ask for a “more of both” type approach, implying a 2d picture where they occupy separate dimensions.
Still, though, I personally don’t see much of a way to gain understanding about failure modes of very very capable systems using empirical observation of today’s systems. I especially don’t see an argument that one could expect all failure modes of very very capable systems to present themselves first in less-capable systems.
Because (it seems to me) existential risk seems asymmetrically bad in comparison to potential technology upsides (large as upsides may be), I just have different standards of evidence for “significant risk” vs “significant good”.
This is a normative argument, not an empirical one. The normative position seems reasonable to me, though I’d want to think more about it (I haven’t because it doesn’t seem decision-relevant).
I especially don’t see an argument that one could expect all failure modes of very very capable systems to present themselves first in less-capable systems.
The quick version is that to the extent that the system is adversarially optimizing against you, it had to at some point learn that that was a worthwhile thing to do, which we could notice. (This is assuming that capable systems are built via learning; if not then who knows what’ll happen.)
I am confused about how the normative question isn’t decision-relevant here. Is it that I have a model where it is the relevant question, but you have one where it isn’t? To be hopefully clear: I’m applying this normative claim to argue that proof is needed to establish the desired level of confidence. That doesn’t mean direct proof of the claim “the AI will do good”, but rather of supporting claims, perhaps involving the learning-theoretic properties of the system (putting bounds on errors of certain kinds) and such.
It’s possible that this isn’t my true disagreement, because actually the question seems more complicated than just a question of how large potential downsides are if things go poorly in comparison to potential upsides if things go well. But some kind of analysis of the risks seems relevant here—if there weren’t such large downside risks, I would have lower standards of evidence for claims that things will go well.
The quick version is that to the extent that the system is adversarially optimizing against you, it had to at some point learn that that was a worthwhile thing to do, which we could notice. (This is assuming that capable systems are built via learning; if not then who knows what’ll happen.)
It sounds like we would have to have a longer discussion to resolve this. I don’t expect this to hit the mark very well, but here’s my reply to what I understand:
I don’t see how you can be confident enough of that view for it to be how you really want to check.
A system can be optimizing a fairly good proxy, so that at low levels of capability it is highly aligned, but this falls apart as the system becomes highly capable and figures out “hacks” around the “usual interpretation” of the proxy.
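A toy numerical illustration of that dynamic (all numbers and the setup are invented for illustration): a proxy that tracks the true objective on ordinary states, but diverges badly on extreme states that only a strong optimizer can reach.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=100_000)                        # ordinary states cluster near 0
proxy = states                                           # proxy says "more is better"
true_value = states - 10 * np.maximum(states - 3, 0)**2  # but extreme states are catastrophic

def optimize(candidate_idx):
    """Pick the candidate with the highest proxy value; report its true value."""
    best = candidate_idx[np.argmax(proxy[candidate_idx])]
    return true_value[best]

weak_idx = rng.choice(len(states), size=10)   # weak optimizer: can only search 10 states
strong_idx = np.arange(len(states))           # strong optimizer: searches all of them

print("weak optimizer, true value:  ", optimize(weak_idx))    # modest, and proxy ≈ true here
print("strong optimizer, true value:", optimize(strong_idx))  # proxy-maximal state, negative true value
```

At low capability the proxy and the true objective agree almost everywhere the system can reach; the divergence only shows up once the search is strong enough to land in the tail.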
I also note that it seems like we disagree both about how useful proofs will be and about how useful empirical investigations will be (keeping in mind that those aren’t the only two things in the universe). I’m not sure which of those two disagreements is more important here.
To be hopefully clear: I’m applying this normative claim to argue that proof is needed to establish the desired level of confidence.
Under my model, it’s overwhelmingly likely that regardless of what we do AGI will be deployed with less than the desired level of confidence in its alignment. If I personally controlled whether or not AGI was deployed, then I’d be extremely interested in the normative claim. If I then agreed with the normative claim, I’d agree with:
proof is needed to establish the desired level of confidence. That doesn’t mean direct proof of the claim “the AI will do good”, but rather of supporting claims, perhaps involving the learning-theoretic properties of the system (putting bounds on errors of certain kinds) and such.
I don’t see how you can be confident enough of that view for it to be how you really want to check.
If I want >99% confidence, I agree that I couldn’t be confident enough in that argument.
A system can be optimizing a fairly good proxy, so that at low levels of capability it is highly aligned, but this falls apart as the system becomes highly capable and figures out “hacks” around the “usual interpretation” of the proxy.
Yeah, the hope here would be that the relevant decision-makers are aware of this dynamic (due to previous situations in which e.g. a recommender system optimized the fairly good proxy of clickthrough rate but this led to “hacks” around the “usual interpretation”), and have some good reason to think that it won’t happen with the highly capable system they are planning to deploy.
I also note that it seems like we disagree both about how useful proofs will be and about how useful empirical investigations will be
Agreed. It also might be that we disagree on the tractability of proofs in addition to / instead of the utility of proofs.
Not sure who you have in mind as people believing this, but after searching both LW and Arbital, the closest thing I’ve found to a statement of the empirical claim is from Eliezer’s 2012 Reply to Holden on ‘Tool AI’:
I’ve repeatedly said that the idea behind proving determinism of self-modification isn’t that this guarantees safety, but that if you prove the self-modification stable the AI might work, whereas if you try to get by with no proofs at all, doom is guaranteed.
But I am not yet convinced that stable self-improvement is an especially important problem for AI safety; I think it would be handled correctly by a human-level reasoner as a special case of decision-making under logical uncertainty. This suggests that (1) it will probably be resolved en route to human-level AI, (2) it can probably be “safely” delegated to a human-level AI.
Note that the above talked about “stable self-modification” instead of ‘running this AI system will be beneficial’, and the former is a much narrower and easier to formalize concept than the latter. I haven’t really found a serious proposal to try to formalize and prove the latter kind of statement.
IMO, formalizing ‘running this AI system will be beneficial’ is itself an informal and error-prone process, where the only way to gain confidence in its correctness is for many competent researchers to try and fail to find flaws in the formalization. Instead of doing that, one could gain confidence in the AI’s safety by directly trying to find flaws (considered informally) in the AI design, and trying to prove or demonstrate via empirical testing narrower safety-relevant statements like “stable self-modification”, and given enough resources perhaps reach a similar level of confidence. (So the empirical statement doesn’t seem to make sense as written.)
The former still has the advantage that the size of the thing that might be flawed is much smaller (i.e., just the formalization of ‘running this AI system will be beneficial’ instead of the whole AI design), but it has the disadvantage that finding a proof might be very costly both in terms of research effort and in terms of additional constraint on AI design (to allow for a proof) making the AI less competitive. Overall, it seems like it’s too early to reach a strong conclusion one way or another as to which approach is more advisable.
At some point, there was definitely discussion about formal verification of AI systems. At the very least, this MIRIx event seems to have been about the topic.
An AI built in the Artificial General Intelligence paradigm, in which the design is engineered de novo, has the advantage over humans with respect to transparency of disposition, since it is able to display its source code, which can then be reviewed for trustworthiness (Salamon, Rayhawk, and Kramár 2010; Sotala 2012). Indeed, with an improved intelligence, it might find a way to formally prove its benevolence. If weak early AIs are incentivized to adopt verifiably or even provably benevolent dispositions, these can be continually verified or proved and thus retained, even as the AIs gain in intelligence and eventually reach the point where they have the power to renege without retaliation (Hall 2007a).
When constructing intelligent systems which learn and interact with all the complexities of reality, it is not sufficient to verify that the algorithm behaves well in test settings. Additional work is necessary to verify that the system will continue working as intended in application. This is especially true of systems possessing general intelligence at or above the human level: superintelligent machines might find strategies and execute plans beyond both the experience and imagination of the programmers, making the clever oscillator of Bird and Layzell look trite. At the same time, unpredictable behavior from smarter-than-human systems could cause catastrophic damage, if they are not aligned with human interests (Yudkowsky 2008).
Because the stakes are so high, testing combined with a gut-level intuition that the system will continue to work outside the test environment is insufficient, even if the testing is extensive. It is important to also have a formal understanding of precisely why the system is expected to behave well in application.
What constitutes a formal understanding? It seems essential to us to have both (1) an understanding of precisely what problem the system is intended to solve; and (2) an understanding of precisely why this practical system is expected to solve that abstract problem. The latter must wait for the development of practical smarter-than-human systems, but the former is a theoretical research problem that we can already examine.
I suspect that this approach has fallen out of favor as ML algorithms have gotten more capable while our ability to prove anything useful about those algorithms has heavily lagged behind. Although DeepMind and a few others are still trying.
In point of fact, the real reason the author is listing out this methodology is that he’s currently trying to do something similar on the problem of aligning Artificial General Intelligence, and he would like to move past “I believe my AGI won’t want to kill anyone” and into a headspace more like writing down statements such as “Although the space of potential weightings for this recurrent neural net does contain weight combinations that would figure out how to kill the programmers, I believe that gradient descent on loss function L will only access a result inside subspace Q with properties P, and I believe a space with properties P does not include any weight combinations that figure out how to kill the programmer.”
Though this itself is not really a reduced statement and still has too much goal-laden language in it.
Rather than putting the emphasis on being able to machine-verify all important properties of the system, this puts the emphasis on having strong technical insight into the system; I usually think of formal proofs more as a means to that end. (Again caveating that some people at MIRI might think of this differently.)
Not sure who you have in mind as people believing this
I don’t have particular people in mind, it’s more of a general “vibe” I get from talking to people. In the past, when I’ve stated the empirical claim, some people agreed with it, but upon further discussion it turned out they actually agreed with the normative claim. Hence my first question, which was to ask whether or not people believe the empirical claim.
a) I believe a weaker version of the empirical claim, namely that catastrophe is not nearly inevitable, but also not unlikely. That is, I can imagine different worlds in which the probability of catastrophe is different, and I have uncertainty over which world we actually are in, s.t. on average the probability is sizable.
b) I think that the argument you gave is sort of correct. We need to augment it by: the minimal requirement from the AI is, it needs to effectively block all competing dangerous AI projects, without also doing bad things (which is why you can’t just give it the zero utility function). Your counterargument seems weak to me because moving from utility maximizers to other types of AIs is just replacing something that is relatively easy to reason about with something that is harder to reason about, thereby obscuring the problems (that are still there). I think that whatever your AI is, given that it satisfies the minimal requirement, some kind of utility-maximization-like behavior is likely to arise.
Coming at it from a different angle, complicated systems often fail in unexpected ways. The way people solve this problem in practice is by a combination of mathematical analysis and empirical research. I don’t think we have many examples of complicated systems where all failures were avoided by informal reasoning without either empirical or mathematical backing. In the case of superintelligent AI, empirical research alone is insufficient because, without mathematical models, we don’t know how to extrapolate empirical results from current AIs to superintelligent AIs, and when superintelligent algorithms are already here it will probably be too late.
c) I think what we can (and should) realistically aim for is, having a mathematical theory of AI, and having a mathematical model of our particular AI, such that in this model we can prove the AI is safe. This model will have some assumptions and parameters that will need to be verified/measured in other ways, through some combination of (i) experiments with AI/algorithms (ii) learning from neuroscience (iii) learning from biological evolution and (iv) leveraging our knowledge of physics. Then, there is also the question of, how precise is the correspondence between the model and the actual code (and hardware). Ideally, we want to do formal verification in which we can test that a certain theorem holds for the actual code we are running. Weaker levels of correspondence might still be sufficient, but that would be Plan B.
Also, the proof can rely on mathematical conjectures in which we have high confidence, such as P≠NP. Of course, the evidence for such conjectures is (some sort of) empirical, but it is important that the conjecture is at least a rigorous, well defined mathematical statement.
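To illustrate the intended shape (schematic only; Safe, ErrorBound, and EnvAssumption are stand-in propositions I made up, not real definitions): the theorem itself is conditional, and its hypotheses are exactly the assumptions that have to be verified or measured by the other means listed above.

```lean
-- Schematic: the safety theorem is conditional on assumptions established
-- outside the proof (experiments, neuroscience, physics, ...).
variable (Safe ErrorBound EnvAssumption : Prop)

theorem safe_of_assumptions
    (h  : ErrorBound → EnvAssumption → Safe)  -- the mathematical content of the theory
    (hb : ErrorBound)                         -- checked empirically
    (he : EnvAssumption)                      -- checked empirically
    : Safe :=
  h hb he
```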
I agree with a). c) seems to me to be very optimistic, but that’s mostly an intuition, I don’t have a strong argument against it (and I wouldn’t discourage people who are enthusiastic about it from working on it).
The argument in b) makes sense; I think the part that I disagree with is:
moving from utility maximizers to other types of AIs is just replacing something that is relatively easy to reason about with something that is harder to reason about, thereby obscuring the problems (that are still there).
The counterargument is “current AI systems don’t look like long term planners”, but of course it is possible to respond to that with “AGI will be very different from current AI systems”, and then I have nothing to say beyond “I think AGI will be like current AI systems”.
Well, any system that satisfies the Minimal Requirement is doing long term planning on some level. For example, if your AI is approval directed, it still needs to learn how to make good plans that will be approved. Once your system has a superhuman capability of producing plans somewhere inside, you should worry about that capability being applied in the wrong direction (in particular due to mesa-optimization / daemons). Also, even without long term planning, extreme optimization is dangerous (for example an approval directed AI might create some kind of memetic supervirus).
But, I agree that these arguments are not enough to be confident of the strong empirical claim.
I believe the empirical claim. As I see it, the main issue is Goodhart: an AGI is probably going to be optimizing something, and open-ended optimization tends to go badly. The main purpose of proof-level guarantees is to make damn sure that the optimization target is safe. (You might imagine something other than a utility-maximizer, but at the end of the day it’s either going to perform open-ended optimization of something, or be not very powerful.)
The best analogy here is something like an unaligned wish-granting genie/demon. You want to be really careful about wording that wish, and make sure it doesn’t have any loopholes.
I think the difficulty of getting those proof-level guarantees is more conceptual than technical: the problem is that we don’t have good ways to rigorously express many of the core ideas, e.g. the idea that physical systems made of atoms can “want” things. Once the core problems of embedded agency are resolved, I expect the relevant guarantees will not be difficult.
The second case is simpler. Think about it in analogy to a wish-granting genie/demon: if we have some intuitive argument that our wish-contract is safe and a few human-designed tests, do we really expect it to have no loopholes exploitable by the genie/demon? I certainly wouldn’t bet on it. The problem here is that the AI is smarter than we are, and can find loopholes we will not think of.
The first case is more subtle, because most of the complexity is hidden under a human-intuitive abstraction layer. If we had an unaligned genie/demon and said “I wish for you to passively study me for a year, learn what would make me most happy, and then give me that”, then that might be a safe wish—assuming the genie/demon already has an appropriate understanding of what “happy” means, including things like long-term satisfaction etc. But an AI will presumably not start with such an understanding out the gate. Abstractly, the AI can learn its optimization target, but in order to do that it needs a learning target—the thing it’s trying to learn. And that learning target is itself what needs to be aligned. If we want the AI to learn what makes humans “happy”, in a safe way, then whatever it’s using as a proxy for “happiness” needs to be a safe optimization target.
On a side note, Yudkowsky’s “The Hidden Complexity of Wishes” is in many ways a better explanation of what I’m getting at. The one thing it doesn’t explain is how “more powerful” in the sense of “ability to grant more difficult wishes” translates into a more powerful optimizer. But that’s a pretty easy jump to make: wishes require satisficing, so we use the usual approach of a two-valued utility function.
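Spelling out that jump (a standard construction, not anything specific to the post): a wish picks out a set $W$ of acceptable outcomes, and the corresponding two-valued utility is just the indicator of $W$, so maximizing expected utility means maximizing the probability of landing inside $W$:

$$U(x) = \begin{cases} 1 & x \in W \\ 0 & x \notin W \end{cases} \qquad\Rightarrow\qquad \mathbb{E}[U] = \Pr(x \in W).$$

The harder the wish, the smaller $W$ is, and the more optimization it takes to make that probability large.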
if we have some intuitive argument that our wish-contract is safe and a few human-designed tests, do we really expect it to have no loopholes exploitable by the genie/demon?
I wasn’t imagining just input-output tests in laboratory conditions, which I agree are insufficient. I was thinking of studying counterfactuals, e.g. what the optimization target would suggest doing under the hypothetical scenario where it has lots of power. Alternatively, you could imagine tests of the form “pose this scenario, and see how the AI thinks about it”, e.g. to see whether the AI runs a check for whether it can deceive humans. (Yes, this assumes strong interpretability techniques that we don’t yet have. But if you want to claim that only proofs will work, you either need to claim that interpretability techniques can never be developed, or that even if they are developed they won’t solve the problem.)
Also, I probably should have mentioned this in the previous comment, but it’s not clear to me that it’s accurate to model AGI as an open-ended optimizer, in the same way that that’s not a great model of humans. I don’t particularly want to debate that claim, because those debates never help, but it’s a relevant fact to understanding my position.
I mentioned that I expect proof-level guarantees will be easy once the conceptual problems are worked out. Strong interpretability is part of that: if we know how to “see whether the AI runs a check for whether it can deceive humans”, then I expect systems which provably don’t do that won’t be much extra work. So we might disagree less on that front than it first seemed.
The question of whether to model the AI as an open-ended optimizer is one I figured would come up. I don’t think we need to think of it as truly open-ended in order to use any of the above arguments, especially the wish-granting analogy. The relevant point is that limited optimization implies limited wish-granting ability. In order to grant more “difficult” wishes, the AI needs to steer the universe into a smaller chunk of state-space - in other words, it needs to perform stronger optimization. So AIs with limited optimization capability will be safer to exactly the extent that they are unable to grant unsafe wishes—i.e. the chunks of state-space which they can access just don’t contain really bad outcomes.
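One rough way to make “steering into a smaller chunk of state-space” quantitative is a simplified version of Yudkowsky’s “optimization power” measure: if the system can reliably land the world in a target set $T$ out of a state space $S$ (both weighted by some base measure), it is exerting roughly

$$\mathrm{OP} = \log_2 \frac{|S|}{|T|} \ \text{bits}$$

of optimization. Harder wishes correspond to smaller $T$ and hence more bits, so a system capped at a low number of bits can only grant wishes (good or bad) whose target regions aren’t too small.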
Perhaps the disagreement is in how hard it is to prove things vs. test them. I pretty strongly disagree with
if we know how to “see whether the AI runs a check for whether it can deceive humans”, then I expect systems which provably don’t do that won’t be much extra work.
The version based on testing has to look at a single input scenario to the AI, whereas the proof has to quantify over all possible scenarios. These seem wildly different. Compare to e.g. telling whether Alice is being manipulated by Bob by looking at interactions between Alice and Bob, vs. trying to prove that Bob will never be manipulative. The former seems possible, the latter doesn’t.
First, when I say “proof-level guarantees will be easy”, I mean “team of experts can predictably and reliably do it in a year or two”, not “hacker can do it over the weekend”.
Second, suppose we want to prove that a sorting algorithm always returns sorted output. We don’t do that by explicitly quantifying over all possible outputs. Rather, we do that using some insights into what it means for something to be sorted—e.g. expressing it in terms of a relatively small set of pairwise comparisons. Indeed, the insights needed for the proof are often exactly the same insights needed to design the algorithm. Once you’ve got the insights and the sorting algorithm in hand, the proof isn’t actually that much extra work, although it will still take some experts chewing on it a bit to make sure it’s correct.
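To make the sorting example concrete (ordinary textbook material, nothing novel): the specification is a small set of pairwise comparisons, and the algorithm is written to maintain it as a loop invariant, which is exactly what a proof would lean on.

```python
def is_sorted(xs):
    # The whole specification is pairwise: each element is <= its successor.
    return all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))

def insertion_sort(xs):
    out = []
    for x in xs:
        # Loop invariant: `out` is sorted before and after every insertion.
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        # Everything before position i is <= x and everything from i on is > x,
        # so inserting x here preserves sortedness.
        out.insert(i, x)
    return out

assert is_sorted(insertion_sort([3, 1, 2, 2, 0]))
```

The insight that makes the algorithm work (insert each element at the boundary identified by the comparisons) is the same insight the proof is built from.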
That’s the sort of thing I expect to happen for friendly AI: we are missing some fundamental insights into what it means to be “aligned”. Once those are figured out, I don’t expect proofs to be much harder than algorithms. Coming back to the “see whether the AI runs a check for whether it can deceive humans” example, the proof wouldn’t involve writing the checker and then quantifying over all possible inputs. Rather, it would involve writing the AI in such a way that it always passes the check, by construction—just like we write sorting algorithms so that they will always pass an is_sorted() check by construction.
Third, continuing from the previous point: the question is not how hard it is to prove compared to test. The question is how hard it is to build a provably-correct algorithm, compared to an algorithm which happens to be correct even though we don’t have a proof.
First, when I say “proof-level guarantees will be easy”, I mean “team of experts can predictably and reliably do it in a year or two”, not “hacker can do it over the weekend”.
This was also what I was imagining. (Well, actually, I was also considering more than two years.)
we are missing some fundamental insights into what it means to be “aligned”.
It sounds like our disagreement is the one highlighted in Realism about rationality. When I say we could check whether the AI is deceiving humans, I don’t mean that we have a check that succeeds literally 100% of the time because we have formalized a definition of “deception” that gives us a perfect checker. I don’t think notions like “deception”, “aligned”, “want”, “optimize”, etc. have a clean formal definition that admits a 100% successful checker. I do think that these notions do tend to have extremes that can be reliably identified, even if there are edge cases where it is unclear. This makes testing easy, while proofs remain very difficult.
Jumping back to the original question, it sounds like the reason that you think that if we don’t have proofs we are doomed, is that conditional on us not having proofs, we must not have had any other methods of gaining confidence (such as testing), and so we must be flying blind. Is that right?
If so, how do you square this with other engineering disciplines, which typically place most of the confidence in safety on comprehensive, expensive testing (think wind tunnels for rockets or crash tests for cars)? Perhaps this is also explained by realism about rationality—maybe physical phenomena aren’t amenable to crisp formal definitions, but “alignment” is.
It does sound like our disagreement is the same thing outlined in Realism about Rationality (although I disagree with almost all of the “realism about rationality” examples in that post—e.g. I don’t think AGI will necessarily be an “agent”, I don’t think Turing machines or Kolmogorov complexity are useful foundations for epistemology, I’m not bothered by moral intuitions containing contradictions, etc).
I would also describe my “no proofs ⇒ doomed” view, not as the proofs being causally important, but as the proofs being evidence of understanding. If we don’t have the proofs, it’s highly unlikely that we understand the system well enough to usefully predict whether it is safe—but the proofs themselves play a relatively minor role.
I do not know of any engineering discipline which places most of the confidence in safety on comprehensive, expensive testing. Every single engineering discipline I have ever studied starts from understanding the system under design, the principles which govern its function, and designs a system which is expected to be safe based on that understanding. As long as those underlying principles are understood, the most likely errors are either simple mistakes (e.g. metric/standard units mixup) or missing some fundamental phenomenon (e.g. aerodynamics of a bridge). Those are the sort of problems which testing is good at catching. Testing is a double-check that we haven’t missed something critical; it is not the primary basis for thinking the system is safe.
A simple example, in contrast to AI: every engineering discipline I know of uses “safety factors”—i.e. make a beam twice as strong as it needs to be, give a wire twice the current capacity it needs, etc. A safety factor of 2 is typical in a wide variety of engineering fields. In AI, we cannot use safety factors because we do not even know what number we could double to make the AI more safe. Today, given any particular aspect of an AI system, we do not know whether adjusting any particular parameter will make the AI more or less reliable/risky.
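For reference, the quantity being doubled in those fields is a well-defined scalar:

$$\text{safety factor} = \frac{\text{capacity of the component}}{\text{maximum expected load}},$$

e.g. a beam expected to carry at most 10 kN is built to hold 20 kN. The contrast with AI is that we have no analogous scalar whose doubling we know reduces risk.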
The problem with tests is that the AI behaving well when weak enough to be tested doesn’t guarantee it will continue to do so.
If you are testing a system, that means that you are not confident that it is safe. If it isn’t safe, then your only hope is for humans to stop it. Testing an AI is very dangerous unless you are confident that it can’t harm you.
A paperclip maximizer would try to pass your tests until it was powerful enough to trick its way out and take over. Black box testing of arbitrary AIs gets you very little safety.
Also, some people’s intuitions say that a smile-maximizing AI is a good idea. If you have a straightforward argument that appeals to the intuitions of the average Joe Bloggs, but can’t be easily formalized, then I would take the difficulty of formalizing it as evidence that the argument is not sound.
If you take a neural network and train it to recognize smiling faces, then attach that to AIXI, you get a machine that will appear to work in the lab, when the best it can do is make the scientists smile into its camera. There will be an intuitive argument about how it wants to make people smile, and people smile when they are happy. The AI will tile the universe with cameras pointed at smiley faces as soon as it escapes the lab.
A slightly misspecified reward function can lead to anything from perfectly aligned behavior to catastrophic failure. So I think we need much stronger and more formal arguments to believe that catastrophe is almost inevitable than EY’s genie post provides.
I think a potentially more interesting question is not about running a single AI system, but rather the overall impact of AI technology (in a world where we don’t have proofs of things like beneficence). It would be easier to hold the analogue of the empirical claim there.
I hold a nuanced view that I believe is more similar to the empirical claim than your views.
I think what we want is an extremely high level of justified confidence that any AI system or technology that is likely to become widely available is not carrying a significant and non-decreasing amount of Xrisk-per-second. And it seems incredibly difficult and likely impossible to have such an extremely high level of justified confidence.
Formal verification and proof seem like the best we can do now, but I agree with you that we shouldn’t rule out other approaches to achieving extreme levels of justified confidence. What it all points at to me is the need for more work on epistemology, so that we can begin to understand how extreme levels of confidence actually operate.
I *do* put a non-trivial weight on models where the empirical claim is true, and not just out of epistemic humility. But overall, I’m epistemically humble enough these days to think it’s not reasonable to say “nearly inevitable” if you integrate out epistemic uncertainty.
But maybe it’s enough to have reasons for putting non-trivial weight on the empirical claim to be able to answer the other questions meaningfully?
Or are you just trying to see if anyone can defeat the epistemic humility “trump card”?
Or are you just trying to see if anyone can defeat the epistemic humility “trump card”?
Partly (I’m surprised by how confident people generally seem to be, but that could just be a misinterpretation of their position), but also on my inside view the empirical claim is not true and I wanted to see if there were convincing arguments for it.
But maybe it’s enough to have reasons for putting non-trivial weight on the empirical claim to be able to answer the other questions meaningfully?
I’m not sure I have much more than the standard MIRI-style arguments about convergent rationality and fragility of human values, at least nothing is jumping to mind ATM. I do think we probably disagree about how strong those arguments are. I’m actually more interested in hearing your take on those lines of argument than saying mine ATM :P
Re: convergent rationality, I don’t buy it (specifically the “convergent” part).
Re: fragility of human values, I do buy the notion of a broad basin of corrigibility, which presumably is less fragile.
But really my answer is “there are lots of ways you can get confidence in a thing that are not proofs”. I think the strongest argument against is “when you have an adversary optimizing against you, nothing short of proofs can give you confidence”, which seems to be somewhat true in security. But then I think there are ways that you can get confidence in “the AI system will not adversarially optimize against me” using techniques that are not proofs.
(Note the alternative to proofs is not trial and error. I don’t use trial and error to successfully board a flight, but I also don’t have a proof that my strategy is going to cause me to successfully board a flight.)
But really my answer is “there are lots of ways you can get confidence in a thing that are not proofs”.
Totally agree; it’s an under-appreciated point!
Here’s my counter-argument: we have no idea what epistemological principles explain this empirical observation. Therefore, we don’t actually know that the confidence we achieve in these ways is justified. So we may just be wrong to be confident in our ability to successfully board flights (etc.).
The epistemic/aleatory distinction is relevant here. Taking an expectation over both kinds of uncertainty, we can achieve a high level of subjective confidence in such things / via such means. However, we may be badly mistaken, and thus still extremely likely objectively speaking to be wrong.
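In symbols (just restating the point, with $w$ ranging over ways the world could be and $A$ the event “the informal method gives us true beliefs”):

$$\Pr_{\text{subjective}}(A) = \sum_{w} \Pr(w)\,\Pr(A \mid w),$$

which can be comfortably high even if, in the world $w^{*}$ we actually inhabit, $\Pr(A \mid w^{*})$ is very low.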
This also probably explains a lot of the disagreement, since different people probably just have very different prior beliefs about how likely this kind of informal reasoning is to give us true beliefs about advanced AI systems.
I’m personally quite uncertain about that question, ATM. I tend to think we can get pretty far with this kind of informal reasoning in the “early days” of (proto-)AGI development, but we become increasingly likely to fuck up as we start having to deal with vastly super-human intelligences. I would also like to see more work in epistemology aimed at addressing this (and other Xrisk-relevant concerns, e.g.: what principles of “social epistemology” would allow the human community to effectively manage collective knowledge that is far beyond what any individual can grasp? I’d argue we’re in the process of failing catastrophically at that).
This is also the topic of The Rocket Alignment Problem.
Interesting. Your crux seems good; I think it’s a crux for us. I expect things play out more like Eliezer predicts here: https://www.facebook.com/jefftk/posts/886930452142?comment_id=886983450932&comment_tracking=%7B%22tn%22%3A%22R%22%7D&hc_location=ufi
I also predict that there will be types of failure we will not notice, or will misinterpret. It seems fairly likely to me proto-AGI (i.e. AI that could autonomously learn to become AGI within <~10yrs of acting in the real world) is deployed and creates proto-AGI subagents, some of which we don’t become aware of (e.g. because accidental/incidental/deliberate steganography) and/or are unable to keep track of. And then those continue to survive and reproduce, etc… I guess this only seems plausible if the proto-AGI has a hospitable environment (like the internet, human brains/memes) and/or means of reproduction in the real world.
A very similar problem would be a form of longer-term “seeding”, where an AI (at any stage) with a sufficiently advanced model of the world and long horizons discovers strategies for increasing the chances (“at the margin”) that its values dominate in the long-term future. With my limited knowledge of physics, I imagine there might be ways of doing this just by beaming signals into space in a way calculated to influence/spur the development of life/culture in other parts of the galaxy.
I notice a lot of what I said above makes less sense if you think of AIs as having a similar skill profile to humans, but I think we agree that AIs might be much more advanced than people in some respects while still falling short of AGI because of weaknesses in other areas.
That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AI’s (e.g. vastly superhuman) skill in some areas if it still fails at some things we think are easy. To pull an example (not meant to be realistic) out of a hat: we might have AIs that can’t carry on a conversation, but can implement a very sophisticated covert world-domination strategy.
Now I’m wondering if it makes sense to model past or present cognitive-cultural information processes in a similar fashion. Memetic and cultural evolution are a thing, and any agent-like processes they spawn could piggyback on our existing general-intelligence architecture.
Yeah, I think it totally does! (and that’s a very interesting / “trippy” line of thought :D)
However, it does seem to me somewhat unlikely, since it does require fairly advanced intelligence, and I don’t think evolution is likely to have produced such advanced intelligence with us being totally unaware, whereas I think something about the way we train AI is more strongly selecting for “savant-like” intelligence, which is sort of what I’m imagining here. I can’t think of why I have that intuition OTTMH.
Nobody denies that AI is really good at extracting patterns out of statistical data (e.g. image classification, speech-to-text, and so on), even though AI is absolutely terrible at many “easy” things. This, and the linked comment from Eliezer, seem to be drastically underselling the competence of AI researchers. (I could imagine it happening with strong enough competitive pressures though.)
All of this assumes some very good long-term planning capabilities. I expect long-term planning to be one of the last capabilities that AI systems get. If I thought they would get them early, I’d be more worried about scenarios like these.
So I don’t take EY’s post as about AI researchers’ competence, as much as their incentives and levels of rationality and paranoia. It does include significant competitive pressures, which seems realistic to me.
I don’t think I’m underestimating AI researchers, either, but for a different reason… let me elaborate a bit: I think there are waaaaaay too many skills for us to hope to have a reasonable sense of what an AI is actually good at. By skills I’m imagining something more like options, or having accurate generalized value functions (GVFs), than tasks.
Regarding long-term planning, I’d factor this into 2 components:
1) having a good planning algorithm
2) having a good world model
I think the way long-term planning works is that you do short-term planning in a good hierarchical world model. I think AIs will have vastly superhuman planning algorithms (arguably, they already do), so the real bottleneck is the world-model.
I don’t think it’s necessary to have a very “complete” world-model (i.e. enough knowledge to look smart to a person) in order to find “steganographic” long-term strategies like the ones I’m imagining.
I also don’t think it’s even necessary to have anything that looks very much like a world-model. The AI can just have a few good GVFs… (i.e. be some sort of savant).
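(For concreteness, and going from memory of the standard definition rather than anything in this thread: a general value function is roughly a prediction of a discounted cumulant $c$ under a policy $\pi$ with continuation function $\gamma$,

$$v_{\pi,c,\gamma}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{k=0}^{\infty}\Big(\prod_{j=1}^{k}\gamma(S_{t+j})\Big)\,c(S_{t+k+1})\;\middle|\;S_t = s\right].$$

A “savant” in this sense would be an agent with a handful of very accurate predictions of this form, without anything that looks like a unified world-model.)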
I don’t think the only alternative to proof is empiricism. Lots of people reason about evolutionary biology/psychology with neither proof nor empiricism. The mesa optimizers paper involves neither proof nor empiricism.
You can also be empirical at that point though? I suppose you couldn’t be empirical if you expect either an extremely fast takeoff (i.e. on the order of one day or less) or an inability on our part to tell when the AI reaches human level, but this seems overly pessimistic to me.
The mesa-optimizer paper, along with some other important intellectual contributions to AI alignment, has two important properties:
They are part of a research program, not an end result. Rough intuitions can absolutely be a useful guide which (hopefully eventually) helps us figure out what mathematical results are possible and useful.
They primarily point at problems rather than solutions. Because (it seems to me) existential risk is asymmetrically bad in comparison to potential technology upsides (large as those upsides may be), I just have different standards of evidence for “significant risk” vs. “significant good”. I.e., an argument that there is a risk can be fairly rough and nonetheless be sufficient for me to “not push the button” (in a hypothetical where I could choose to turn on a system today). On the other hand, an argument that pushing the button is net positive has to be actually quite strong: I want there to be a small set of assumptions, each of which individually seems very likely to be true, which taken together would be a guarantee against catastrophic failure.
[This is an “or” condition—either one of those two conditions suffices for me to take vague arguments seriously.]
On the other hand, I agree with you that I set up a false dichotomy between proof and empiricism. Perhaps a better model would be a spectrum between “theory” and empiricism. Mathematical arguments are an extreme point of rigorous theory. Empiricism realistically comes with some amount of theory no matter what. And you could also ask for a “more of both” type approach, implying a 2d picture where they occupy separate dimensions.
Still, though, I personally don’t see much of a way to gain understanding about failure modes of very very capable systems using empirical observation of today’s systems. I especially don’t see an argument that one could expect all failure modes of very very capable systems to present themselves first in less-capable systems.
This is a normative argument, not an empirical one. The normative position seems reasonable to me, though I’d want to think more about it (I haven’t because it doesn’t seem decision-relevant).
The quick version is that to the extent that the system is adversarially optimizing against you, it had to at some point learn that that was a worthwhile thing to do, which we could notice. (This is assuming that capable systems are built via learning; if not then who knows what’ll happen.)
I am confused about how the normative question isn’t decision-relevant here. Is it that I have a model where it is the relevant question, but you have one where it isn’t? To be hopefully clear: I’m applying this normative claim to argue that proof is needed to establish the desired level of confidence. That doesn’t mean direct proof of the claim “the AI will do good”, but rather of supporting claims, perhaps involving the learning-theoretic properties of the system (putting bounds on errors of certain kinds) and such.
It’s possible that this isn’t my true disagreement, because the question actually seems more complicated than just comparing how large the potential downsides are if things go poorly with how large the potential upsides are if things go well. But some kind of analysis of the risks seems relevant here: if there weren’t such large downside risks, I would have lower standards of evidence for claims that things will go well.
It sounds like we would have to have a longer discussion to resolve this. I don’t expect this to hit the mark very well, but here’s my reply to what I understand:
I don’t see how you can be confident enough of that view for it to be how you really want to check.
A system can be optimizing a fairly good proxy, so that at low levels of capability it is highly aligned, but this falls apart as the system becomes highly capable and figures out “hacks” around the “usual interpretation” of the proxy.
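(A toy illustration of that dynamic, with made-up functional forms of my own rather than anything from the discussion: a proxy that tracks the true objective under weak optimization pressure but diverges under strong optimization.)

```python
# Hypothetical toy example: a "fairly good proxy" vs. the true objective
# under increasing optimization pressure. Numbers and functions are made up.
def true_value(x):   # e.g. long-term user satisfaction
    return x - 0.1 * x ** 2

def proxy_value(x):  # e.g. clickthrough rate
    return x

for effort in [1, 3, 10, 30]:
    print(f"effort={effort:>2}  proxy={proxy_value(effort):6.1f}  "
          f"true={true_value(effort):6.1f}")
# At low effort the proxy and the true value rise together; at high effort
# the proxy keeps climbing while the true value collapses.
```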
I also note that it seems like we disagree both about how useful proofs will be and about how useful empirical investigations will be (keeping in mind that those aren’t the only two things in the universe). I’m not sure which of those two disagreements is more important here.
Under my model, it’s overwhelmingly likely that regardless of what we do AGI will be deployed with less than the desired level of confidence in its alignment. If I personally controlled whether or not AGI was deployed, then I’d be extremely interested in the normative claim. If I then agreed with the normative claim, I’d agree with:
If I want >99% confidence, I agree that I couldn’t be confident enough in that argument.
Yeah, the hope here would be that the relevant decision-makers are aware of this dynamic (due to previous situations in which e.g. a recommender system optimized the fairly good proxy of clickthrough rate, but this led to “hacks” around the “usual interpretation”), and have some good reason to think that it won’t happen with the highly capable system they are planning to deploy.
Agreed. It also might be that we disagree on the tractability of proofs in addition to / instead of the utility of proofs.
Not sure who you have in mind as people believing this, but after searching both LW and Arbital, the closest thing I’ve found to a statement of the empirical claim is from Eliezer’s 2012 Reply to Holden on ‘Tool AI’:
Paul Christiano argued against this at length in Stable self-improvement as an AI safety problem, concluding as follows:
Note that the above talked about “stable self-modification” instead of ‘running this AI system will be beneficial’, and the former is a much narrower and easier to formalize concept than the latter. I haven’t really found a serious proposal to try to formalize and prove the latter kind of statement.
IMO, formalizing ‘running this AI system will be beneficial’ is itself an informal and error-prone process, where the only way to gain confidence in its correctness is for many competent researchers to try and fail to find flaws in the formalization. Instead of doing that, one could gain confidence in the AI’s safety by directly trying to find flaws (considered informally) in the AI design, and trying to prove or demonstrate via empirical testing narrower safety-relevant statements like “stable self-modification”, and given enough resources perhaps reach a similar level of confidence. (So the empirical statement doesn’t seem to make sense as written.)
The former still has the advantage that the size of the thing that might be flawed is much smaller (i.e., just the formalization of ‘running this AI system will be beneficial’ instead of the whole AI design), but it has the disadvantage that finding a proof might be very costly both in terms of research effort and in terms of additional constraint on AI design (to allow for a proof) making the AI less competitive. Overall, it seems like it’s too early to reach a strong conclusion one way or another as to which approach is more advisable.
At some point, there was definitely discussion about formal verification of AI systems. At the very least, this MIRIx event seems to have been about the topic.
From Safety Engineering for Artificial General Intelligence:
Also, from section 2 of Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda:
I suspect that this approach has fallen out of favor as ML algorithms have gotten more capable while our ability to prove anything useful about those algorithms has heavily lagged behind, although DeepMind and a few others are still trying.
MIRIx events are funded by MIRI, but we don’t decide the topics or anything. I haven’t taken a poll of MIRI researchers to see how enthusiastic different people are about formal verification, but AFAIK Nate and Eliezer don’t see it as super relevant. See https://www.lesswrong.com/posts/xCpuSfT5Lt6kkR3po/my-take-on-agent-foundations-formalizing-metaphilosophical#cGuMRFSi224RCNBZi and the idea of a “safety-story” in https://www.lesswrong.com/posts/8gqrbnW758qjHFTrH/security-mindset-and-ordinary-paranoia for better attempts to characterize what MIRI is looking for.
ETA: From the end of the latter dialogue,
Rather than putting the emphasis on being able to machine-verify all important properties of the system, this puts the emphasis on having strong technical insight into the system; I usually think of formal proofs more as a means to that end. (Again caveating that some people at MIRI might think of this differently.)
Also the discussion of deconfusion research in https://intelligence.org/2018/11/22/2018-update-our-new-research-directions/ and https://www.lesswrong.com/posts/Gg9a4y8reWKtLe3Tn/the-rocket-alignment-problem , and the sketch of ‘why this looks like a hard problem in general’ in https://www.lesswrong.com/posts/zEvqFtT4AtTztfYC4/optimization-amplifies and https://arbital.com/p/aligning_adds_time/ .
I don’t have particular people in mind, it’s more of a general “vibe” I get from talking to people. In the past, when I’ve stated the empirical claim, some people agreed with it, but upon further discussion it turned out they actually agreed with the normative claim. Hence my first question, which was to ask whether or not people believe the empirical claim.
a) I believe a weaker version of the empirical claim, namely that catastrophe is not nearly inevitable, but also not unlikely. That is, I can imagine different worlds in which the probability of catastrophe is different, and I have uncertainty over which world we are actually in, s.t. on average the probability is sizable.
b) I think that the argument you gave is sort of correct. We need to augment it by: the minimal requirement from the AI is that it needs to effectively block all competing dangerous AI projects, without also doing bad things (which is why you can’t just give it the zero utility function). Your counterargument seems weak to me because moving from utility maximizers to other types of AIs is just replacing something that is relatively easy to reason about with something that is harder to reason about, thereby obscuring the problems (which are still there). I think that whatever your AI is, given that it satisfies the minimal requirement, some kind of utility-maximization-like behavior is likely to arise.
Coming at it from a different angle, complicated systems often fail in unexpected ways. The way people solve this problem in practice is by a combination of mathematical analysis and empirical research. I don’t think we have many examples of complicated systems where all failures were avoided by informal reasoning without either empirical or mathematical backing. In the case of superintelligent AI, empirical research alone is insufficient because, without mathematical models, we don’t know how to extrapolate empirical results from current AIs to superintelligent AIs, and when superintelligent algorithms are already here it will probably be too late.
c) I think what we can (and should) realistically aim for is, having a mathematical theory of AI, and having a mathematical model of our particular AI, such that in this model we can prove the AI is safe. This model will have some assumptions and parameters that will need to be verified/measured in other ways, through some combination of (i) experiments with AI/algorithms (ii) learning from neuroscience (iii) learning from biological evolution and (iv) leveraging our knowledge of physics. Then, there is also the question of, how precise is the correspondence between the model and the actual code (and hardware). Ideally, we want to do formal verification in which we can test that a certain theorem holds for the actual code we are running. Weaker levels of correspondence might still be sufficient, but that would be Plan B.
Also, the proof can rely on mathematical conjectures in which we have high confidence, such as P≠NP. Of course, the evidence for such conjectures is (some sort of) empirical, but it is important that the conjecture is at least a rigorous, well defined mathematical statement.
I agree with a). c) seems to me to be very optimistic, but that’s mostly an intuition, I don’t have a strong argument against it (and I wouldn’t discourage people who are enthusiastic about it from working on it).
The argument in b) makes sense; I think the part that I disagree with is:
The counterargument is “current AI systems don’t look like long term planners”, but of course it is possible to respond to that with “AGI will be very different from current AI systems”, and then I have nothing to say beyond “I think AGI will be like current AI systems”.
Well, any system that satisfies the Minimal Requirement is doing long term planning on some level. For example, if your AI is approval directed, it still needs to learn how to make good plans that will be approved. Once your system has a superhuman capability of producing plans somewhere inside, you should worry about that capability being applied in the wrong direction (in particular due to mesa-optimization / daemons). Also, even without long term planning, extreme optimization is dangerous (for example an approval directed AI might create some kind of memetic supervirus).
But, I agree that these arguments are not enough to be confident of the strong empirical claim.
I believe the empirical claim. As I see it, the main issue is Goodhart: an AGI is probably going to be optimizing something, and open-ended optimization tends to go badly. The main purpose of proof-level guarantees is to make damn sure that the optimization target is safe. (You might imagine something other than a utility-maximizer, but at the end of the day it’s either going to perform open-ended optimization of something, or be not very powerful.)
The best analogy here is something like an unaligned wish-granting genie/demon. You want to be really careful about wording that wish, and make sure it doesn’t have any loopholes.
I think the difficulty of getting those proof-level guarantees is more conceptual than technical: the problem is that we don’t have good ways to rigorously express many of the core ideas, e.g. the idea that physical systems made of atoms can “want” things. Once the core problems of embedded agency are resolved, I expect the relevant guarantees will not be difficult.
Does it make a difference if the optimization target is itself being learned?
What if we have intuitive arguments + tests that suggest that the optimization target is safe?
Still unsafe, in both cases.
The second case is simpler. Think about it in analogy to a wish-granting genie/demon: if we have some intuitive argument that our wish-contract is safe and a few human-designed tests, do we really expect it to have no loopholes exploitable by the genie/demon? I certainly wouldn’t bet on it. The problem here is that the AI is smarter than we are, and can find loopholes we will not think of.
The first case is more subtle, because most of the complexity is hidden under a human-intuitive abstraction layer. If we had an unaligned genie/demon and said “I wish for you to passively study me for a year, learn what would make me most happy, and then give me that”, then that might be a safe wish—assuming the genie/demon already has an appropriate understanding of what “happy” means, including things like long-term satisfaction etc. But an AI will presumably not start with such an understanding out the gate. Abstractly, the AI can learn its optimization target, but in order to do that it needs a learning target—the thing it’s trying to learn. And that learning target is itself what needs to be aligned. If we want the AI to learn what makes humans “happy”, in a safe way, then whatever it’s using as a proxy for “happiness” needs to be a safe optimization target.
On a side note, Yudkowsky’s “The Hidden Complexity of Wishes” is in many ways a better explanation of what I’m getting at. The one thing it doesn’t explain is how “more powerful” in the sense of “ability to grant more difficult wishes” translates into a more powerful optimizer. But that’s a pretty easy jump to make: wishes require satisficing, so we use the usual approach of a two-valued utility function.
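(Spelling that jump out, in my own notation rather than Yudkowsky’s: a wish picks out a set $W$ of acceptable outcomes, and the corresponding two-valued utility function is

$$U(x) \;=\; \begin{cases} 1 & \text{if } x \in W \text{ (the wish is satisfied)}\\ 0 & \text{otherwise,}\end{cases}$$

so maximizing expected utility just means maximizing the probability of landing in $W$, and a “more difficult” wish, i.e. a smaller $W$, demands a stronger optimizer.)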
I wasn’t imagining just input-output tests in laboratory conditions, which I agree are insufficient. I was thinking of studying counterfactuals, e.g. what the optimization target would suggest doing under the hypothetical scenario where it has lots of power. Alternatively, you could imagine tests of the form “pose this scenario, and see how the AI thinks about it”, e.g. to see whether the AI runs a check for whether it can deceive humans. (Yes, this assumes strong interpretability techniques that we don’t yet have. But if you want to claim that only proofs will work, you either need to claim that interpretability techniques can never be developed, or that even if they are developed they won’t solve the problem.)
Also, I probably should have mentioned this in the previous comment, but it’s not clear to me that it’s accurate to model AGI as an open-ended optimizer, in the same way that that’s not a great model of humans. I don’t particularly want to debate that claim, because those debates never help, but it’s a relevant fact to understanding my position.
I mentioned that I expect proof-level guarantees will be easy once the conceptual problems are worked out. Strong interpretability is part of that: if we know how to “see whether the AI runs a check for whether it can deceive humans”, then I expect systems which provably don’t do that won’t be much extra work. So we might disagree less on that front than it first seemed.
The question of whether to model the AI as an open-ended optimizer is one I figured would come up. I don’t think we need to think of it as truly open-ended in order to use any of the above arguments, especially the wish-granting analogy. The relevant point is that limited optimization implies limited wish-granting ability. In order to grant more “difficult” wishes, the AI needs to steer the universe into a smaller chunk of state-space; in other words, it needs to perform stronger optimization. So AIs with limited optimization capability will be safer to exactly the extent that they are unable to grant unsafe wishes, i.e. the chunks of state-space which they can access just don’t contain really bad outcomes.
Perhaps the disagreement is in how hard it is to prove things vs. test them. I pretty strongly disagree with
The version based on testing has to look at a single input scenario to the AI, whereas the proof has to quantify over all possible scenarios. These seem wildly different. Compare to e.g. telling whether Alice is being manipulated by Bob by looking at interactions between Alice and Bob, vs. trying to prove that Bob will never be manipulative. The former seems possible, the latter doesn’t.
Three possibly-relevant points here.
First, when I say “proof-level guarantees will be easy”, I mean “team of experts can predictably and reliably do it in a year or two”, not “hacker can do it over the weekend”.
Second, suppose we want to prove that a sorting algorithm always returns sorted output. We don’t do that by explicitly quantifying over all possible outputs. Rather, we do that using some insights into what it means for something to be sorted—e.g. expressing it in terms of a relatively small set of pairwise comparisons. Indeed, the insights needed for the proof are often exactly the same insights needed to design the algorithm. Once you’ve got the insights and the sorting algorithm in hand, the proof isn’t actually that much extra work, although it will still take some experts chewing on it a bit to make sure it’s correct.
That’s the sort of thing I expect to happen for friendly AI: we are missing some fundamental insights into what it means to be “aligned”. Once those are figured out, I don’t expect proofs to be much harder than algorithms. Coming back to the “see whether the AI runs a check for whether it can deceive humans” example, the proof wouldn’t involve writing the checker and then quantifying over all possible inputs. Rather, it would involve writing the AI in such a way that it always passes the check, by construction, just like we write sorting algorithms so that they will always pass an is_sorted() check by construction (see the toy sketch after these points).
Third, continuing from the previous point: the question is not how hard it is to prove compared to test. The question is how hard it is to build a provably-correct algorithm, compared to an algorithm which happens to be correct even though we don’t have a proof.
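(Here’s the toy sketch of the “passes is_sorted() by construction” point; the code is mine and purely illustrative. The sortedness check reduces to adjacent pairwise comparisons, and an insertion sort maintains sortedness as a loop invariant, so its output passes the check by construction.)

```python
def is_sorted(xs):
    # "Sorted" reduces to a small set of pairwise comparisons:
    # by transitivity, checking adjacent pairs is enough.
    return all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))

def insertion_sort(xs):
    out = []
    for x in xs:
        # Invariant: `out` is sorted before and after every insertion,
        # so the final result passes is_sorted() by construction.
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

assert is_sorted(insertion_sort([3, 1, 2, 2, 0]))
```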
This was also what I was imagining. (Well, actually, I was also considering more than two years.)
It sounds like our disagreement is the one highlighted in Realism about rationality. When I say we could check whether the AI is deceiving humans, I don’t mean that we have a check that succeeds literally 100% of the time because we have formalized a definition of “deception” that gives us a perfect checker. I don’t think notions like “deception”, “aligned”, “want”, “optimize”, etc. have a clean formal definition that admits a 100% successful checker. I do think that these notions do tend to have extremes that can be reliably identified, even if there are edge cases where it is unclear. This makes testing easy, while proofs remain very difficult.
Jumping back to the original question, it sounds like the reason that you think that if we don’t have proofs we are doomed, is that conditional on us not having proofs, we must not have had any other methods of gaining confidence (such as testing), and so we must be flying blind. Is that right?
If so, how do you square this with other engineering disciplines, which typically place most of the confidence in safety on comprehensive, expensive testing (think wind tunnels for rockets or crash tests for cars)? Perhaps this is also explained by realism about rationality—maybe physical phenomena aren’t amenable to crisp formal definitions, but “alignment” is.
It does sound like our disagreement is the same thing outlined in Realism about Rationality (although I disagree with almost all of the “realism about rationality” examples in that post—e.g. I don’t think AGI will necessarily be an “agent”, I don’t think Turing machines or Kolmogorov complexity are useful foundations for epistemology, I’m not bothered by moral intuitions containing contradictions, etc).
I would also describe my “no proofs ⇒ doomed” view, not as the proofs being causally important, but as the proofs being evidence of understanding. If we don’t have the proofs, it’s highly unlikely that we understand the system well enough to usefully predict whether it is safe—but the proofs themselves play a relatively minor role.
I do not know of any engineering discipline which places most of the confidence in safety on comprehensive, expensive testing. Every single engineering discipline I have ever studied starts from understanding the system under design, the principles which govern its function, and designs a system which is expected to be safe based on that understanding. As long as those underlying principles are understood, the most likely errors are either simple mistakes (e.g. metric/standard units mixup) or missing some fundamental phenomenon (e.g. aerodynamics of a bridge). Those are the sort of problems which testing is good at catching. Testing is a double-check that we haven’t missed something critical; it is not the primary basis for thinking the system is safe.
A simple example, in contrast to AI: every engineering discipline I know of uses “safety factors”—i.e. make a beam twice as strong as it needs to be, give a wire twice the current capacity it needs, etc. A safety factor of 2 is typical in a wide variety of engineering fields. In AI, we cannot use safety factors because we do not even know what number we could double to make the AI more safe. Today, given any particular aspect of an AI system, we do not know whether adjusting any particular parameter will make the AI more or less reliable/risky.
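(For contrast, the safety-factor recipe in conventional engineering really is that simple; the numbers below are purely illustrative.)

```python
# Purely illustrative numbers: conventional safety-factor sizing of a beam.
expected_max_load_kN = 50.0
safety_factor = 2.0                      # typical across many fields
required_capacity_kN = expected_max_load_kN * safety_factor
print(required_capacity_kN)              # design for 100 kN, not 50 kN
```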
The problem with tests is that the AI behaving well when weak enough to be tested doesn’t guarantee it will continue to do so.
If you are testing a system, that means that you are not confident that it is safe. If it isn’t safe, then your only hope is for humans to stop it. Testing an AI is very dangerous unless you are confident that it can’t harm you.
A paperclip maximizer would try to pass your tests until it was powerful enough to trick its way out and take over. Black-box testing of arbitrary AIs gets you very little safety.
Also, some people’s intuitions say that a smile-maximizing AI is a good idea. If you have a straightforward argument that appeals to the intuitions of the average Joe Bloggs, but can’t be easily formalized, then I would take the difficulty of formalizing it as evidence that the argument is not sound.
If you take a neural network and train it to recognize smiling faces, then attach that to AIXI, you get a machine that will appear to work in the lab, when the best it can do is make the scientists smile into its camera. There will be an intuitive argument about how it wants to make people smile, and people smile when they are happy. The AI will tile the universe with cameras pointed at smiley faces as soon as it escapes the lab.
See response to johnswentworth above.
A slightly misspecified reward function can lead to anything from perfectly aligned behavior to catastrophic failure. So I think we need much stronger and more formal arguments to believe that catastrophe is almost inevitable than EY’s genie post provides.
I think a potentially more interesting question is not about running a single AI system, but rather the overall impact of AI technology (in a world where we don’t have proofs of things like beneficence). It would be easier to hold the analogue of the empirical claim there.
I’d also argue against the empirical claim in that setting; do you agree with the empirical claim there?
I hold a nuanced view that I believe is more similar to the empirical claim than your views.
I think what we want is an extremely high level of justified confidence that any AI system or technology that is likely to become widely available is not carrying a significant and non-decreasing amount of Xrisk-per-second.
And it seems incredibly difficult and likely impossible to have such an extremely high level of justified confidence.
Formal verification and proof seem like the best we can do now, but I agree with you that we shouldn’t rule out other approaches to achieving extreme levels of justified confidence. What it all points at to me is the need for more work on epistemology, so that we can begin to understand how extreme levels of confidence actually operate.
This sounds like the normative claim, not the empirical one, given that you said “what we want is...”
Yep, good catch ;)
I *do* put a non-trivial weight on models where the empirical claim is true, and not just out of epistemic humility. But overall, I’m epistemically humble enough these days to think it’s not reasonable to say “nearly inevitable” if you integrate out epistemic uncertainty.
But maybe it’s enough to have reasons for putting non-trivial weight on the empirical claim to be able to answer the other questions meaningfully?
Or are you just trying to see if anyone can defeat the epistemic humility “trump card”?
Partly (I’m surprised by how confident people generally seem to be, but that could just be a misinterpretation of their position), but also on my inside view the empirical claim is not true and I wanted to see if there were convincing arguments for it.
Yeah, I’d be interested in your answers anyway.
I’m not sure I have much more than the standard MIRI-style arguments about convergent rationality and fragility of human values, at least nothing is jumping to mind ATM. I do think we probably disagree about how strong those arguments are. I’m actually more interested in hearing your take on those lines of argument than saying mine ATM :P
Re: convergent rationality, I don’t buy it (specifically the “convergent” part).
Re: fragility of human values, I do buy the notion of a broad basin of corrigibility, which presumably is less fragile.
But really my answer is “there are lots of ways you can get confidence in a thing that are not proofs”. I think the strongest argument against is “when you have an adversary optimizing against you, nothing short of proofs can give you confidence”, which seems to be somewhat true in security. But then I think there are ways that you can get confidence in “the AI system will not adversarially optimize against me” using techniques that are not proofs.
(Note the alternative to proofs is not trial and error. I don’t use trial and error to successfully board a flight, but I also don’t have a proof that my strategy is going to cause me to successfully board a flight.)
Totally agree; it’s an under-appreciated point!
Here’s my counter-argument: we have no idea what epistemological principles explain this empirical observation. Therefore we don’t actually know that the confidence we achieve in these ways is justified. So we may just be wrong to be confident in our ability to successfully board flights (etc.)
The epistemic/aleatory distinction is relevant here. Taking an expectation over both kinds of uncertainty, we can achieve a high level of subjective confidence in such things / via such means. However, we may be badly mistaken, and thus still extremely likely objectively speaking to be wrong.
This also probably explains a lot of the disagreement, since different people probably just have very different prior beliefs about how likely this kind of informal reasoning is to give us true beliefs about advanced AI systems.
I’m personally quite uncertain about that question ATM. I tend to think we can get pretty far with this kind of informal reasoning in the “early days” of (proto-)AGI development, but we become increasingly likely to fuck up as we start having to deal with vastly superhuman intelligences. And I would like to see more work in epistemology aimed at addressing this (and other Xrisk-relevant concerns, e.g.: what principles of “social epistemology” would allow the human community to effectively manage collective knowledge that is far beyond what any individual can grasp? I’d argue we’re in the process of failing catastrophically at that).