Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by replying to this email.
Audio version here (may not be up yet).
Welcome to another special edition of the newsletter! In this edition, I summarize four conversations that AI Impacts had with researchers who were optimistic that AI safety would be solved “by default”. (Note that one of the conversations was with me.)
While all four of these conversations covered very different topics, I think there were three main points of convergence. First, we were relatively unconvinced by the traditional arguments for AI risk and found discontinuities relatively unlikely. Second, we were more optimistic about solving the problem in the future, when we know more about the problem and have more evidence about powerful AI systems. And finally, we were more optimistic that as we get more evidence of the problem in the future, the existing ML community will actually try to fix that problem.
Conversation with Paul Christiano (Paul Christiano, Asya Bergal, Ronny Fernandez, and Robert Long) (summarized by Rohin): There can’t be too many things that reduce the expected value of the future by 10%; if there were, there would be no expected value left (ETA: see this comment). So, the prior that any particular thing has such an impact should be quite low. With AI in particular, obviously we’re going to try to make AI systems that do what we want them to do. So starting from this position of optimism, we can then evaluate the arguments for doom. The two main arguments: first, we can’t distinguish ahead of time between AIs that are trying to do the right thing, and AIs that are trying to kill us, because the latter will behave nicely until they can execute a treacherous turn. Second, since we don’t have a crisp concept of “doing the right thing”, we can’t select AI systems on whether they are doing the right thing.
However, there are many “saving throws”, or ways that the argument could break down, avoiding doom. Perhaps there’s no problem at all, or perhaps we can cope with it with a little bit of effort, or perhaps we can coordinate to not build AIs that destroy value. Paul assigns a decent amount of probability to each of these (and other) saving throws, and any one of them suffices to avoid doom. This leads Paul to estimate that AI risk reduces the expected value of the future by roughly 10%, a relatively optimistic number. Since it is so neglected, concerted effort by longtermists could reduce it to 5%, making it still a very valuable area for impact. The main way he expects to change his mind is from evidence from more powerful AI systems, e.g. as we build more powerful AI systems, perhaps inner optimizer concerns will materialize and we’ll see examples where an AI system executes a non-catastrophic treacherous turn.
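To make the structure of this argument concrete, here is a toy calculation with made-up numbers (these are not Paul's actual credences): doom requires every saving throw to fail, so even moderately likely saving throws multiply into a much smaller overall risk.

```python
# Toy illustration of the "saving throws" structure; the numbers are made up,
# not Paul's actual credences. Doom requires every saving throw to fail, so the
# (assumed independent) failure probabilities multiply.
p_fail = {
    "the problem turns out to be real": 0.7,
    "we fail to fix it with modest effort": 0.6,
    "we fail to coordinate to avoid deploying it": 0.5,
}

p_doom = 1.0
for saving_throw, p in p_fail.items():
    p_doom *= p

print(f"P(doom) under these toy numbers: {p_doom:.2f}")  # 0.21
```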
Paul also believes that clean algorithmic problems are usually solvable in 10 years, or provably impossible, and early failures to solve a problem don’t provide much evidence of the difficulty of the problem (unless they generate proofs of impossibility). So, the fact that we don’t know how to solve alignment now doesn’t provide very strong evidence that the problem is impossible. Even if the clean versions of the problem were impossible, that would suggest that the problem is much more messy, which requires more concerted effort to solve but also tends to be just a long list of relatively easy tasks to do. (In contrast, MIRI thinks that prosaic AGI alignment is probably impossible.)
Note that even finding out that the problem is impossible can help; it makes it more likely that we can all coordinate to not build dangerous AI systems, since no one wants to build an unaligned AI system. Paul thinks that right now the case for AI risk is not very compelling, and so people don’t care much about it, but if we could generate more compelling arguments, then they would take it more seriously. If instead you think that the case is already compelling (as MIRI does), then you would be correspondingly more pessimistic about others taking the arguments seriously and coordinating to avoid building unaligned AI.
One potential reason MIRI is more doomy is that they take a somewhat broader view of AI safety: in particular, in addition to building an AI that is trying to do what you want it to do, they would also like to ensure that when the AI builds successors, it does so well. In contrast, Paul simply wants to leave the next generation of AI systems in at least as good a situation as we find ourselves in now, since they will be both better informed and more intelligent than we are. MIRI has also previously defined aligned AI as one that produces good outcomes when run, which is a much broader conception of the problem than Paul has. But probably the main disagreement between MIRI and ML researchers is that ML researchers expect that we'll try a bunch of stuff, and something will work out, whereas MIRI expects that the problem is really hard, such that trial and error will only get you solutions that appear to work.
Rohin’s opinion: A general theme here seems to be that MIRI feels like they have very strong arguments, while Paul thinks that they’re plausible arguments, but aren’t extremely strong evidence. Simply having a lot more uncertainty leads Paul to be much more optimistic. I agree with most of this.
However, I do disagree with the point about “clean” problems. I agree that clean algorithmic problems are usually solved within 10 years or are provably impossible, but it doesn’t seem to me like AI risk counts as a clean algorithmic problem: we don’t have a nice formal statement of the problem that doesn’t rely on intuitive concepts like “optimization”, “trying to do something”, etc. This suggests to me that AI risk is more “messy”, and so may require more time to solve.
Conversation with Rohin Shah (Rohin Shah, Asya Bergal, Robert Long, and Sara Haxhia) (summarized by Rohin): The main reason I am optimistic about AI safety is that we will see problems in advance, and we will solve them, because nobody wants to build unaligned AI. A likely crux is that I think that the ML community will actually solve the problems, as opposed to applying a bandaid fix that doesn’t scale. I don’t know why there are different underlying intuitions here.
In addition, many of the classic arguments for AI safety involve a system that can be decomposed into an objective function and a world model, which I suspect will not be a good way to model future AI systems. In particular, current systems trained by RL look like a grab bag of heuristics that correlate well with obtaining high reward. I think that as AI systems become more powerful, the heuristics will become more and more general, but they still won’t decompose naturally into an objective function, a world model, and search. In addition, we can look at humans as an example: we don’t fully pursue convergent instrumental subgoals; for example, humans can be convinced to pursue different goals. This makes me more skeptical of traditional arguments.
I would guess that AI systems will become more interpretable in the future, as they start using the features / concepts / abstractions that humans are using. Eventually, sufficiently intelligent AI systems will probably find even better concepts that are alien to us, but if we only consider AI systems that are (say) 10x more intelligent than us, they will probably still be using human-understandable concepts. This should make alignment and oversight of these systems significantly easier. For significantly stronger systems, we should be delegating the problem to the AI systems that are 10x more intelligent than us. (This is very similar to the picture painted in Chris Olah’s views on AGI safety (AN #72), but that had not been published and I was not aware of Chris’s views at the time of this conversation.)
I’m also less worried about race dynamics increasing accident risk than the median researcher. The benefit of racing a little bit faster is to have a little bit more power / control over the future, while also increasing the risk of extinction a little bit. This seems like a bad trade from each agent’s perspective. (That is, the Nash equilibrium is for all agents to be cautious, because the potential upside of racing is small and the potential downside is large.) I’d be more worried if [AI risk is real AND not everyone agrees AI risk is real when we have powerful AI systems], or if the potential upside was larger (e.g. if racing a little more made it much more likely that you could achieve a decisive strategic advantage).
Overall, it feels like there’s around 90% chance that AI would not cause x-risk without additional intervention by longtermists. The biggest disagreement between me and more pessimistic researchers is that I think gradual takeoff is much more likely than discontinuous takeoff (and in fact, the first, third and fourth paragraphs above are quite weak if there’s a discontinuous takeoff). If I condition on discontinuous takeoff, then I mostly get very confused about what the world looks like, but I also get a lot more worried about AI risk, especially because the “AI is to humans as humans are to ants” analogy starts looking more accurate. In the interview I said 70% chance of doom in this world, but with way more uncertainty than any of the other credences, because I’m really confused about what that world looks like. Two other disagreements, besides the ones above: I don’t buy Realism about rationality (AN #25), whereas I expect many pessimistic researchers do. I may also be more pessimistic about our ability to write proofs about fuzzy concepts like those that arise in alignment.
On timelines, I estimated a very rough 50% chance of AGI within 20 years, and 30-40% chance that it would be using “essentially current techniques” (which is obnoxiously hard to define). Conditional on both of those, I estimated 70% chance that it would be something like a mesa optimizer; mostly because optimization is a very useful instrumental strategy for solving many tasks, especially because gradient descent and other current algorithms are very weak optimization algorithms (relative to e.g. humans), and so learned optimization algorithms will be necessary to reach human levels of sample efficiency.
Rohin’s opinion: Looking over this again, I’m realizing that I didn’t emphasize enough that most of my optimism comes from the more outside view type considerations: that we’ll get warning signs that the ML community won’t ignore, and that the AI risk arguments are not watertight. The other parts are particular inside view disagreements that make me more optimistic, but they don’t factor much into my optimism, beyond being examples of how the meta considerations could play out. I’d recommend this comment of mine to get more of a sense of how the meta considerations factor into my thinking.
I was also glad to see that I still broadly agree with things I said ~5 months ago (since no major new opposing evidence has come up since then), though as I mentioned above, I would now change what I place emphasis on.
Conversation with Robin Hanson (Robin Hanson, Asya Bergal, and Robert Long) (summarized by Rohin): The main theme of this conversation is that AI safety does not look particularly compelling on an outside view. Progress in most areas is relatively incremental and continuous; we should expect the same to be true for AI, suggesting that timelines should be quite long, on the order of centuries. The current AI boom looks similar to previous AI booms, which didn’t amount to much in the past.
Timelines could be short if progress in AI were “lumpy”, as in a FOOM scenario. This could happen if intelligence were one simple thing that just has to be discovered, but Robin expects that intelligence is actually a bunch of not-very-general tools that together let us do many things, and we simply have to find all of these tools, which will presumably not be lumpy. Most of the value from tools comes from more specific, narrow tools, and intelligence should be similar. In addition, the literature on human uniqueness suggests that it wasn’t “raw intelligence” or small changes to brain architecture that made humans unique, but rather our ability to process culture (communicating via language, learning from others, etc.).
In any case, many researchers are now distancing themselves from the FOOM scenario, and are instead arguing that AI risk occurs due to standard principal-agency problems, in the situation where the agent (AI) is much smarter than the principal (human). Robin thinks that this doesn’t agree with the existing literature on principal-agent problems, in which losses from principal-agent problems tend to be bounded, even when the agent is smarter than the principal.
You might think that since the stakes are so high, it’s worth working on it anyway. Robin agrees that it’s worth having a few people (say a hundred) pay attention to the problem, but doesn’t think it’s worth spending a lot of effort on it right now. Effort is much more effective and useful once the problem becomes clear, or once you are working with a concrete design; we have neither of these right now and so we should expect that most effort ends up being ineffective. It would be better if we saved our resources for the future, or if we spent time thinking about other ways that the future could go (as in his book, Age of Em).
It’s especially bad that AI safety has thousands of “fans”, because this leads to a “crying wolf” effect—even if the researchers have subtle, nuanced beliefs, they cannot control the message that the fans convey, which will not be nuanced and will instead confidently predict doom. Then when doom doesn’t happen, people will learn not to believe arguments about AI risk.
Rohin’s opinion: Interestingly, I agree with almost all of this, even though it’s (kind of) arguing that I shouldn’t be doing AI safety research at all. The main place I disagree is with the claim that losses from principal-agent problems are bounded even with perfectly rational agents—this seems crazy to me, and I’d be interested in specific paper recommendations (though note I and others have searched and not found many).
On the point about lumpiness, my model is that there are only a few underlying factors (such as the ability to process culture) that allow humans to so quickly learn to do so many tasks, and almost all tasks require near-human levels of these factors to be done well. So, once AI capabilities on these factors reach approximately human level, we will “suddenly” start to see AIs beating humans on many tasks, resulting in a “lumpy” increase on the metric of “number of tasks on which AI is superhuman” (which seems to be the metric that people often use, though I don’t like it, precisely because it seems like it wouldn’t measure progress well until AI becomes near-human-level).
Conversation with Adam Gleave (Adam Gleave et al) (summarized by Rohin): Adam finds the traditional arguments for AI risk unconvincing. First, it isn’t clear that we will build an AI system that is so capable that it can fight all of humanity from its initial position where it doesn’t have any resources, legal protections, etc. While discontinuous progress in AI could cause this, Adam doesn’t see much reason to expect such discontinuous progress: it seems like AI is progressing by using more computation rather than finding fundamental insights. Second, we don’t know how difficult AI safety will turn out to be; he gives a probability of ~10% that the problem is as hard as (a caricature of) MIRI suggests, where any design not based on mathematical principles will be unsafe. This is especially true because as we get closer to AGI we’ll have many more powerful AI techniques that we can leverage for safety. Third, Adam does expect that AI researchers will eventually solve safety problems; they don’t right now because it seems premature to work on those problems. Adam would be more worried if there were more arms race dynamics, or more empirical evidence or solid theoretical arguments in support of speculative concerns like inner optimizers. He would be less worried if AI researchers spontaneously started to work on relevant problems (more than they already do).
Adam makes the case for AI safety work differently. At the highest level, it seems possible to build AGI, and some organizations are trying very hard to build AGI, and if they succeed it would be transformative. That alone is enough to justify some effort into making sure such a technology is used well. Then, looking at the field itself, it seems like the field is not currently focused on doing good science and engineering to build safe, reliable systems. So there is an opportunity to have an impact by pushing on safety and reliability. Finally, there are several technical problems that we do need to solve before AGI, such as how we get information about what humans actually want.
Adam also thinks that it’s 40-50% likely that when we build AGI, a PhD thesis describing it would be understandable by researchers today without too much work, but ~50% that it’s something radically different. However, it’s only 10-20% likely that AGI comes only from small variations of current techniques (i.e. by vastly increasing data and compute). He would see this as more likely if we hit additional milestones by investing more compute and data (OpenAI Five was an example of such a milestone).
Rohin’s opinion: I broadly agree with all of this, with two main differences. First, I am less worried about some of the technical problems that Adam mentions, such as how to get information about what humans want, or how to improve the robustness of AI systems, and more concerned about the more traditional problem of how to create an AI system that is trying to do what you want. Second, I am more bullish on the creation of AGI using small variations on current techniques, but vastly increasing compute and data (I’d assign ~30%, while Adam assigns 10-20%).
I don’t follow this argument; I also checked the transcript, and I still don’t see why I should buy it. Paul said:
In my words, the argument is “we agree that the future has nontrivial EV, therefore big negative impacts are a priori unlikely”.
But why do we agree about this? Why are we assuming the future can’t be that bleak in expectation? I think there are good outside-view arguments to this effect, but that isn’t the reasoning here.
E.g. if you have a broad distribution over possible worlds, some of which are “fragile” and have 100 things that cut value down by 10%, and some of which are “robust” and don’t, then you get 10,000x more value from the robust worlds. So unless you are a priori pretty confident that you are in a fragile world (or they are 10,000x more valuable, or whatever), the robust worlds will tend to dominate.
Similar arguments work if we aggregate across possible paths to achieving value within a fixed, known world—if there are several ways things can go well, some of which are more robust, those will drive almost all of the EV. And similarly for moral uncertainty (if there are several plausible views, the ones that consider this world a lost cause will instead spend their influence on other worlds) and so forth. I think it’s a reasonably robust conclusion across many different frameworks: your decision shouldn’t end up being dominated by some hugely conjunctive event.
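As a rough numerical sketch of this point (the numbers are illustrative, assuming a fragile world has 100 independent factors that each cut value by 10%):

```python
# Rough sketch of the "robust vs. fragile worlds" EV comparison (illustrative numbers).
fragile_value = 0.9 ** 100   # a fragile world: 100 independent factors, each cutting value by 10%
robust_value = 1.0           # a robust world keeps (normalized) full value

print(f"fragile world value: {fragile_value:.1e}")                     # ~2.7e-05
print(f"robust / fragile ratio: {robust_value / fragile_value:,.0f}")  # ~38,000x

# Even with a strong prior on fragility, robust worlds carry nearly all the EV:
p_fragile = 0.9
ev_fragile = p_fragile * fragile_value
ev_robust = (1 - p_fragile) * robust_value
print(f"share of EV from robust worlds: {ev_robust / (ev_robust + ev_fragile):.4f}")  # ~0.9998
```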
I’m more uncertain about this one, but I believe that a separate problem with this answer is that it’s an argument about where value comes from, not an argument about what is probable. Let’s suppose 50% of all worlds are fragile and 50% are robust. If most of the things that destroy a world are due to emerging technology, then we still have similar amounts of both worlds around right now (or similar measure on both classes if there are infinitely many, or whatever). So it’s not a reason to expect a non-fragile world right now.
Another illustration: if you’re currently falling from a 90-story building, most of the expected utility is in worlds where there coincidentally happens to be a net to safely catch you before you hit the ground, or interventionist simulators decide to rescue you—even if virtually all of the probability is in worlds where you go splat and die. The decision theory looks right, but this is a lot less comforting than the interview made it sound.
Yes, but the fact that the fragile worlds are much more likely to end in the future is a reason to condition your efforts on being in a robust world.
While I do buy Paul’s argument, I think it’d be very helpful if the various summaries of the interviews with him were edited to make it clear that he’s talking about value-conditioned probabilities rather than unconditional probabilities—since the claim as originally stated feels misleading. (Even if some decision theories only use the former, most people think in terms of the latter).
Is this a thing or something you just coined? “Probability” has a meaning, I’m totally against using it for things that aren’t that.
I get why the argument is valid for deciding what we should do – and you could argue that’s the only important thing. But it doesn’t make it more likely that our world is robust, which is what the post was claiming. It’s not about probability, it’s about EV.
This argument seems to point at some extremely important considerations in the vicinity of “we should act according to how we want civilizations similar to us to act” (rather than just focusing on causally influencing our future light cone), etc.
The details of the distribution over possible worlds that you use here seem to matter a lot. How robust are the “robust worlds”? If they are maximally robust (i.e. things turn out great with probability 1 no matter what the civilization does) then we should assign zero weight to the prospect of being in a “robust world”, and place all our chips on being in a “fragile world”.
Conversely, if the distribution over possible worlds assigns sufficient probability to worlds in which there is a single very risky thing that cuts EV down by either 10% or 90% depending on whether the civilization takes it seriously or not, then perhaps such worlds should dominate our decision making.
This is only true if you assume that there is an equal number of robust and fragile worlds out there, and your uncertainty is strictly random, i.e. you’re uncertain about which of those worlds you live in.
I’m not super confident that our world is fragile, but I suspect that most worlds look the same. I.e., maybe 99.99% of worlds are robust, maybe 99.99% are fragile. If it’s the latter, then I probably live in a fragile world.
If it’s a 50% chance that 99.99% of worlds are robust and 50% chance that 99.99% are fragile, then the vast majority of EV comes from the first option where the vast majority of worlds are robust.
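To spell that out numerically, reusing the illustrative 0.9^100 value for a fragile world:

```python
# Both ways of carving up the uncertainty give the same expected value.
fragile_value = 0.9 ** 100   # illustrative value of a fragile world, as before
robust_value = 1.0

# Case A: 50% credence that 99.99% of worlds are robust, 50% that 99.99% are fragile.
ev_a = 0.5 * (0.9999 * robust_value + 0.0001 * fragile_value) \
     + 0.5 * (0.0001 * robust_value + 0.9999 * fragile_value)

# Case B: a flat 50% credence that this particular world is robust.
ev_b = 0.5 * robust_value + 0.5 * fragile_value

print(round(ev_a, 4), round(ev_b, 4))  # both 0.5; almost all of it comes from the robust possibilities
```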
You’re right, the nature of uncertainty doesn’t actually matter for the EV. My bad.
I think it does actually, although I’m not sure how exactly. See Logical vs physical risk aversion.
I’d be interested to hear a bit more about your position on this.
I’m going to argue for the “applying bandaid fixes that don’t scale” position for a second. To me, it seems that there’s a strong culture in ML of “apply random fixes until something looks like it works” and then just rolling with whatever comes out of that algorithm.
I’ll draw attention to image modelling to illustrate what I’m pointing at. Up until about 2014, the main metric for evaluating image models was the negative log likelihood. As far as I can tell, this goes all the way back to at least “To Recognize Shapes, First Learn to Generate Images”, where the CD algorithm acts to maximize the log likelihood of the data. This can be seen in the VAE paper and also the original GAN paper. However, after GANs became popular, the log likelihood metric seemed to have gone out the window. The GANs made really compelling images. Due to the difficulty of evaluating NLL, people invented new metrics: IS and FID were used to assess the quality of the generated images. I might be wrong, but I think it took a while after that for people to realize that SOTA GANs were getting terrible NLLs compared to SOTA VAEs, even though the VAEs generated images that were significantly blurrier/noisier. It also became obvious that GANs were dropping modes of the distribution, effectively failing to model entire classes of images.
As far as I can tell, there’s been a lot of work to get GANs to model all image modes. The most salient and recent example would be DeepMind’s PresGAN, where they clearly show the issue and how PresGAN solves it in Figure 1. However, looking at Table 5, there’s still a huge gap in NLL between PresGAN and VAEs. It seems to me that most of the attempts to solve this issue are very similar to “bandaid fixes that don’t scale”, in the sense that they mostly feel like hacks. None of them really addresses the gap in likelihood between VAEs and GANs.
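For concreteness, here is a minimal sketch of the kind of check that reveals mode dropping on a labelled dataset; `generator` and `classifier` are hypothetical stand-ins for a trained GAN's sampler and a pretrained classifier, not any particular codebase:

```python
import numpy as np

def mode_coverage(generator, classifier, n_samples=10_000, n_classes=10):
    """Count how many classes a generative model actually produces.

    `generator(n)` returns n sampled images; `classifier(images)` returns an
    integer class prediction per image. Both are hypothetical stand-ins here.
    """
    images = generator(n_samples)
    predicted = np.asarray(classifier(images))
    counts = np.bincount(predicted, minlength=n_classes)
    covered = int((counts > 0).sum())
    return covered, counts / n_samples  # classes covered, empirical class frequencies

# A generator covering all modes gives roughly uniform class frequencies; a
# mode-dropping GAN shows some classes with near-zero frequency even while its
# samples look sharp and its IS/FID numbers look good.
```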
I’m worried that a similar story could happen with AI safety. A problem arises and gets swept under the rug for a bit. Later, it’s rediscovered and becomes common knowledge. Then, instead of solving it before moving forward, we see massive increases in capabilities. Simultaneously, the problem is at most addressed with hacks that don’t really solve the problem, or solve it just enough to prevent the increase in capabilities from becoming obviously unjustified.
I agree that ML often does this, but only in situations where the results don’t immediately matter. I’d find it much more compelling to see examples where the “random fix” caused actual bad consequences in the real world.
Perhaps people are optimizing for “making pretty pictures” instead of “negative log likelihood”. I wouldn’t be surprised if for many applications of GANs, diversity of images is not actually that important, and what you really want is that the few images you do generate look really good. In that case, it makes complete sense to push primarily on GANs, and while you try to address mode collapse, when faced with a tradeoff you choose GANs over VAEs anyway.
Suppose that we had extremely compelling evidence that any AI system run with > X amount of compute would definitely kill us all. Do you expect that problem to get swept under the rug?
Assuming your answer is no, then it seems like whether a problem gets swept under the rug depends on particular empirical considerations, such as:
How bad it would be if the problem was real (the magnitude of the downside). This could be evaluated with respect to society and to the individual agents deciding whether or not to deploy the potentially problematic AI.
How compelling the evidence is that the problem is real.
I tend to think that existing problems with AI are not that bad (though in most cases obviously quite real), while long-term concerns about AI would be very bad, but are not obviously real. If the long-term concerns are real, we should get more evidence about them in the future, and then we’ll have a problem that is both very bad and (more) clearly real, and that’s when I expect that it will be taken seriously.
Consider e.g. fairness and bias. Nobody thinks that the problem is solved. People do continue to deploy unfair and biased AI systems, but that’s because the downside of unfair and biased AI systems is smaller in magnitude than the upside of using the AI systems in the first place—they aren’t being deployed because people think they have “solved the problem”.
Current ML culture is to test hundreds of things in a lab until one works. This is fine as long as the AIs being tested are not smart enough to break out of the lab, or to realize they are being tested and play nice until deployment. The default way to test a design is to run it and see, not to reason abstractly about it.
Part of the problem is that we have a really strong unilateralist’s curse. It only takes one person, or a few people, who don’t realize the problem to make something really dangerous. Banning it is also hard: law enforcement isn’t 100% effective, different countries have different laws, and the main real-world ingredient is access to a computer.
The people who are ignoring or don’t understand the current evidence will carry on ignoring or not understanding it. A few more people will be convinced, but don’t expect to convince a creationist with one more transitional fossil.
This is a foom-ish assumption; remember that Rohin is explicitly talking about a non-foom scenario.
^ Yeah, in FOOM worlds I agree more with your (Donald’s) reasoning. (Though I still have questions, like, how exactly did someone stumble upon the correct mathematical principles underlying intelligence by trial and error?)
I don’t think we have good current evidence, so I don’t infer much about whether or not people will buy future evidence from their reactions to current evidence. (See also six heuristics that I think cut against AI risk even after knowing the arguments for AI risk.)
You mentioned that, conditional on foom, you’d be confused about what the world looks like. Is this the main thing you’re confused about in foom worlds, or are there other major things too?
Lots of other things:
Are we imagining a small team of hackers in their basement trying to get AGI on a laptop, or a big corporation using tons of resources?
How does the AGI learn about the world? If you say “it reads the Internet”, how does it learn to read?
When the developers realize that they’ve built AGI, is it still possible for them to pull the plug?
Why doesn’t the AGI try to be deceptive in ways that we can detect, the way children do? Is it just immediately as capable as a smart human and doesn’t need any training? How can that happen by just “finding the right architecture”?
Why is this likely to happen soon when it hasn’t happened in the last sixty years?
I suspect answers to these will provoke lots of other questions. In contrast, the non-foom worlds that still involve AGI + very fast growth seem much closer to a “business-as-usual” world.
I also think that if you’re worried about foom, you should basically not care about any of the work being done at DeepMind / OpenAI right now, because that’s not the kind of work that can foom (except in the “we suddenly find the right architecture” story); yet I notice lots of doomy predictions about AGI are being driven by DM / OAI’s work. (Of course, plausibly you think OpenAI / DM are not going to succeed, even if others do.)
I’m going to start a fresh thread on this, it sounds more interesting (at least to me) than most of the other stuff being discussed here.
If there’s an implicit assumption here that FOOM worlds require someone to stumble upon “the correct mathematical principles underlying intelligence”, I don’t understand why such an assumption is justified. For example, suppose that at some point in the future some top AI lab will throw $1B at a single massive neural architecture search—over some arbitrary slightly-novel architecture space—and that NAS will stumble upon some complicated architecture that its corresponding model, after being trained with a massive amount of computing power, will implement an AGI.
In this case I’m asking why the NAS stumbled upon the correct mathematical architecture underlying intelligence.
Or rather, let’s dispense with the word “mathematical” (which I mainly used because it seems to me that the arguments for FOOM usually involve someone coming up with the right mathematical insight underlying intelligence).
It seems to me that to get FOOM you need the property “if you make even a slight change to the thing, then it breaks and doesn’t work”, which I’ll call fragility. Note that you cannot find fragile things using local search, except if you “get lucky” and start out at the correct solution.
Why did the NAS stumble upon the correct fragile architecture underlying intelligence?
The above ‘FOOM via $1B NAS’ scenario doesn’t seem to me to require this property. Notice that the increase in capabilities during that NAS may be gradual (i.e. before evaluating the model that implements an AGI the NAS evaluates models that are “almost AGI”). The scenario would still count as a FOOM as long as the NAS yields an AGI and no model before that NAS ever came close to AGI.
Conditioned on [$1B NAS yields the first AGI], a FOOM seems to me particularly plausible if either:
no previous NAS at a similar scale was ever carried out; or
the “path in model space” that the NAS traverses is very different from all the paths that previous NASs traversed. This seems to me plausible even if the model space of the $1B NAS is identical to ones used in previous NASs (e.g. if different random seeds yield very different paths); and it seems to me even more plausible if the model space of the $1B NAS is slightly novel.
In this case I’d apply the fragility argument to the research process, which was my original point (though it wasn’t phrased as well then). In the NAS setting, my question is:
Basically, if you’re arguing that most ML researchers just do a bunch of trial-and-error, then you should be modeling ML research as a local search in idea-space, and then you can apply the same fragility argument to it.
Conditioned on [$1B NAS yields the first AGI], that NAS itself may essentially be “a local search in idea-space”. My argument is that such a local search in idea-space need not start in a world where “almost-AGI” models already exist (I listed in the grandparent two disjunctive reasons in support of this).
Relatedly, “modeling ML research as a local search in idea-space” is not necessarily contradictory to FOOM, if an important part of that local search can be carried out without human involvement (which is a supposition that seems to be supported by the rise of NAS and meta-learning approaches in recent years).
I don’t see how my reasoning here relies on it being possible to “find fragile things using local search”.
Okay, responding to those directly:
I have many questions about this scenario:
What caused the researchers to go from “$1M run of NAS” to “$1B run of NAS”, without first trying “$10M run of NAS”? I especially have this question if you’re modeling ML research as “trial and error”; I can imagine justifying a $1B experiment before a $10M experiment if you have some compelling reason that the result you want will happen with the $1B experiment but not the $10M experiment; but if you’re doing trial and error then you don’t have a compelling reason.
Current AI systems are very subhuman, and throwing more money at NAS has led to relatively small improvements. Why don’t we expect similar incremental improvements from the next 3-4 orders of magnitude of compute?
Suppose that such a NAS did lead to human-level AGI. Shouldn’t that mean that the AGI makes progress in AI at the same rate that we did? How does that cause a FOOM? (Yes, the improvements the AI makes compound, whereas the improvements we make to AI don’t compound, but to me that’s the canonical case of continuous takeoff, e.g. as described in Takeoff speeds.)
In all the previous NASs, why did the paths taken produce AI systems that were so much worse than the one taken by the $1B NAS? Did the $1B NAS just get lucky?
(Again, this really sounds like a claim that “the path taken by NAS” is fragile.)
If you want to make the case for a discontinuity because of the lack of human involvement, you would need to argue:
The replacement for humans is way cheaper / faster / more effective than humans (in that case why wasn’t it automated earlier?)
The discontinuity happens as soon as humans are replaced (otherwise, the system-without-human-involvement becomes the new baseline, and all future systems will look like relatively continuous improvements of this system)
The second point definitely doesn’t apply to NAS and meta-learning, and I would argue that the first point doesn’t apply either, though that’s not obvious.
I indeed model a big part of contemporary ML research as “trial and error”. I agree that it seems unlikely that before the first $1B NAS there won’t be any $10M NAS. Suppose there will even be a $100M NAS just before the $1B NAS that (by assumption) results in AGI. I’m pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.
If we look at the history of deep learning from ~1965 to 2019, how well do trend extrapolation methods fare in terms of predicting performance gains for the next 3-4 orders of magnitude of compute? My best guess is that they don’t fare all that well. For example, based on data prior to 2011, I assume such methods predict mostly business-as-usual for deep learning during 2011-2019 (i.e. completely missing the deep learning revolution). More generally, when using trend extrapolations in AI, consider the following from this Open Phil blog post (2016) by Holden Karnofsky (footnote 7):
(The link in the quote appears to be broken, here is one that works.)
NAS seems to me like a good example for an expensive computation that could plausibly constitute a “search in idea-space” that finds an AGI model (without human involvement). But my argument here applies to any such computation. I think it may even apply to a ‘$1B SGD’ (on a single huge network), if we consider a gradient update (or a sequence thereof) to be an “exploration step in idea-space”.
I first need to understand what “human-level AGI” means. Can models in this category pass strong versions of the Turing test? Does this category exclude systems that outperform humans on one or more important dimensions? (It seems to me that the first SGD-trained model that passes strong versions of the Turing test may be a superintelligence.)
Yes, the $1B NAS may indeed just get lucky. A local search sometimes gets lucky (in the sense of finding a local optimum that is a lot better than the ones found in most runs; not in the sense of miraculously starting the search at a great fragile solution). [EDIT: also, something about this NAS might be slightly novel—like the neural architecture space.]
In some past cases where humans did not serve any role in performance gains that were achieved with more compute/data (e.g. training GPT-2 by scaling up GPT), there were no humans to replace. So I don’t understand the question “why wasn’t it automated earlier?”
In the second point, I need to first understand how you define that moment in which “humans are replaced”. (In the $1B NAS scenario, would that moment be the one in which the NAS is invoked?)
Meta: I feel like I am arguing for “there will not be a discontinuity”, and you are interpreting me as arguing for “we will not get AGI soon / AGI will not be transformative”, neither of which I believe. (I have wide uncertainty on timelines, and I certainly think AGI will be transformative.) I’d like you to state what position you think I’m arguing for, tabooing “discontinuity” (not the arguments for it, just the position).
I’m arguing against FOOM, not about whether there will be a fire alarm. The fire alarm question seems orthogonal to me. I’m more uncertain about the fire alarm question.
This sounds to me like saying “well, we can’t trust predictions based on past data, and we don’t know that we won’t find an AGI, so we should worry about that”. I am not compelled by arguments that tell me to worry about scenario X without giving me a reason to believe that scenario X is likely. (Compare: “we can’t rule out the possibility that the simulators want us to build a tower to the moon or else they’ll shut off the simulation, so we better get started on that moon tower.”)
This is not to say that such scenario X’s must be false—reality could be that way—but that given my limited amount of time, I must prioritize which scenarios to pay attention to, and one really good heuristic for that is to focus on scenarios that have some inside-view reason that makes me think they are likely. If I had infinite time, I’d eventually consider these scenarios (even the simulators wanting us to build a moon tower hypothesis).
Some other more tangential things:
The trend that changed in 2012 was that of the amount of compute applied to deep learning. I suspect trend extrapolation with compute as the x-axis would do okay; trend extrapolation with calendar year as the x-axis would do poorly. But as I mentioned above, this is not a crux for me, since it doesn’t give me an inside-view reason to expect FOOM; I wouldn’t even consider it weak evidence for FOOM if I changed my mind on this. (If the data showed a big discontinuity, that would be evidence, but I’m fairly confident that while there was a discontinuity it was relatively small.)
I think you’re arguing for something like: Conditioned on [the first AGI is created at time t by AI lab X], it is very unlikely that immediately before t the researchers at X have a very low credence in the proposition “we will create an AGI sometime in the next 30 days”.
(Tbc, I did not interpret you as arguing about timelines or AGI transformativeness; and neither did I argue about those things here.)
Using the “fire alarm” concept here was a mistake, sorry for that. Instead of writing:
I should have written:
I generally have a vague impression that many AIS/x-risk people tend to place too much weight on trend extrapolation arguments in AI (or tend to not give enough attention to important details of such arguments), which may have triggered me to write the related stuff (in response to you seemingly applying a trend extrapolation argument with respect to NAS). I was not listing the reasons for my beliefs specifically about NAS.
(I’m mindful of your time and so I don’t want to branch out this discussion into unrelated topics, but since this seems to me like a potentially important point...) Even if we did have infinite time and the ability to somehow determine the correctness of any given hypothesis with super-high-confidence, we may not want to evaluate all hypotheses—that involve other agents—in arbitrary order. Due to game theoretical stuff, the order in which we do things may matter (e.g. due to commitment races in logical time). For example, after considering some game-theoretical meta considerations we might decide to make certain binding commitments before evaluating such and such hypotheses; or we might decide about what additional things we should consider or do before evaluating some other hypotheses, etcetera.
Conditioned on the first AGI being aligned, it may be important to figure out how we make sure that that AGI “behaves wisely” with respect to this topic (because the AGI might be able to evaluate a lot of weird hypotheses that we can’t).
Can you give me an example? I don’t see how this would work.
(Tbc, I’m imagining that the universe stops, and only I continue thinking; there are no other agents thinking while I’m thinking, and so afaict I should just implement UDT.)
Creating some sort of commitment device that would bind us to follow UDT—before we evaluate some set of hypotheses—is an example for one potentially consequential intervention.
As an aside, my understanding is that in environments that involve multiple UDT agents, UDT doesn’t necessarily work well (or is not even well-defined?).
Also, if we would use SGD to train a model that ends up being an aligned AGI, maybe we should figure out how to make sure that that model “follows” a good decision theory. (Or does this happen by default? Does it depend on whether “following a good decision theory” is helpful for minimizing expected loss on the training set?)
It wasn’t exactly that (in particular, I didn’t have the researcher’s beliefs in mind), but I also believe that statement for basically the same reasons so that should be fine. There’s a lot of ambiguity in that statement (specifically, what is AGI), but I probably believe it for most operationalizations of AGI.
(For reference, I was considering “will there be a 1 year doubling of economic output that started before the first 4 year doubling of economic output ended”; for that it’s not sufficient to just argue that we will get AGI suddenly, you also have to argue that the AGI will very quickly become superintelligent enough to double economic output in a very short amount of time.)
I mean, the difference between a $100M NAS and a $1B NAS is:
Up to 10x the number of models evaluated
Up to 10x the size of models evaluated
If you increase the number of models by 10x and leave the size the same, that somewhat increases your optimization power. If you model the NAS as picking architectures randomly, the $1B NAS can have at most 10x the chance of finding AGI, regardless of fragility, and so can only have at most 10x the expected “value” (whatever your notion of “value”).
If you then also model architectures as non-fragile, then once you have some optimization power, adding more optimization power doesn’t make much of a difference, e.g. the max of n draws from Uniform([0, 1]) has expected value n/(n+1) = 1 − 1/(n+1), so once n is already large (e.g. 100), increasing it makes ~no difference. Of course, our actual distributions will probably be more bottom-heavy, but as distributions get more bottom-heavy we use gradient descent / evolutionary search to deal with that.
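A quick simulation of this Uniform([0, 1]) toy model, just to illustrate the diminishing returns (it is not meant as a model of any real architecture search):

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_max(n, trials=10_000):
    """Monte Carlo estimate of E[max of n i.i.d. Uniform(0, 1) draws]."""
    return rng.random((trials, n)).max(axis=1).mean()

for n in [10, 100, 1000]:
    print(n, round(expected_max(n), 4), "analytic n/(n+1):", round(n / (n + 1), 4))
# A 10x increase from n=100 to n=1000 only moves the expected best draw
# from ~0.990 to ~0.999.
```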
For the size, it’s possible that increases in size lead to huge increases in intelligence, but that doesn’t seem to agree with ML practice so far. Even if you ignore trend extrapolation, I don’t see a reason to expect that increasing model sizes should mean the difference between not-even-close-to-AGI and AGI.
I don’t. NAS can be done with RL or evolutionary computation methods. (Tbc, when I said I model a big part of contemporary ML research as “trial and error”, by trial and error I did not mean random search.)
Earlier in this discussion you defined fragility as the property “if you make even a slight change to the thing, then it breaks and doesn’t work”. While finding fragile solutions is hard, finding non-fragile solution is not necessarily easy, so I don’t follow the logic of that paragraph.
Suppose that all model architectures are indeed non-fragile, and some of them can implement AGI (call them “AGI architectures”). It may be the case that relative to the set of model architectures that we can end up with when using our favorite method (e.g. evolutionary search), the AGI architectures are a tiny subset. E.g. the size ratio can be 10^-10 (and then running our evolutionary search 10x more times means roughly 10x the probability of finding an AGI architecture, if [number of runs] << 10^10).
I do think that similar conclusions apply there as well, though I’m not going to make a mathematical model for it.
I’m not saying it is; I’m saying that however hard it is to find a non-fragile good solution, it is easier to find a solution that is almost as good. When I say
I mean to imply that the existing optimization power will do most of the work, for whatever quality of solution you are getting.
(Aside: it would be way smaller than 10^-10.) In this scenario, my argument is that the size ratio for “almost-AGI architectures” is better (e.g. 10^-9), and so you’re more likely to find one of those first.
In practice, if you have a thousand parameters that determine an architecture, and 10 settings for each of them, the size ratio for the (assumed unique) globally best architecture is 10^-1000. In this setting, I expect several orders of magnitude of difference between the size ratio of almost-AGI and the size ratio of AGI, making it essentially guaranteed that you find an almost-AGI architecture before an AGI architecture.
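To spell out the 10^-9 vs 10^-10 version of this argument numerically (treating the search as independent random draws, which is only a sketch of the argument's structure):

```python
# Illustrative size ratios from the discussion above; a real search would not be
# independent random draws, so this only sketches the structure of the argument.
ratio_almost_agi = 1e-9   # fraction of reachable architectures that are "almost-AGI"
ratio_agi = 1e-10         # fraction that are full AGI

n_draws = 1e10            # roughly the number of draws needed to expect one AGI hit
print("expected AGI architectures found:  ", n_draws * ratio_agi)          # ~1
print("expected almost-AGI architectures: ", n_draws * ratio_almost_agi)   # ~10

# Probability that the first "interesting" architecture encountered is only almost-AGI:
p_first_is_almost = ratio_almost_agi / (ratio_almost_agi + ratio_agi)
print("P(first hit is merely almost-AGI): ", round(p_first_is_almost, 3))  # ~0.909
```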
For a “local search NAS” (rather than “random search NAS”) it seems that we should be considering here the set of [“almost-AGI architectures” from which the local search would not find an “AGI architecture”].
The “$1B NAS discontinuity scenario” allows for the $1B NAS to find “almost-AGI architectures” before finding an “AGI architecture”.
Agreed. My point is that the $100M NAS would find the almost-AGI architectures. (My point with the size ratios is that whatever criterion you use to say “and that’s why the $1B NAS finds AGI while the $100M NAS doesn’t”, my response would be that “well, almost-AGI architectures require a slightly easier-to-achieve value of <criterion>, that the $100M NAS would have achieved”.)
I’ve seen the “ML gets deployed carelessly” narrative pop up on LW a bunch, and while it does seem accurate in many cases, I wanted to note that there are counter-examples. The most prominent counter-example I’m aware of is the incredibly cautious approach DeepMind/Google took when designing the ML system that cools Google’s datacenters.
This seems to be careful deployment. The concept of deployment is going from an AI in the lab, to the same AI in control of a real world system. Suppose your design process was to fiddle around in the lab until you make something that seems to work. Once you have that, you look at it to understand why it works. You try to prove theorems about it. You subject it to some extensive battery of testing and will only put it in a self driving car/ data center cooling system once you are confident it is safe.
There are two places this could fail. Your testing procedures could be insufficient, or your AI could hack out of the lab before the testing starts. I see little to no defense against the latter.
[...]
This is fair. However, the point of the example is more that mode dropping and bad NLL were not noticed when people started optimizing GANs for image quality. As far as I can tell, it took a while for individuals to notice, longer for it to become common knowledge, and even more time for anyone to do anything about it. Even now, the “solutions” are hacks that don’t completely resolve the issue.
There was a large window of time where a practitioner could implement a GAN expecting it to cover all the modes. If there were a world where failing to cover all the modes of the distribution led to large negative consequences, the failure would probably have gone unnoticed until it was too late.
Here’s a real example. This is the NTSB crash report for the Uber autonomous vehicle that killed a pedestrian. Someone should probably do an in-depth analysis of the whole thing, but for now I’ll draw your attention to section 1.6.2, Hazard Avoidance and Emergency Braking. In it they say:
[...]
This strikes me as a “random fix” where the core issue was that the system did not have sufficient discriminatory power to tell apart a safe situation from an unsafe situation. Instead of properly solving this problem, the researchers put in a hack.
I agree that we shouldn’t be worried about situations where there is a clear threat. But that’s not quite the class of failures that I’m worried about. Fairness, bias, and adversarial examples are all closer to what I’m getting at. The general pattern is that ML researchers hack together a system that works, but has some problems they’re unaware of. Later, the problems are discovered and the reaction is to hack together a solution. This is pretty much the opposite of the safety mindset EY was talking about. It leaves room for catastrophe in the initial window when the problem goes undetected, and indefinitely afterwards if the hack is insufficient to deal with the issue.
More specifically, I’m worried about a situation where at some point during grad student descent someone says, “That’s funny...”, then goes on to publish their work. Later, someone else deploys their idea plus 3 orders of magnitude more computing power and we all die. That, or we don’t all die. Instead we resolve the issue with a hack. Then a couple of bumps in computing power and capabilities later, we all die.
The above comes across as both paranoid and far-fetched, and I’m not sure the AI community will take on the required level of caution to prevent it unless we get an AI equivalent of Chernobyl before we get UFAI. Nuclear reactor design is the only domain I know of where people are close to sufficiently paranoid.
Important thing to remember is that Rohin is explicitly talking about a non-foom scenario, so the assumption is that humanity would survive AI-Chernobyl.
My worry is less that we wouldn’t survive AI-Chernobyl as much as it is that we won’t get an AI-Chernobyl.
I think that this is where there’s a difference in models. Even in a non-FOOM scenario I’m having a hard time envisioning a world where the gap in capabilities between AI-Chernobyl and global catastrophic UFAI is that large. I used Chernobyl as an example because it scared the public and the industry into making things very safe. It had a lot going for it to make that happen. Radiation is invisible and hurts you by either killing you instantly, making your skin fall off, or giving you cancer and birth defects. The disaster was also extremely expensive, with total costs on the order of US$100 billion.
If a defective AI system manages to do something that instils the same level of fear into researchers and the public as Chernobyl did, I would expect that we were on the cusp of building systems that we couldn’t control at all.
If I’m right and the gap between those two events is small, then there’s a significant risk that nothing will happen in that window. We’ll get plenty of warnings that won’t be sufficient to instil the necessary level of caution into the community, and later down the road we’ll find ourselves in a situation we can’t recover from.
My impression is that people working on self-driving cars are incredibly safety-conscious, because the risks are very salient.
I don’t think AI-Chernobyl has to be a Chernobyl level disaster, just something that makes the risks salient. E.g. perhaps an elder care AI robot pretends that all of its patients are fine in order to preserve its existence, and this leads to a death and is then discovered. If hospitals let AI algorithms make decisions about drugs according to complicated reward functions, I would expect this to happen with current capabilities. (It’s notable to me that this doesn’t already happen, given the insane hype around AI.)
Safety-conscious people working on self-driving cars don’t program their cars to not take evasive action after detecting that a collision is imminent.
I think it already has. (It was for extra care, not drugs, but it’s a clear-cut case of a misspecified objective function leading to suboptimal decisions for a multitude of individuals.) I’ll note, perhaps unfairly, that the fact that this study was not salient enough to make it to your attention, even with a culture-war signal boost, is evidence that it needs to be a Chernobyl-level event.
I agree that Tesla does not seem very safety conscious (but it’s notable that they are still safer than human drivers in terms of fatalities per mile, if I remember correctly?)
Huh, what do you know.
Faced with an actual example, I’m realizing that what I actually expect would cause people to take it more seriously is a) the belief that AGI is near and b) an example where the AI algorithm “deliberately” causes a problem (i.e. “with full knowledge” that the thing it was doing was not what we wanted). I think most deep RL researchers already believe that reward hacking is a thing (which is what that study shows).
Tangential, but that makes it less likely that I read it; I try to completely ignore anything with the term “racial bias” in its title unless it’s directly pertinent to me. (Being about AI isn’t enough to make it pertinent to me.)
What do you expect the ML community to do at that point? Coordinate to stop or slow down the race to AGI until AI safety/alignment is solved? Or do you think each company/lab will unilaterally invest more into safety/alignment without slowing down capability research much, and that will be sufficient? Or something else?
I worry about a parallel with the “energy community”, a large part of which not just ignores but actively tries to obscure or downplay warning signs about future risks associated with certain forms of energy production. Given that the run-up to AGI will likely generate huge profits for AI companies as well as provide clear benefits for many people (compared to which, the disasters that will have occurred by then may well seem tolerable by comparison), and given probable disagreements between different experts about how serious the future risks are, it seems likely to me that AI risk will become politicized/controversial in a way similar to climate change, which will prevent effective coordination around it.
On the other hand… maybe AI will be more like nuclear power than fossil fuels, and a few big accidents will stall its deployment for quite a while. Is this why you’re relatively optimistic about AI risk being taken seriously, and if so can you share why you think nuclear power is a closer analogy?
It depends a lot on the particular warning shot that we get. But on the strong versions of warning shots, where there’s common knowledge that building an AGI runs a substantial risk of destroying the world, yes, I expect them to not build AGI until safety is solved. (Not to the standard you usually imagine, where we must also solve philosophical problems, but to the standard I usually imagine, where the AGI is not trying to deceive us or work against us.)
This depends on other background factors, e.g. how much the various actors think they are value-aligned vs. in zero-sum competition. I currently think the ML community thinks they are mostly but not fully value-aligned, and they will influence companies and governments in that direction. (I also want more longtermists to be trying to build more common knowledge of how much humans are value aligned, to make this more likely.)
The major disanalogy is that catastrophic outcomes of climate change do not personally affect the CEOs of energy companies very much, whereas AI x-risk affects everyone. (Also, maybe we haven’t gotten clear and obvious warning shots?)
I agree that my story requires common knowledge of the risk of building AGI, in the sense that you need people to predict “running this code might lead to all humans dying”, and not “running this code might lead to <warning shot effect>”. You also need relative agreement on the risks.
I think this is pretty achievable. Most of the ML community already agrees that building an AGI is high-risk if not done with some argument for safety. The thing people tend to disagree on is when we will get AGI and how much we should work on safety before then.
To the extent that we expect strong warning shots and ability to avoid building AGI upon receiving such warning shots, this seems like an argument for researchers/longtermists to work on / advocate for safety problems beyond the standard of “AGI is not trying to deceive us or work against us” (because that standard will likely be reached anyway). Do you agree?
Some types of AI x-risk don’t affect everyone though (e.g., ones that reduce the long term value of the universe or multiverse without killing everyone in the near term).
Yes.
Agreed, all else equal those seem more likely to me.
Ok, I wasn’t sure that you’d agree, but given that you do, it seems that when you wrote the title of this newsletter “Why AI risk might be solved without additional intervention from longtermists” you must have meant “Why some forms of AI risk …”, or perhaps certain forms of AI risk just didn’t come to your mind at that time. In either case it seems worth clarifying somewhere that you don’t currently endorse interpreting “AI risk” as “AI risk in its entirety” in that sentence.
Similarly, on the inside you wrote:
It seems worth clarifying that you’re only optimistic about certain types of AI safety problems.
(I’m basically making the same complaint/suggestion that I made to Matthew Barnett not too long ago. I don’t want to be too repetitive or annoying, so let me know if I’m starting to sound that way.)
Tbc, I’m optimistic about all the types of AI safety problems that people have proposed, including the philosophical ones. When I said “all else equal those seem more likely to me”, I meant that if all the other facts about the matter are the same, but one risk affects only future people and not current people, that risk would seem more likely to me because people would care less about it. But I am optimistic about the actual risks that you and others argue for.
That said, over the last week I have become less optimistic specifically about overcoming race dynamics, mostly from talking to people at FHI / GovAI. I’m not sure how much to update though. (Still broadly optimistic.)
It’s notable that AI Impacts asked for people who were skeptical of AI risk (or something along those lines) and to my eye it looks like all four of the people in the newsletter independently interpreted that as accidental technical AI risk in which the AI is adversarially optimizing against you (or at least that’s what the four people argued against). This seems like pretty strong evidence that when people hear “AI risk” they now think of technical accidental AI risk, regardless of what the historical definition may have been. That is certainly my default assumption when someone (other than you) says “AI risk”.
I would certainly support having clearer definitions and terminology if we could all agree on them.
Why? I actually wrote a reply that was more questioning in tone, and then changed it because I found some comments you made where you seemed to be concerned about the additional AI risks. Good thing I saved a copy of the original reply, so I’ll just paste it below:
I wonder if you would consider writing an overview of your perspective on AI risk strategy. (You do have a sequence but I’m looking for something that’s more comprehensive, that includes e.g. human safety and philosophical problems. Or let me know if there’s an existing post that I’ve missed.) I ask because you’re one of the most prolific participants here but don’t fall into one of the existing “camps” on AI risk for whom I already have good models. It’s happened several times that I see a comment from you that seems wrong or unclear, but I’m afraid to risk being annoying or repetitive with my questions/objections. (I sometimes worry that I’ve already brought up some issue with you and then forgot your answer.) It would help a lot to have a better model of you in my head and in writing so I can refer to that to help me interpret what the most likely intended meaning of a comment is, or to predict how you would likely answer if I were to ask certain questions.
Maybe that’s because the question was asked in a way that indicated the questioner was mostly interested in technical accidental AI risk? And some of them may be fine with defining “AI risk” as “AI-caused x-risk” but just didn’t have the other risks at the top of their minds, because their personal focus is on the technical/accidental side. In other words I don’t think this is strong evidence that all 4 people would endorse defining “AI risk” as “technical accidental AI risk”. It also seems notable that I’ve been using “AI risk” in a broad sense for a while and no one has objected to that usage until now.
The current situation seems to be that we have two good (relatively clear) terms “technical accidental AI risk” and “AI-caused x-risk” and the dispute is over what plain “AI risk” should be shorthand for. Does that seem fair?
Seems right, I think my opinions fall closest to Paul’s, though it’s also hard for me to tell what Paul’s opinions are. I think this older thread is a relatively good summary of the considerations I tend to think about, though I’d place different emphases now. (Sadly I don’t have the time to write a proper post about what I think about AI strategy—it’s a pretty big topic.)
Yes, though I would frame it as “the ~5 people reading these comments have two clear terms, while everyone else uses a confusing mishmash of terms”. The hard part is in getting everyone else to use the terms. I am generally skeptical of deciding on definitions and getting everyone else to use them, and usually try to use terms the way other people use terms.
Agreed with this, but see above about trying to conform with the way terms are used, rather than defining terms and trying to drag everyone else along.
This seems odd given your objection to “soft/slow” takeoff usage and your advocacy of “continuous takeoff” ;)
I don’t think “soft/slow takeoff” has a canonical meaning—some people (e.g. Paul) interpret it as not having discontinuities, while others interpret it as capabilities increasing slowly past human intelligence over (say) centuries (e.g. Superintelligence). If I say “slow takeoff” I don’t know which one the listener is going to hear it as. (And if I had to guess, I’d expect they think about the centuries-long version, which is usually not the one I mean.)
In contrast, I think “AI risk” has a much more canonical meaning, in that if I say “AI risk” I expect most listeners to interpret it as accidental risk caused by the AI system optimizing for goals that are not our own.
(Perhaps an important point is that I’m trying to communicate to a much wider audience than the people who read all the Alignment Forum posts and comments. I’d feel more okay about “slow takeoff” if I was just speaking to people who have read many of the posts already arguing about takeoff speeds.)
AI risk is just a shorthand for “accidental technical AI risk.” To the extent that people are confused, I agree it’s probably worth clarifying the type of risk by adding “accidental” and “technical” whenever we can.
However, I disagree with the idea that we should expand the word AI risk to include philosophical failures and intentional risks. If you open the term up, these outcomes might start to happen:
It becomes unclear in conversation what people mean when they say AI risk
Like The Singularity, it becomes a buzzword.
Journalists start projecting Terminator scenarios onto the words, and now have justification because even the researchers say that AI risk can mean a lot of different things.
It puts a whole bunch of types of risk into one basket, suggesting to outsiders that all attempts to reduce “AI risk” might be equally worthwhile.
ML researchers start to distrust AI risk researchers, because people who are worried about the Terminator are using the same words as the AI risk researchers and therefore get associated with them.
This can all be avoided by having a community norm to clarify that we mean technical accidental risk when we say AI risk, and when we’re talking about other types of risks we use more precise terminology.
I don’t think “AI risk” was originally meant to be a shorthand for “accidental technical AI risk”. The earliest considered (i.e., not off-hand) usage I can find is in the title of Luke Muehlhauser’s AI Risk and Opportunity: A Strategic Analysis where he defined it as “the risk of AI-caused extinction”.
(He used “extinction” but nowadays we tend to think in terms of “existential risk”, which also includes “permanent large negative consequences”, which seems like a reasonable expansion of “AI risk”.)
I want to include philosophical failures, as long as the consequences of the failures flow through AI, because (aside from historical usage) technical problems and philosophical problems blend into each other, and I don’t see a point in drawing an arbitrary and potentially contentious border between them. (Is UDT a technical advance or a philosophical advance? Is defining the right utility function for a Sovereign Singleton a technical problem or a philosophical problem? Why force ourselves to answer these questions?)
As for “intentional risks” it’s already common practice to include that in “AI risk”:
Besides that, I think there’s also a large grey area between “accident risk” and “misuse” where the risk partly comes from technical/philosophical problems and partly from human nature. For example humans might be easily persuaded by wrong but psychologically convincing moral/philosophical arguments that AIs can come up with and then order their AIs to do terrible things. Even pure intentional risks might have technical solutions. Again I don’t really see the point of trying to figure out which of these problems should be excluded from “AI risk”.
It seems perfectly fine to me to use that as shorthand for “AI-caused x-risk” and use more specific terms when we mean more specific risks.
What do you mean? Like people will use “AI risk” when their project has nothing to do with “AI-caused x-risk”? Couldn’t they do that even if we define “AI risk” to be “accidental technical AI risk”?
Terminator scenarios seem to be scenarios of “accidental technical AI risk” (they’re just not very realistic scenarios) so I don’t see how defining “AI risk” to mean that would prevent journalists from using Terminator scenarios to illustrate “AI risk”.
I don’t think this is a good argument, because even within “accidental technical AI risk” there are different problems that aren’t equally worthwhile to solve, so why aren’t you already worried about outsiders thinking all those problems are equally worthwhile?
See my response above regarding “Terminator scenarios”.
I propose that we instead stick with historical precedent and keep “AI risk” to mean “AI-caused x-risk” and use more precise terminology to refer to more specific types of AI-caused x-risk that we might want to talk about. Aside from what I wrote above, it’s just more intuitive/commonsensical that “AI risk” means “AI-caused x-risk” in general instead of a specific kind of AI-caused x-risk.
However I appreciate that someone who works mostly on the less philosophical / less human-related problems might find it tiresome to say or type “technical accidental AI risk” all the time to describe what they do or to discuss the importance of their work, and can find it very tempting to just use “AI risk”. It would probably be good to create a (different) shorthand or acronym for it to remove this temptation and to make their lives easier.
I appreciate the arguments, and I think you’ve mostly convinced me, mostly because of the historical argument.
I do still have some remaining apprehension about using AI risk to describe every type of risk arising from AI.
That is true. The way I see it, UDT is definitely on the technical side, even though it incorporates a large amount of philosophical background. When I say technical, I mostly mean “specific, uses math, has clear meaning within the language of computer science” rather than a more narrow meaning of “is related to machine learning” or something similar.
My issue with arguing for philosophical failure is that, as I’m sure you’re aware, there’s a well known failure mode of worrying about vague philosophical problems rather than more concrete ones. Within academic philosophy, the majority of discussion surrounding AI is centered around consciousness, intentionality, whether it’s possible to even construct a human-like machine, whether they should have rights etc.
There’s a unique thread of philosophy that arose from LessWrong, which includes work on decision theory, that doesn’t focus on these thorny and low-priority questions. While I’m comfortable with you arguing that philosophical failure is important, my impression is that the overly philosophical approach used by many people has done more harm than good for the field in the past, and continues to do so.
It is therefore sometimes nice to tell people that the problems that people work on here are concrete and specific, and don’t require doing a ton of abstract philosophy or political advocacy.
This is true, but my impression is that when you tell people that a problem is “technical” it generally makes them refrain from having a strong opinion before understanding a lot about it. “Accidental” also reframes the discussion by reducing the risk of polarizing biases. This is a common theme in many fields:
Physicists sometimes get frustrated with people arguing about “the philosophy of the interpretation of quantum mechanics” because there’s a large subset of people who think that since it’s philosophical, then you don’t need to have any subject-level expertise to talk about it.
Economists try to emphasize that they use models and empirical data, because a lot of people think that their field of study is more-or-less just high status opinion + math. Emphasizing that there are real, specific models that they study helps to reduce this impression. Same with political science.
A large fraction of tech workers are frustrated about the use of Machine Learning as a buzzword right now, and part of it is that people started saying Machine Learning = AI rather than Machine Learning = Statistics, and so a lot of people thought that even if they don’t understand statistics, they can understand AI since that’s like philosophy and stuff.
Scott Aaronson has said
My guess is that this shift in his thinking occurred because a lot of people started talking about technical risks from AI, rather than framing it as a philosophy problem, or a problem of eliminating bad actors. Eliezer has shared this viewpoint for years, writing in the CEV document,
reflecting the temptation to derail discussions about technical accidental risks.
Also, isn’t defining “AI risk” as “technical accidental AI risk” analogous to defining “apple” as “red apple” (in terms of being circular/illogical)? I realize natural language doesn’t have to be perfectly logical, but this still seems a bit too egregious.
I agree that this is troubling, though I think it’s similar to how I wouldn’t want the term biorisk to be expanded to include biodiversity loss (a risk, but not the right type), regular human terrorism (humans are biological, but it’s a totally different issue), zombie uprisings (they are biological, but it’s totally ridiculous), alien invasions etc.
Not to say that’s what you are doing with AI risk. I’m worried about what others will do with it if the term gets expanded.
Well as I said, natural language doesn’t have to be perfectly logical, and I think “biorisk” is somewhat in that category, but there’s an explanation that makes it a bit more reasonable than it might first appear, which is that the “bio” refers not to “biological” but to “bioweapon”. This is actually one of the definitions that Google gives when you search for “bio”: “relating to or involving the use of toxic biological or biochemical substances as weapons of war. ‘bioterrorism’”
I guess the analogous thing would be if we start using “AI” to mean “technical AI accidents” in a bunch of phrases, which feels worse to me than the “bio” case, maybe because “AI” is a standalone word/acronym instead of a prefix? Does this make sense to you?
But the term was expanded from the beginning. Have you actually observed it being used in ways that you fear (and which would be prevented if we were to redefine it more narrowly)?
Yeah that makes sense. Your points about “bio” not being short for “biological” were valid, but the fact that as a listener I didn’t know that fact implies that it seems really easy to mess up the language usage here. I’m starting to think that the real fight should be about using terms that aren’t self explanatory.
I’m not sure about whether it would have been prevented by using the term more narrowly, but in my experience the most common reaction people outside of EA/LW (and even sometimes within) have to hearing about AI risk is to assume that it’s not technical, and to assume that it’s not about accidents. In that sense, I have been exposed to quite a bit of this already.
Tangential, but I wouldn’t be surprised if researchers were fairly quickly aware of the issue (e.g. within two years of the original GAN paper), but it took a while to become common knowledge because it isn’t particularly flashy. (There’s a surprising-to-me amount of know-how that is stored in researchers’ brains and never put down on paper.)
I mean, the solution is to use a VAE. If you care about covering modes but not image quality, you choose a VAE; if you care about image quality but not covering modes, you choose a GAN.
(Also, while I know very little about VAEs / GANs, Implicit Maximum Likelihood Estimation sounded like a principled fix to me.)
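(As an illustrative aside, not part of the original exchange: the mode-coverage tradeoff above can be seen directly in the two training objectives. Below is a minimal, hypothetical PyTorch sketch, where decoder and discriminator stand in for arbitrary networks.)
```python
# Illustrative sketch only; `decoder` and `discriminator` stand in for arbitrary
# torch.nn modules (the discriminator is assumed to end in a sigmoid).
import torch
import torch.nn.functional as F

def vae_loss(decoder, mu, log_var, x):
    # Reparameterization trick: sample a latent code for each training example.
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    # Every x must be reconstructable from its own code, so dropping a mode of the
    # training data is directly penalized by the reconstruction term.
    recon_loss = F.mse_loss(decoder(z), x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl

def gan_generator_loss(discriminator, fake_images):
    # The generator is only rewarded for fooling the discriminator on the samples
    # it produces; nothing in this loss forces it to cover every mode of the data.
    target = torch.ones(fake_images.size(0), 1)
    return F.binary_cross_entropy(discriminator(fake_images), target)
```
(If I understand it correctly, IMLE forces mode coverage in yet another way, by requiring every data point to have a nearby generated sample.)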
Agreed, I would guess that the researchers / engineers knew this was risky and thought it was worth it anyway. Or perhaps the managers did. But I do agree this is evidence against my position.
Why isn’t the threat clear once the problems are discovered?
Part of my claim is that we probably will get that (assuming AI really is risky), though perhaps not Chernobyl-level disaster, but still something with real negative consequences that “could be worse”.
I think I should be more specific, when you say:
I mean that no one sane who knows that will run that AI system with > X amount of computing power. When I wrote that comment I also thought that no one sane would fail to blow the whistle in that event. See my note at the end of the comment.*
However, when presented with that evidence, I don’t expect the AI community to react appropriately. The correct response to that evidence is to stop what you’re doing, and revisit the entire process and culture that led to the creation of an algorithm that will kill us all if run with >X amount of compute. What I expect will happen is that the AI community will try to solve the problem the same way it’s solved every other problem it has encountered: it will try an inordinate number of unprincipled hacks to get around the issue.
Conditional on no FOOM, I can definitely see plenty of events with real negative consequences that “could be worse”. However, I claim that anything short of a Chernobyl-level event won’t shock the community and the world into changing its culture or trying to coordinate. I also claim that the capabilities gap between a Chernobyl-level event and a global catastrophic event is small, such that even in a non-FOOM scenario the former might not happen before the latter. Together, I think that there is a high probability that we will not get a disaster that is scary enough to get the AI community to change its culture and coordinate before it’s too late.
*Now that I think about it more though, I’m less sure. Undergraduate engineers get entire lectures dedicated to how and when to blow the whistle when faced with unethical corporate practices and dangerous projects or designs. When working, they also have insurance and some degree of legal protection from vengeful employers. Even then, you still see cover-ups of shortcomings that lead to major industrial disasters. For instance, long before the disaster, someone had determined that the Fukushima plant was indeed vulnerable to large tsunami impacts. The pattern where someone knows that something will go wrong but nothing is done to prevent it for one reason or another is not that uncommon in engineering disasters. Regardless of whether this is due to hindsight bias or an inadequate process for addressing safety issues, these disasters still happen regularly in fields with far more conservative, cautious, and safety-oriented cultures.
I find it unlikely that the field of AI will change its culture from one of moving fast and hacking to something even more conservative and cautious than the cultures of consumer aerospace and nuclear engineering.
Idk, I don’t know what to say here. I meet lots of AI researchers, and the best ones seem to me to be quite thoughtful. I can say what would change my mind:
I take the exploration of unprincipled hacks as very weak evidence against my position, if it’s just in an academic paper. My guess is the researchers themselves would not advocate deploying their solution, or would say that it’s worth deploying but it’s an incremental improvement that doesn’t solve the full problem. And even if the researchers don’t say that, I suspect the companies actually deploying the systems would worry about it.
I would take the deployment of unprincipled hacks more seriously as evidence, but even there I would want to be convinced that shutting down the AI system was a better decision than deploying an unprincipled hack. (Because then I would have made the same decision in their shoes.)
Unprincipled hacks are in fact quite useful for the vast majority of problems; as a result it seems wrong to attribute irrationality to people because they use unprincipled hacks.
Would it be fair to summarize your view here as “Assuming no foom, we’ll be able to iterate, and that’s probably enough.”?
Hmm, I think I’d want to explicitly include two other points, that are kind of included in that but don’t get communicated well by that summary:
There may not be a problem at all; perhaps by default powerful AI systems are not goal-directed.
If there is a problem, we’ll get evidence of its existence before it’s too late, and coordination to not build problematic AI systems will buy us additional time.
Cool, just wanted to make sure I’m engaging with the main argument here. With that out of the way...
I generally buy the “no foom ⇒ iterate ⇒ probably ok” scenario. There are some caveats and qualifications, but broadly-defined “no foom” is a crux for me—I expect at least some kind of decisive strategic advantage for early AGI, and would find the “aligned by default” scenario plausible in a no-foom world.
I do not think that a lack of goal-directedness is particularly relevant here. If an AI has extreme capabilities, then a lack of goals doesn’t really make it any safer. At some point I’ll probably write a post about Don Norman’s fridge which talks about this in more depth, but the short version is: if we have an AI with extreme capabilities but a confusing interface, then there’s a high chance that we all die, goal-direction or not. In the “no foom” scenario, we’re assuming the AI won’t have those extreme capabilities, but it’s foom vs no foom which matters there, not goals vs no goals.
I also disagree with coordination having any hope whatsoever if there is a problem. There’s a huge unilateralist problem there, with millions of people each easily able to push the shiny red button. I think straight-up solving all of the technical alignment problems would be much easier than that coordination problem.
Looking at both the first and third point, I suspect that a sub-crux might be expectations about the resource requirements (i.e. compute & data) needed for AGI. I expect that, once we have the key concepts, human-level AGI will be able to run in realtime on an ordinary laptop. (Training might require more resources, at least early on. That would reduce the unilateralist problem, but increase the chance of decisive strategic advantage due to the higher barrier to entry.)
EDIT: to clarify, those second two points are both conditioned on foom. Point being, the only thing which actually matters here is foom vs no foom:
if there’s no foom, then we can probably iterate, and then we’re probably fine anyway (regardless of goal-direction, coordination, etc).
if there’s foom, then a lack of goal-direction won’t help much, and coordination is unlikely to work.
Yeah, I think I mostly agree with this.
Yeah, I agree with that (assuming “extreme capabilities” = rearranging atoms however it sees fit, or something of that nature), but why must it have a confusing interface? Couldn’t you just talk to it, and it would know what you mean? So I do think the goal-directed point does matter.
I agree that this is a sub-crux. Note that I believe that eventually human-level AGI will be able to run on a laptop, just that it will be preceded by human-level AGIs that take more compute.
I tend to think that if problems arise, you’ve mostly lost already, so I’m actually happier about decisive strategic advantage because it reduces competitive pressure.
But tbc, I broadly agree with all of your points, and do think that in FOOM worlds most of my arguments don’t work. (Though I continue to be confused what exactly a FOOM world looks like.)
That’s where the Don Norman part comes in. Interfaces to complicated systems are confusing by default. The general problem of systematically building non-confusing interfaces is, in my mind at least, roughly equivalent to the full technical problem of AI alignment. (Writing a program which knows what you mean is also, in my mind, roughly equivalent to the full technical problem of AI alignment.) A wording which makes it more obvious:
The main problem of AI alignment is to translate what a human wants into a format usable by a machine
The main problem of user interface design is to help/allow a human to translate what they want into a format usable by a machine
Something like e.g. tool AI puts more of the translation burden on the human, rather than on the AI, but that doesn’t make the translation itself any less difficult.
In a non-foomy world, the translation doesn’t have to be perfect—humanity won’t be wiped out if the AI doesn’t quite perfectly understand what we mean. Extreme capabilities make high-quality translation more important, not just because of Goodhart, but because the translation itself will break down in scenarios very different from what humans are used to. So if the AI has the capabilities to achieve scenarios very different from what humans are used to, then that translation needs to be quite good.
Do you agree that an AI with extreme capabilities should know what you mean, even if it doesn’t act in accordance with it? (This seems like an implication of “extreme capabilities”.)
No. The whole notion of a human “meaning things” presumes a certain level of abstraction. One could imagine an AI simply reasoning about molecules or fields (or at least individual neurons), without having any need for viewing certain chunks of matter as humans who mean things. In principle, no predictive power whatsoever would be lost in that view of the world.
That said, I do think that problem is less central/immediate than the problem of taking an AI which does know what we mean, and pointing at that AI’s concept-of-what-we-mean—i.e. in order to program the AI to do what we mean. Even if an AI learns a concept of human values, we still need to be able to point to that concept within the AI’s concept-space in order to actually align it—and that means translating between AI-notion-of-what-we-want and our-notion-of-what-we-want.
That’s the crux for me; I expect AI systems that we build to be capable of “knowing what you mean” (using the appropriate level of abstraction). They may also use other levels of abstraction, but I expect them to be capable of using that one.
Yes, I would call that the central problem. (Though it would also be fine to build a pointer to a human and have the AI “help the human”, without necessarily pointing to human values.)
How would we do either of those things without a workable theory of embedded agency, abstraction, some idea of what kind of structure human values have, etc.?
If you wanted a provable guarantee before powerful AI systems are actually built, you probably can’t do it without the things you listed.
I’m claiming that as we get powerful AI systems, we could figure out techniques that work with those AI systems. They only initially need to work for AI systems that are around our level of intelligence, and then we can improve our techniques in tandem with the AI systems gaining intelligence. In that setting, I’m relatively optimistic about things like “just train the AI to follow your instructions”; while this will break down in exotic cases or as the AI scales up, those cases are rare and hard to find.
I’m not really thinking about provable guarantees per se. I’m just thinking about how to point to the AI’s concept of human values—directly point to it, not point to some proxy of it, because proxies break down etc.
(Rough heuristic here: it is not possible to point directly at an abstract object in the territory. Even though a territory often supports certain natural abstractions, which are instrumentally convergent to learn/use, we still can’t unambiguously point to that abstraction in the territory—only in the map.)
A proxy is probably good enough for a lot of applications with little scale and few corner cases. And if we’re doing something like “train the AI to follow your instructions”, then a proxy is exactly what we’ll get. But if you want, say, an AI which “tries to help”—as opposed to e.g. an AI which tries to look like it’s helping—then that means pointing directly to human values, not to a proxy.
Now, it is possible that we could train an AI against a proxy, and it would end up pointing to actual human values instead, simply due to imperfect optimization during training. I think that’s what you have in mind, and I do think it’s plausible, even if it sounds a bit crazy. Of course, without better theoretical tools, we still wouldn’t have a way to directly check even in hindsight whether the AI actually wound up pointing to human values or not. (Again, not talking about provable guarantees here; I just want to be able to look at the AI’s own internal data structures and figure out (a) whether it has a notion of human values, and (b) whether it’s actually trying to act in accordance with them, or just something correlated with them.)
Kind of, but not exactly.
I think that whatever proxy is learned will not be a perfect pointer. I don’t know if there is such a thing as a “perfect pointer”, given that I don’t think there is a “right” answer to the question of what human values are, and consequently I don’t think there is a “right” answer to what is helpful vs. not helpful.
I think the learned proxy will be a good enough pointer that the agent will not be actively trying to kill us all, will let us correct it, and will generally do useful things. It seems likely that if the agent was magically scaled up a lot, then bad things could happen due to the errors in the pointer. But I’d hope that as the agent scales up, we improve and correct the pointer (where “we” doesn’t have to be just humans; it could also include other AI assistants).
It’s been argued before that Continuous is not the same as Slow by any normal standard, so the strategy of ‘dealing with things as they come up’, while more viable under a continuous scenario, will probably not be sufficient.
It seems to me like you’re assuming longtermists are very likely not required at all in a case where progress is continuous. I take continuous to just mean that we’re in a world where there won’t be sudden jumps in capability, or apparently useless systems suddenly crossing some threshold and becoming superintelligent, not where progress is slow or easy to reverse. We could still pick a completely wrong approach that makes alignment much more difficult and set ourselves on a likely path towards disaster, even if the following is true:
In a world where continuous but moderately fast takeoff is likely, I can easily imagine doom scenarios that would require long-term strategy or conceptual research early on to avoid, even if none of them involve FOOM. Imagine that the accepted standard for aligned AI is to follow some particular research agenda, like Cooperative Inverse Reinforcement Learning, but it turns out that CIRL starts to behave pathologically and tries to wirehead itself as it gets more and more capable, and that it’s a fairly deep flaw that we can only patch and not avoid.
Let’s say that over the course of a couple of years failures of CIRL systems start to appear and compound very rapidly until they constitute an existential disaster. Maybe people realize what’s going on, but by then it would be too late, because the right approach would have been to try some other approach to AI alignment, but the research to do that doesn’t exist and can’t be done anywhere near fast enough. Something like Paul Christiano’s “What failure looks like”.
In the situations you describe, I would still be somewhat optimistic about coordination. But yeah, such situations leading to doom seem plausible, and this is why the estimate is 90% instead of 95% or 99%. (Though note that the numbers are very rough.)
It seems that the interviewees here either:
1) Use “AI risk” in a narrower way than I do.
2) Neglected to consider some sources/forms of AI risk (see above link).
3) Have considered other sources/forms of AI risk but do not find them worth addressing.
4) Are worried about other sources/forms of AI risk but they weren’t brought up during the interviews.
Can you talk about which of these is the case for yourself (Rohin) and for anyone else whose thinking you’re familiar with? (Or if any of the other interviewees would like to chime in for themselves?)
For context, here’s the one time in the interview I mention “AI risk” (quoting 2 earlier paragraphs for context):
(But it’s still the case that asked “Can you explain why it’s valuable to work on AI risk?” I responded by almost entirely talking about AI alignment, since that’s what I work on and the kind of work where I have a strong view about cost-effectiveness.)
We discussed this here for my interview; my answer is the same as it was then (basically a combination of 3 and 4). I don’t know about the other interviewees.
This sort of reasoning seems to assume that abstraction space is 1-dimensional, so an AI must use human concepts on the path from subhuman to superhuman. I disagree. Like most things we don’t have strong reason to think are 1-dimensional, and which take many bits of information to describe, abstractions seem high-dimensional. So on the path from subhuman to superhuman, the AI must use abstractions that are as predictively useful as human abstractions. These will not be anything like human abstractions unless the system was designed from a detailed neurological model of humans. Any AI that humans can reason about using our inbuilt empathetic reasoning is basically a mind upload, or a mind that differs from humans less than humans differ from each other. This is not what ML will create. Human understanding of AI systems will have to be by abstract mathematical reasoning, the way we understand formal maths. Empathetic reasoning about human-level AI is just asking for anthropomorphism. Our three options are:
1) An AI we don’t understand
2) An AI we can reason about in terms of maths.
3) A virtual human.
While I might agree with the three options at the bottom, I don’t agree with the reasoning to get there.
Abstractions are pretty heavily determined by the territory. Humans didn’t look at the world and pick out “tree” as an abstract concept because of a bunch of human-specific factors. “Tree” is a recurring pattern on earth, and even aliens would notice that same cluster of things, assuming they paid attention. Even on the empathic front, you don’t need a human-like mind in order to notice the common patterns of human behavior (in humans) which we call “anger” or “sadness”.
+1, that’s my response as well.
Some abstractions are heavily determined by the territory. The concept of trees is pretty heavily determined by the territory. Whereas the concept of betrayal is determined by the way that human minds function, which is determined by other people’s abstractions. So while it seems reasonably likely to me that an AI “naturally thinks” in terms of the same low-level abstractions as humans, it thinking in terms of human high-level abstractions seems much less likely, absent some type of safety intervention. Which is particularly important because most of the key human values are very high-level abstractions.
My guess is that if you have to deal with humans, as at least early AI systems will have to do, then abstractions like “betrayal” are heavily determined.
I agree that if you don’t have to deal with humans, then things like “betrayal” may not arise; similarly if you don’t have to deal with Earth, then “trees” are not heavily determined abstractions.
Neural nets have around human-level performance on ImageNet.
If abstraction were a feature of the territory, I would expect the failure cases to be similar to human failure cases. Looking at https://github.com/hendrycks/natural-adv-examples, this does not seem to be the case very strongly, but then again, some of them involve dark shiny stone being classified as a sea lion. The failures aren’t totally inhuman, the way they are with adversarial examples.
I am not saying that trees aren’t a cluster in thingspace. What I am saying is that if there were many clusters in thingspace that were as tight and predictively useful as “Tree”, but were not possible for humans to conceptualize, we wouldn’t know it. There are plenty of concepts that humans didn’t develop for most of human history, despite those concepts being predictively useful, until an odd genius came along or the concept was pinned down by massive experimental evidence, e.g. inclusive genetic fitness, entropy, etc.
Consider that evolution optimized us in an environment that contained trees, and in which predicting them was useful, so it would be more surprising for there to be a concept that is useful in the ancestral environment that we can’t understand than a concept that we can’t understand in a non-ancestral domain.
This looks like a map that is heavily determined by the territory, but human maps contain rivers and not geological rock formations. There could be features that could be mapped that humans don’t map.
If you believe the post that
Then you can form an equally good, nonhuman concept by taking the better alien concept and adding random noise. Of course, an AI trained on text might share our concepts just because our concepts are the most predictively useful ways to predict our writing. I would also like to assign some probability to AI systems that don’t use anything recognizable as a concept. You might be able to say 90% of blue objects are egg-shaped, 95% of cubes are red … 80% of furred objects that glow in the dark are flexible … without ever splitting objects into bleggs and rubes. Seen from this perspective, you have a density function over thingspace, and a sum of clusters might not be the best way to describe it. AIXI never talks about trees; it just simulates every quantum. Maybe there are fast algorithms that don’t even ascribe discrete concepts.
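(Another illustrative aside, not from the comment itself: the “density over thingspace, without discrete concepts” framing can be sketched in a few lines, using made-up two-dimensional blegg/rube-style features. The same data can be modeled either as a mixture of discrete clusters or as a raw density that never assigns any cluster label.)
```python
# Toy sketch with made-up "thingspace" features (e.g. blueness, egg-shapedness);
# purely illustrative, the names and numbers are invented.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Two overlapping blobs of objects in a 2-D thingspace.
features = np.vstack([
    rng.normal([0.8, 0.9], 0.1, size=(500, 2)),  # blue-ish, egg-shaped-ish objects
    rng.normal([0.2, 0.1], 0.1, size=(500, 2)),  # red-ish, cube-ish objects
])

# Cluster view: decompose the density into discrete concepts ("blegg", "rube").
clusters = GaussianMixture(n_components=2, random_state=0).fit(features)

# Density view: model the same distribution with no discrete concepts at all.
density = KernelDensity(bandwidth=0.05).fit(features)

x = np.array([[0.75, 0.85]])
print(clusters.predict(x))        # assigns a discrete cluster label
print(density.score_samples(x))   # just reports a log-density, no labels
```
Both models describe the same distribution; only the first one commits to anything like a discrete “blegg”/“rube” concept.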
But those trained neural nets are very subhuman on other image understanding tasks.
I would expect that the alien concepts are something we haven’t figured out because we don’t have enough data or compute or logic or some other resource, and that constraint will also apply to the AI. If you take that concept and “add random noise” (which I don’t really understand), it would presumably still require the same amount of resources, and so the AI still won’t find it.
For the rest of your comment, I agree that we can’t theoretically rule those scenarios out, but there’s no theoretical reason to rule them in either. So far the empirical evidence seems to me to be in favor of “abstractions are determined by the territory”, e.g. ImageNet neural nets seems to have human-interpretable low-level abstractions (edge detectors, curve detectors, color detectors), while having strange high-level abstractions; I claim that the strange high-level abstractions are bad and only work on ImageNet because they were specifically designed to do so and ImageNet is sufficiently narrow that you can get to good performance with bad abstractions.
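(A final illustrative aside, not from the thread: the claim about human-interpretable low-level abstractions is easy to eyeball by visualizing the first-layer convolutional filters of an ImageNet-trained network, which typically look like oriented edge and color detectors. A minimal sketch, assuming a recent torchvision and matplotlib:)
```python
# Minimal sketch, assuming a recent torchvision (>= 0.13) and matplotlib.
import torchvision.models as models
import matplotlib.pyplot as plt

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
filters = model.conv1.weight.detach()            # shape: (64, 3, 7, 7)

# Rescale to [0, 1] so each filter can be displayed as a small RGB patch.
filters = (filters - filters.min()) / (filters.max() - filters.min())

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, f in zip(axes.flat, filters):
    ax.imshow(f.permute(1, 2, 0))                # (3, 7, 7) -> (7, 7, 3)
    ax.axis("off")
plt.show()
```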
By adding random noise, I meant adding wiggles to the edge of the set in thingspace; for example, adding noise to “bird” might exclude “ostrich” and include “duck-billed platypus”.
I agree that the high-level ImageNet concepts are bad in this sense; however, are they just bad? If they were just bad and the limit to finding good concepts was data or some other resource, then we should expect small children and mentally impaired people to have similarly bad concepts. This would suggest a single gradient from better to worse. If, however, current neural networks used concepts substantially different from those of small children, and not just uniformly worse or uniformly better, that would show different sets of concepts at the same low level. This would be fairly strong evidence of multiple sets of concepts at the smart-human level.
I would also want to point out that a small fraction of the concepts being different would be enough to make alignment much harder. Even if there were a perfect scale, if 1/3 of the concepts are subhuman, 1/3 human-level and 1/3 superhuman, it would be hard to understand the system. To get any safety, you need to get your system very close to human concepts. And you need to be confident that you have hit this target.
From the transcript with Paul Christiano.
I don’t understand. Maybe it is just the case that there’s no value left after a large number of things that reduces the expected value by 10%?
Paul is implicitly conditioning his actions on being in a world where there’s a decent amount of expected value left for his actions to affect. This is technically part of a decision procedure, rather than a statement about epistemic credences, but it’s confusing because he frames it as an epistemic credence.
See also this thread.
Nice to see that there are not just radical positions in the AI safety crowd, and that there is a drift away from alarmism and towards “let’s try various approaches, iterate, and see what we can learn” instead of “we must figure out AI safety first, or else!” Also, Christiano’s approach of “let’s at least ensure we can build something reasonably safe for the near term” (since one way or another, something will get built) has at least a chance of success.
My personal guess, as someone who knows nothing about ML and very little about AI safety, but a non-zero amount about research and development in general, is that the embedded agency problems are way too deep to be satisfactorily resolved before ML gets the AI to the level of an average programmer. But maybe MIRI, like the NSA, has a few tricks up its sleeve that are not visible to the general public. Though this does not seem likely, otherwise a lot of the recent discussions of embedded agency would be smoke and mirrors, not something MIRI is likely to engage in.
This was a very illuminating newsletter. It is nice to hear a diversity of perspectives on alignment.
How accurate is it to say that MIRI believes a mathematical solution to the alignment problem is the only solution? Does MIRI think that without a formal proof of an AGI’s safety, it will cause human extinction?
“There can’t be too many things that reduce the expected value of the future by 10%; if there were, there would be no expected value left.”
This is the argument-from-consequences fallacy. There may be many things that could destroy the future with high probability and we are simply doomed. But the more interesting scenario, and a much better working assumption, is that there are potentially dangerous things that are likely to destroy the future IF we don’t seek to understand them and try to correct them by concerted effort, as opposed to continuing on with the level of effort and concern we have now.
See this response.