when we look at the distribution of opinion among those who have really “engaged with the arguments”, we are left with a substantial majority—maybe everyone but Hanson, depending on how stringent our standards are here!—who do believe that, one way or another, AI development poses a serious existential risk.
For what it’s worth, I have “engaged with the arguments” but am still skeptical of the main arguments. I also don’t think that my optimism is very unusual for people who work on the problem, either. Based on an image from about five years ago (the same time Nick Bostrom’s book came out), most people at FHI were pretty optimistic. Since then, it’s my impression that researchers have become even more optimistic, since more people appear to accept continuous takeoff and there’s been a shift in arguments. AI Impacts recently interviewed a few researchers who were also skeptical (including Hanson), and all of them have engaged with the main arguments. It’s unclear to me that their opinions are actually substantially more optimistic than average.
The set of arguments that are being actively discussed by AI safety researchers obviously changed since 2014 (which is true for any active field?). I assume that by “there’s been a shift in arguments” you mean something more than that, but I’m not sure what.
Is there any core argument in the book Superintelligence that is no longer widely accepted among AI safety researchers? Does the progress in deep learning since 2014 make the core arguments in the book less compelling? (Do the arguments about instrumental convergence and Goodhart’s law fail to apply to deep RL?)
If the old arguments were sound, why would researchers shift their arguments in order to make the case that AI posed a risk? I’d assume that if the old arguments worked, the new ones would be a refinement rather than a shift. Indeed many old arguments were refined, but a lot of the new arguments seem very new.
Is there any core argument in the book Superintelligence that is no longer widely accepted among AI safety researchers?
I can’t speak for others, but the general notion of there being a single project that leaps ahead of the rest of the world, and gains superintelligent competence before any other team can even get close, seems suspicious to many researchers that I’ve talked to. In general, the notion that there will be discontinuities in development is looked on with suspicion by a number of people (though, notably, some researchers still think that fast takeoff is likely).
This makes sense. However, I’d still point out that this is evidence that the arguments weren’t convincing, since otherwise they would have used the same arguments, even though they are different people.
If the old arguments were sound, why would researchers shift their arguments in order to make the case that AI posed a risk? I’d assume that if the old arguments worked, the new ones would be a refinement rather than a shift. Indeed many old arguments were refined, but a lot of the new arguments seem very new.
I’m not sure I understand your model. Suppose AI safety researcher Alice writes a post about a problem that Nick Bostrom did not discuss in Superintelligence back in 2014 (e.g. the inner alignment problem). That doesn’t seem to me like meaningful evidence for the proposition “the arguments in Superintelligence are not sound”.
I can’t speak for others, but the general notion of there being a single project that leaps ahead of the rest of the world, and gains superintelligent competence before any other team can even get close, seems suspicious to many researchers that I’ve talked to.
It’s been a while since I listened to the audiobook version of Superintelligence, but I don’t recall the book arguing that the “second‐place AI lab” will likely be far behind the leading AI lab (in subjective human time) before we get superintelligence. And even if it had argued for that, as important as such an estimate may be, how is it relevant to the basic question of whether AI Safety is something humankind should be thinking about?
In general, the notion that there will be discontinuities in development is looked on with suspicion by a number of people (though, notably, some researchers still think that fast takeoff is likely).
I don’t recall the book relying on (or [EDIT: with a lot less confidence] even mentioning the possibility of) a discontinuity in capabilities. I believe it does argue that once there are AI systems that can do anything humans can, we can expect extremely fast progress.
I’m not sure I understand your model. Suppose AI safety researcher Alice writes a post about a problem that Nick Bostrom did not discuss in Superintelligence back in 2014 (e.g. the inner alignment problem)
I would call the inner alignment problem a refinement of the traditional argument for AI risk. The traditional argument was that there was going to be a powerful system that had a utility function it was maximizing and it might not match ours. Inner alignment says, well, it’s not exactly like that. There’s going to be a loss function used to train our AIs, and the AIs themselves will have internal objective functions that they are maximizing, and both of these might not match ours.
If all the new arguments were mere refinements of the old ones, then my argument would not work. I don’t think that all the new ones are refinements of the old ones, however. For an example, try to map “What failure looks like” onto Nick Bostrom’s model for AI risk. Influence-seeking sorta looks like what Nick Bostrom was talking about, but I don’t think “Going out with a whimper” is what he had in mind (I haven’t read the book in a while though).
It’s been a while since I read Superintelligence, but I don’t recall the book arguing that the “second‐place AI lab” will likely be far behind the leading AI lab (in subjective human time) before we get superintelligence.
My understanding is that he spent one chapter talking about multipolar outcomes, and the rest of the book talking about unipolar outcomes, where a single team gains a decisive strategic advantage over the rest of the world (which seems impossible unless a single team surges forward in development). Robin Hanson had the same critique in his review of the book.
And even if it had argued for that, as important as such an estimate may be, how is it relevant to the basic question of whether AI Safety is something humankind should be thinking about?
If AI takeoff is more gradual, there will be warning signs for each risk before it unrolls into a catastrophe. Consider any single source of existential risk from AI, and I can plausibly point to a source of sub-existential risk that would occur in less powerful AI systems. If we ignore risk, then a disaster would occur, but it would be minor, and this would set a precedent for safety in the future.
This is important because if you have the point of view that AI safety must be solved ahead of time, before we actually build the powerful systems, then I would want to see specific technical reasons for why it will be so hard that we won’t solve it during the development of those systems.
It’s possible that we don’t have good arguments yet, but good arguments could present themselves eventually and it would be too late at that point to go back in time and ask people in the past to start work on AI safety. I agree with this heuristic (though it’s weak, and should only be used if there are not other more pressing existential risks to work on).
I also agree that there are conceptual arguments for why we should start AI safety work now, and I’m not totally convinced that the future will be either kind or safe to humanity. It’s worth understanding the arguments for and against AI safety, lest we treat it as a team to be argued for.
Inner alignment says, well, it’s not exactly like that. There’s going to be a loss function used to train our AIs, and the AIs themselves will have internal objective functions that they are maximizing, and both of these might not match ours.
As I understand the language, the “loss function used to train our AIs” matches “our objective function” from the classical outer alignment problem. The inner alignment problem seems to me to be a separate problem rather than a “refinement of the traditional argument” (we can fail due to just an inner alignment problem; and we can fail due to just an outer alignment problem).
My understanding is that he spent one chapter talking about multipolar outcomes, and the rest of the book talking about unipolar outcomes
I’m not sure what you mean by saying “the rest of the book talking about unipolar outcomes”. In what way do the parts in the book that discuss the orthogonality thesis, instrumental convergence and Goodhart’s law assume or depend on a unipolar outcome?
This is important because if you have the point of view that AI safety must be solved ahead of time, before we actually build the powerful systems, then I would want to see specific technical reasons for why it will be so hard that we won’t solve it during the development of those systems.
Can you give an example of a hypothetical future AI system—or some outcome thereof—that should indicate that humankind ought to start working a lot more on AI safety?
The inner alignment problem seems to me to be a separate problem rather than a “refinement of the traditional argument”
By refinement, I meant that the traditional problem of value alignment was decomposed into two levels, and at both levels, values need to be aligned. I am not quite sure why you have framed this as separate rather than a refinement?
I’m not sure what you mean by saying “the rest of the book talking about unipolar outcomes”. In what way do the parts in the book that discuss the orthogonality thesis, instrumental convergence and Goodhart’s law assume or depend on a unipolar outcome?
The arguments for why those things pose a risk were the relevant part of the book. Specifically, it argued that because of those factors, and the fact that a single project could gain control of the world, it was important to figure everything out ahead of time, rather than waiting until the project was close to completion, because we don’t get a second chance.
The analogy of children playing with a bomb is a particular example. If Bostrom had opted for presenting a gradual narrative, perhaps he would have said that the children will be given increasingly powerful firecrackers and will see the explosive power grow and grow. Or perhaps the sparrows would have trained a population of mini-owls before getting a big owl.
Can you give an example of a hypothetical future AI system—or some outcome thereof—that should indicate that humankind ought to start working a lot more on AI safety?
I don’t think there’s a single moment that should cause people to panic. Rather, it will be a gradual transition into more powerful technology.
I get the sense that the crux here is more between fast / slow takeoffs than unipolar / multipolar scenarios.
In the case of a gradual transition into more powerful technology, what happens when the children of your analogy discover recursive self improvement?
Even recursive self improvement can be framed gradually. Recursive technological improvement is thousands of years old. The phenomenon of technology allowing us to build better technology has sustained economic growth. Recursive self improvement is simply a very local form of recursive technological improvement.
You could imagine systems will gradually get better at recursive self improvement. Some will improve themselves sort-of well, and these systems will pose risks. Some other systems will improve themselves really well, and pose greater risks. But we would have seen the latter phenomenon coming ahead of time.
And since there’s no hard separation between recursive technological improvement and recursive self improvement, you could imagine technological improvement getting gradually more local, until all the relevant action is from a single system improving itself. In that case, there would also be warning signs before it was too late.
This framing really helped me think about gradual self-improvement, thanks for writing it down!
I agree with most of what you wrote. I still feel that in the case of an AGI re-writing its own code there’s some sense of intent that hasn’t been explicitly happening for the past thousand years.
Agreed, you could still model Humanity as some kind of self-improving Human + Computer Colossus (cf. Tim Urban’s framing) that somehow has some agency. But it’s much less effective at improving itself, and it’s not thinking “yep, I need to invent this new science to optimize this utility function”. I agree that the threshold is “when all the relevant action is from a single system improving itself”.
there would also be warning signs before it was too late
And what happens then? Will we reach some kind of global consensus to stop any research in this area? How long will it take to build a safe “single system improving itself”? How will all the relevant actors behave in the meantime?
My intuition is that in the best scenario we reach some kind of AGI Cold War situation for long periods of time.
For what it’s worth, I have “engaged with the arguments” but am still skeptical of the main arguments. I also don’t think that my optimism is very unusual for people who work on the problem, either.
I’m curious if you’ve seen The Main Sources of AI Risk? Have you considered all of those sources/kinds of risks and still think that the total AI-related x-risk is not very large?
[ETA: It’s unfortunate I used the word “optimism” in my comment, since my primary disagreement is whether the traditional sources of AI risk are compelling. I’m pessimistic in a sense, since I think by default our future civilization’s values will be quite different from mine in important ways.]
My opinion is that AI is likely to be an important technology whose effects will largely determine our future civilization, and the outlook for humanity. And given that AI will be so large, its impact will also largely determine whether our values go extinct or survive. That said, it’s difficult to understand the threat to our values from AI without a specific threat model. I appreciate trying to find specific ways that AI can go wrong, but I currently think:
We are probably not close enough to powerful AI to have a good understanding of the primary dynamics of an AI takeoff, and therefore what type of work will help our values survive one.
The way our values might go extinct will probably happen in some unavoidable manner that’s not related to the typical sources of AI risk. In other words, it’s likely that just general value drift and game theoretic incentives will do more to destroy the value of the long-term future than technical AI errors.
The argument that continuous takeoff makes AI safe seems robust to most specific items on your list, though I can see several ways that the argument fails.
If AI does go wrong in one of the ways you have identified, it seems difficult to predict which one (though we can do our best to guess). It seems even harder to do productive work, since I’m skeptical of very short timelines.
Historically, our models of AI development have been notoriously poor. Ask someone from 10 years ago what they think AI might look like, and it seems unlikely that they would have predicted deep learning in a way that would have been useful to make it safer. I suspect that unless AI is very soon, it will be very hard to do specific technical work to make it safer.
It’s unfortunate I used the word “optimism” in my comment, since my primary disagreement is whether the traditional sources of AI risk are compelling.
May I beseech you to be more careful about using “optimism” and words like it in the future, because I’m really worried about strategy researchers and decision makers getting the wrong impression from AI safety researchers about how hard the overall AI risk problem is, and for some reason I keep seeing people say that they’re “optimistic” (or other words to that effect) when they mean optimistic about some sub-problem of AI risk instead of AI risk as a whole, but they don’t make that clear. In many cases it’s pretty predictable that people outside technical AI safety research (or even inside, like in this case) would often misinterpret that as being optimistic about AI risk.
Note that lack of ability to know what alignment work would be useful to do ahead of time increases, rather than decreases, the absolute level of risk; thus, it increases rather than decreases the risk metrics (e.g. probability of humans being wiped out) that FHI estimated.
It could be that the level of absolute risk is still low, even after taking this into account. I concede that estimating risks like these is very difficult.
The argument that continuous takeoff makes AI safe seems robust to most specific items on your list, though I can see several ways that the argument fails.
I feel like this depends on a whole bunch of contingent facts regarding our ability to accurately diagnose and correct what could be very pernicious problems such as deceptive alignment amidst what seems quite likely to be a very quickly changing and highly competitive world.
It seems even harder to do productive work, since I’m skeptical of very short timelines.
Why does being skeptical of very short timelines preclude our ability to do productive work on AI safety? Surely there are things we can be doing now to gain insight, build research/organizational capacity, etc. that will at least help somewhat, no? (And it seems to me like “probably helps somewhat” is enough when it comes to existential risk.)
I feel like this depends on a whole bunch of contingent facts regarding our ability to accurately diagnose and correct what could be very pernicious problems such as deceptive alignment amidst what seems quite likely to be a very quickly changing and highly competitive world.
I agree, though I tend to think the costs associated with failing to catch deception will be high enough that any major team will be likely to bear the costs. If some team of researchers doesn’t put in the effort, a disaster would likely occur that would be sub-x-risk level, and this would set a precedent for safety standards.
In general, I think humans tend to be very risk averse when it comes to new technologies, though there are notable exceptions (such as during wartime).
Why does being skeptical of very short timelines preclude our ability to do productive work on AI safety?
A full solution to AI safety will necessarily be contingent on the architectures used to build AIs. If we don’t understand a whole lot about those architectures, this limits our ability to do concrete work. I don’t find the argument entirely compelling because:
It seems reasonably likely that AGI will be built using more-or-less the deep learning paradigm, perhaps given a few insights, and therefore productive work can be done now, and
We can still start institutional work, and develop important theoretical insights.
But even given these qualifications, I estimate that the vast majority of productive work to make AIs safe will be completed when the AI systems are actually built, rather than before. It follows that most work during this pre-AGI period might miss important details and be less effective than we think.
And it seems to me like “probably helps somewhat” is enough when it comes to existential risk
I agree, which is why I spend a lot of my time reading and writing posts on LessWrong about AI risk.
It follows that most work during this pre-AGI period might miss important details and be less effective than we think.
Do you think AI alignment researchers have not taken this into consideration already? For example, I’m pretty sure I’ve read arguments from Paul Christiano for why he is working on his approach even though we don’t know how AGI will be built. MIRI people have made such arguments too, I think.
I’m not claiming any sort of knock-down argument. I understand that individual researchers often have very thoughtful reasons for thinking that their approach will work. I just take the heuristic seriously that it is very difficult to predict the future, or to change the course of history in a predictable way. My understanding of past predictions of the future is that they have been more-or-less horrible, and so skepticism of any particular line of research is pretty much always warranted.
In case you think AI alignment researchers are unusually good at predicting the future, and you would put them in a different reference class, I will point out that the type of AI risk stuff people on LessWrong talk about now is different in meaningful ways from the stuff that was talked about five or ten years ago on here.
To demonstrate, a common assumption was that in the absence of advanced AI architecture design, we could minimally assume that an AI would maximize a utility function, since a utility function is a useful abstraction that seems robust to architectural changes in our underlying AI designs or future insights. The last few years have seen many people here either rejecting this argument, or finding it to be vacuous, or underspecified as an argument. (I’m not taking a hard position, I’m merely pointing out that this shift has occurred).
People also assumed that, in the absence of advanced AI architecture design, we could assume that an AI’s first priority would be to increase its own intelligence, prompting researchers to study stable recursive self-improvement. Again, the last few years have seen people here rejecting this argument, or concluding that it’s not a priority for research. (Once again, I’m not here to argue whether this specific shift was entirely justified).
I suspect that even very reasonable sounding arguments of the type, “Well, we might not know what AI will look like, but minimally we can assume X, and X is a tractable line of research” will turn out to be suspicious in the end. That’s not to say that some of these arguments won’t be correct. Perhaps, if we’re very careful, we can find out which ones are correct. I just have a strong heuristic of assuming future cluelessness.
When you say “the last few years have seen many people here” for your 2nd/3rd paragraph, do you have any posts / authors in mind to illustrate?
I agree that there has been a shift in what people write about because the field grew (as Daniel Filan pointed out). However, I don’t remember reading anyone dismiss convergent instrumental goals such as increasing your own intelligence or utility functions as a useful abstraction to think about agency.
In your thread with ofer, he asked what was the difference between using loss functions in neural nets vs. objective function / utility functions and I haven’t fully caught your opinion on that.
When you say “the last few years have seen many people here” for your 2nd/3rd paragraph, do you have any posts / authors in mind to illustrate?
For the utility of talking about utility functions, see this rebuttal of an argument justifying the use of utility functions by appealing to the VNM-utility theorem, and a few more posts expanding the discussion. The CAIS paper argues that we shouldn’t model future AI as having a monolithic long-term utility function. But it’s by no means a settled debate.
For the rejection of stable self improvement as a research priority, Paul Christiano wrote a post in 2014 where he argued that stable recursive self improvement will be solved as a special case of reasoning under uncertainty. And again, the CAIS model proposes that technological progress will feed into itself (not unlike what already happens), rather than a monolithic agent improving itself.
I get the impression that very few people outside of MIRI work on studying stable recursive self improvement, though this might be because they think it’s not their comparative advantage.
I agree that there has been a shift in what people write about because the field grew (as Daniel Filan pointed out). However, I don’t remember reading anyone dismiss convergent instrumental goals such as increasing your own intelligence or utility functions as a useful abstraction to think about agency.
There’s a difference between accepting something as a theoretical problem, and accepting that it’s a tractable research priority. I was arguing that the type of work we do right now might not be useful for future researchers, and so I wasn’t trying to say that these things didn’t exist. Rather, it’s not clear that productive work can be done on them right now. My evidence was that the way we think about these problems has changed over the years. Of course, you could say that the reason why the research focuses shifted is because we made progress, but I’d be skeptical about that hypothesis.
In your thread with ofer, he asked what was the difference between using loss functions in neural nets vs. objective function / utility functions and I haven’t fully caught your opinion on that.
I don’t quite understand the question? It’s my understanding that I was disputing a notion that inner alignment should count as a “shift in arguments” for AI risk. I claimed that it was a refinement of the traditional arguments; more specifically, we decomposed the value alignment problem into two levels. I’m quite confused about what I’m missing here.
Thanks for all the references! I don’t have much time to read all of it right now so I can’t really engage with the specific arguments for the rejection of using utility functions/studying recursive self-improvement.
I essentially agree with most of what you wrote. There is maybe a slight disagreement in how you framed (not what you meant) how research focus shifted since 2014.
I see Superintelligence as essentially saying “hey, there is problem A. And even if we solve A, then we might also have B. And given C and D, there might be E.” Now that the field is more mature and we have many more researchers getting paid to work on these problems, the arguments became much more goal focused. Now people are saying “I’m going to make progress on sub-problem X, by publishing a paper on Y. And working on Z is not cost-effective, so I’m not going to work on it given humanity’s current time constraints.”
These approaches are often grouped as “long-term problems-focused” and “making tractable progress now focused”. In the first group you have Yudkowsky 2010, Bostrom 2014, MIRI’s current research and maybe CAIS. In the second one you have current CHAI/FHI/OpenAI/DeepMind/Ought papers.
Your original framing can be interpreted as “after proving some mathematical theorems, people rejected the main arguments of Superintelligence and now most of the community agrees that working on X, Y and Z is tractable but A, B and C are more controversial”.
I think a more nuanced and precise framing would be: “In Superintelligence Bostrom exhaustively lays out the risks associated with advanced AI. A short portion of the book is dedicated to the problems people are working on right now. Indeed, people stopped working on the other problems (the largest portion of the book) because 1) there hasn’t been really productive work on them 2) some rebuttals have been written online giving convincing arguments that those problems are not tractable anyway 3) there are now well-funded research organizations with incentives to make tangible progress on those problems.”
In your last framing, you presented precise papers/rebuttals (thanks again!) for 2), and I think rebuttals are a great reason to stop working on a problem, but I think they’re not the only reason and not the real reason people stopped working on those problems. To be fair, I think 1) can be explained by many more factors than “it’s theoretically impossible to make progress on those problems”. It can be that the research mindset required to work on these problems is less socially/intellectually validating or requires much more theoretical approaches, so it will be off-putting/tiresome to most recent grads that enter the field. I also think that AI Safety is now much more intertwined with evidence-based approaches such as Effective Altruism than it was in 2014, which explains 3), so people start presenting their research as “partial solutions to the problem of AI Safety” or “research agenda”.
To be clear, I’m not criticizing the current shift in research. I think it’s productive for the field, both in the short term and long term. To give a bit more personal context, I started getting interested in AI Safety after reading Bostrom and have always been more interested in the “finding problems” approach. I went to FHI to work on AI Safety because I was super interested in finding new problems related to the treacherous turn. It’s now almost taboo to say that we’re working on problems that are sub-optimally minimizing AI risk, but the real reason that pushed me to think about those problems was because they were both important and interesting. The problem with the current “shift in framing” is that it’s making it socially unacceptable for people to think/work on more long-term problems where there is more variance in research productivity.
I don’t quite understand the question?
Sorry about that. I thought there was some link to our discussion about utility functions but I misunderstood.
EDIT: I also wanted to mention that the number of pages in a book doesn’t account for how important the author thinks the problem is (Bostrom even comments on this in the postface of his book). Again, the book is mostly about saying “here are all the problems”, not “these are the tractable problems we should start working on, and we should dedicate research resources proportionally to the number of pages I talk about it in the book”.
I feel like you are drawing the wrong conclusion from the shift in arguments that has occurred. I would argue that what look like wrong ideas that ended up not contributing to future research could actually have been quite necessary for progressing the field’s understanding as a whole. That is, maybe we really needed to engage with utility functions first before we could start breaking down that assumption—or maybe optimization daemons were a necessary step towards understanding mesa-optimization. Thus, I don’t think the shift in arguments at all justifies the conclusion that prior work wasn’t very helpful, as the prior work could have been necessary to achieve that very shift.
I think this justification for doing research now is valid. However, I think that as the systems develop further, researchers would be forced to shift their arguments for risk anyway, since the concrete ways that the systems go wrong would be readily apparent. It’s possible that by that time it would be “too late” as the problems of safety are just too hard and researchers would have wished they had made conceptual progress sooner (I’m pretty skeptical of this though).
I can’t speak for others, but the general notion of there being a single project that leaps ahead of the rest of the world, and gains superintelligent competence before any other team can even get close, seems suspicious to many researchers that I’ve talked to. In general, the notion that there will be discontinuities in development is looked at with suspicion by a number of people (though, notably, some researchers still think that fast takeoff is likely).
I get the sense that a lot of it is different people writing about it rather than people changing their minds.
This makes sense. However, I’d still point out that this is evidence that the arguments weren’t convincing, since otherwise they would have used the same arguments, even though they are different people.
I’m not sure I understand your model. Suppose AI safety researcher Alice writes a post about a problem that Nick Bostrom did not discuss in Superintelligence back in 2014 (e.g. the inner alignment problem). That doesn’t seem to me like meaningful evidence for the proposition “the arguments in Superintelligence are not sound”.
It’s been a while since I listened to the audiobook version of Superintelligence, but I don’t recall the book arguing that the “second-place AI lab” will likely be far behind the leading AI lab (in subjective human time) before we get superintelligence. And even if it had argued for that, as important as such an estimate may be, how is it relevant to the basic question of whether AI Safety is something humankind should be thinking about?
I don’t recall the book relying on (or [EDIT: with a lot less confidence] even mentioning the possibility of) a discontinuity in capabilities. I believe it does argue that once there are AI systems that can do anything humans can, we can expect extremely fast progress.
I would call the inner alignment problem a refinement of the traditional argument from AI risk. The traditional argument was that there was going to be a powerful system that had a utility function it was maximizing and it might not match ours. Inner alignment says, well, it’s not exactly like that. There’s going to be a loss function used to train our AIs, and the AIs themselves will have internal objective functions that they are maximizing, and both of these might not match ours.
If all the new arguments were mere refinements of the old ones, then my argument would not work. I don’t think that all the new ones are refinements of the old ones, however. For an example, try to map what failure looks like onto Nick Bostrom’s model for AI risk. Influence-seeking sorta looks like what Nick Bostrom was talking about, but I don’t think “Going out with a whimper” is what he had in mind (I haven’t read the book in a while though).
My understanding is that he spent one chapter talking about multipolar outcomes, and the rest of the book talking about unipolar outcomes, where a single team gains a decisive strategic advantage over the rest of the world (which seems impossible unless a single team surges forward in development). Robin Hanson had the same critique in his review of the book.
If AI takeoff is more gradual, there will be warning signs for each risk before it unrolls into a catastrophe. Consider any single source of existential risk from AI, and I can plausibly point to a source of sub-existential risk that would occur in less powerful AI systems. If we ignore risk, then a disaster would occur, but it would be minor, and this would set a precedent for safety in the future.
This is important because if you have the point of view that AI safety must be solved ahead of time, before we actually build the powerful systems, then I would want to see specific technical reasons for why it will be so hard that we won’t solve it during the development of those systems.
It’s possible that we don’t have good arguments yet, but good arguments could present themselves eventually and it would be too late at that point to go back in time and ask people in the past to start work on AI safety. I agree with this heuristic (though it’s weak, and should only be used if there are not other more pressing existential risks to work on).
I also agree that there are conceptual arguments for why we should start AI safety work now, and I’m not totally convinced that the future will be either kind or safe to humanity. It’s worth understanding the arguments for and against AI safety, lest we treat it as a team to be argued for.
As I understand the language, the “loss function used to train our AIs” matches “our objective function” from the classical outer alignment problem. The inner alignment problem seems to me to be a separate problem rather than a “refinement of the traditional argument” (we can fail due to just an inner alignment problem, and we can fail due to just an outer alignment problem).
I’m not sure what you mean by saying “the rest of the book talking about unipolar outcomes”. In what way do the parts in the book that discuss the orthogonality thesis, instrumental convergence and Goodhart’s law assume or depend on a unipolar outcome?
Can you give an example of a hypothetical future AI system—or some outcome thereof—that should indicate that humankind ought to start working a lot more on AI safety?
By refinement, I meant that the traditional problem of value alignment was decomposed into two levels, and at both levels, values need to be aligned. I am not quite sure why you have framed this as separate rather than a refinement?
The arguments for why those things pose a risk were the relevant part of the book. Specifically, it argued that because of those factors, and the fact that a single project could gain control of the world, it was important to figure everything out ahead of time, rather than waiting until the project was close to completion, because we don’t get a second chance.
The analogy of children playing with a bomb is a particular example. If Bostrom had opted for presenting a gradual narrative, perhaps he would have said that the children will be given increasingly powerful firecrackers and will see the explosive power grow and grow. Or perhaps the sparrows would have trained a population of mini-owls before getting a big owl.
I don’t think there’s a single moment that should cause people to panic. Rather, it will be a gradual transition into more powerful technology.
I get the sense that the crux here is more between fast / slow takeoffs than unipolar / multipolar scenarios.
In the case of a gradual transition into more powerful technology, what happens when the children of your analogy discover recursive self improvement?
Even recursive self improvement can be framed gradually. Recursive technological improvement is thousands of years old. The phenomenon of technology allowing us to build better technology has sustained economic growth. Recursive self improvement is simply a very local form of recursive technological improvement.
You could imagine systems will gradually get better at recursive self improvement. Some will improve themselves sort-of well, and these systems will pose risks. Some other systems will improve themselves really well, and pose greater risks. But we would have seen the latter phenomenon coming ahead of time.
And since there’s no hard separation between recursive technological improvement and recursive self improvement, you could imagine technological improvement getting gradually more local, until all the relevant action is from a single system improving itself. In that case, there would also be warning signs before it was too late.
This framing really helped me think about gradual self-improvement, thanks for writing it down!
I agree with most of what you wrote. I still feel that in the case of an AGI re-writing its own code there’s some sense of intent that hasn’t been explicitly happening for the past thousand years.
Agreed, you could still model Humanity as some kind of self-improving Human + Computer Colossus (cf. Tim Urban’s framing) that somehow has some agency. But it’s much less effective at improving itself, and it’s not thinking “yep, I need to invent this new science to optimize this utility function”. I agree that the threshold is “when all the relevant action is from a single system improving itself”.
And what happens then? Will we reach some kind of global consensus to stop any research in this area? How long will it take to build a safe “single system improving itself”? How will all the relevant actors behave in the meantime?
My intuition is that in the best scenario we reach some kind of AGI Cold War situation for long periods of time.
I’m curious if you’ve seen The Main Sources of AI Risk? Have you considered all of those sources/kinds of risks and still think that the total AI-related x-risk is not very large?
[ETA: It’s unfortunate I used the word “optimism” in my comment, since my primary disagreement is whether the traditional sources of AI risk are compelling. I’m pessimistic in a sense, since I think by default our future civilization’s values will be quite different from mine in important ways.]
My opinion is that AI is likely to be an important technology whose effects will largely determine our future civilization, and the outlook for humanity. And given that AI will be so large, its impact will also largely determine whether our values go extinct or survive. That said, it’s difficult to understand the threat to our values from AI without a specific threat model. I appreciate trying to find specific ways that AI can go wrong, but I currently think:
We are probably not close enough to powerful AI to have a good understanding of the primary dynamics of an AI takeoff, and therefore what type of work will help our values survive one.
The way our values might go extinct will probably happen in some unavoidable manner that’s not related to the typical sources of AI risk. In other words, it’s likely that just general value drift and game theoretic incentives will do more to destroy the value of the long-term future than technical AI errors.
The argument that continuous takeoff makes AI safe seems robust to most specific items on your list, though I can see several ways that the argument fails.
If AI does go wrong in one of the ways you have identified, it seems difficult to predict which one (though we can do our best to guess). It seems even harder to do productive work, since I’m skeptical of very short timelines.
Historically, our models of AI development have been notoriously poor. Ask someone from 10 years ago what they think AI might look like, and it seems unlikely that they would have predicted deep learning in a way that would have been useful to make it safer. I suspect that unless AI is very soon, it will be very hard to do specific technical work to make it safer.
May I beseech you to be more careful about using “optimism” and words like it in the future? I’m really worried about strategy researchers and decision makers getting the wrong impression from AI safety researchers about how hard the overall AI risk problem is. For some reason I keep seeing people say that they’re “optimistic” (or other words to that effect) when they mean optimistic about some sub-problem of AI risk rather than AI risk as a whole, without making that clear. In many cases it’s pretty predictable that people outside technical AI safety research (or even inside, like in this case) would misinterpret that as being optimistic about AI risk.
Note that lack of ability to know what alignment work would be useful to do ahead of time increases, rather than decreases, the absolute level of risk; thus, it increases rather than decreases the risk metrics (e.g. probability of humans being wiped out) that FHI estimated.
It could be that the level of absolute risk is still low, even after taking this into account. I concede that estimating risks like these is very difficult.
I feel like this depends on a whole bunch of contingent facts regarding our ability to accurately diagnose and correct what could be very pernicious problems such as deceptive alignment amidst what seems quite likely to be a very quickly changing and highly competitive world.
Why does being skeptical of very short timelines preclude our ability to do productive work on AI safety? Surely there are things we can be doing now to gain insight, build research/organizational capacity, etc. that will at least help somewhat, no? (And it seems to me like “probably helps somewhat” is enough when it comes to existential risk.)
I agree, though I tend to think the costs associated with failing to catch deception will be high enough that any major team will be likely to bear the costs. If some team of researchers doesn’t put in the effort, a disaster would likely occur that would be sub-x-risk level, and this would set a precedent for safety standards.
In general, I think humans tend to be very risk averse when it comes to new technologies, though there are notable exceptions (such as during wartime).
A full solution to AI safety will necessarily be contingent on the architectures used to build AIs. If we don’t understand a whole lot about those architectures, this limits our abilities to do concrete work. I don’t find the argument entirely compelling because:
It seems reasonably likely that AGI will be built using more-or-less the deep learning paradigm, perhaps given a few insights, and therefore productive work can be done now, and
We can still start institutional work, and develop important theoretical insights.
But even given these qualifications, I estimate that the vast majority of productive work to make AIs safe will be completed when the AI systems are actually built, rather than before. It follows that most work during this pre-AGI period might miss important details and be less effective than we think.
I agree, which is why I spend a lot of my time reading and writing posts on Lesswrong about AI risk.
Do you think AI alignment researchers have not taken this into consideration already? For example, I’m pretty sure I’ve read arguments from Paul Christiano for why he is working on his approach even though we don’t know how AGI will be built. MIRI people have made such arguments too, I think.
I’m not claiming any sort of knock-down argument. I understand that individual researchers often have very thoughtful reasons for thinking that their approach will work. I just take the heuristic seriously that it is very difficult to predict the future, or to change the course of history in a predictable way. My understanding of past predictions of the future is that they have been more-or-less horrible, and so skepticism of any particular line of research is pretty much always warranted.
In case you think AI alignment researchers are unusually good at predicting the future, and you would put them in a different reference class, I will point out that the type of AI risk stuff people on Lesswrong talk about now is different in meaningful ways to the stuff that was talked about five or ten years ago on here.
To demonstrate, a common assumption was that in the absence of advanced AI architecture design, we could minimally assume that an AI would maximize a utility function, since a utility function is a useful abstraction that seems robust to architectural changes in our underlying AI designs or future insights. The last few years has seen many people here either rejecting this argument, or finding it to be vacuous, or underspecified as an argument. (I’m not taking a hard position; I’m merely pointing out that this shift has occurred.)
People also assumed that, in the absence of advanced AI architecture design, we could assume that an AI’s first priority would be to increase its own intelligence, prompting researchers to study stable recursive self-improvement. Again, the last few years has seen people here rejecting this argument, or concluding that it’s not a priority for research. (Once again, I’m not here to argue whether this specific shift was entirely justified.)
I suspect that even very reasonable sounding arguments of the type, “Well, we might not know what AI will look like, but minimally we can assume X, and X is a tractable line of research” will turn out to be suspicious in the end. That’s not to say that some of these arguments won’t be correct. Perhaps, if we’re very careful, we can find out which ones are correct. I just have a strong heuristic of assuming future cluelessness.
When you say “the last few years has seen many people here” for your 2nd/3rd paragraph, do you have any posts / authors in mind to illustrate?
I agree that there has been a shift in what people write about because the field grew (as Daniel Filan pointed out). However, I don’t remember reading anyone dismiss convergent instrumental goals such as increasing your own intelligence, or utility functions as a useful abstraction to think about agency.
In your thread with ofer, he asked what the difference was between using loss functions in neural nets vs. objective functions / utility functions, and I haven’t fully caught your opinion on that.
For the utility of talking about utility functions, see this rebuttal of an argument justifying the use of utility functions by appealing to the VNM-utility theorem, and a few more posts expanding the discussion. The CAIS paper argues that we shouldn’t model future AI as having a monolithic long-term utility function. But it’s by no means a settled debate.
For the rejection of stable self improvement as a research priority, Paul Christiano wrote a post in 2014 where he argued that stable recursive self improvement will be solved as a special case of reasoning under uncertainty. And again, the CAIS model proposes that technological progress will feed into itself (not unlike what already happens), rather than a monolithic agent improving itself.
I get the impression that very few people outside of MIRI work on studying stable recursive self improvement, though this might be because they think it’s not their comparative advantage.
There’s a difference between accepting something as a theoretical problem, and accepting that it’s a tractable research priority. I was arguing that the type of work we do right now might not be useful for future researchers, and so I wasn’t trying to say that these things didn’t exist. Rather, it’s not clear that productive work can be done on them right now. My evidence was that the way we think about these problems has changed over the years. Of course, you could say that the reason why the research focuses shifted is because we made progress, but I’d be skeptical about that hypothesis.
I don’t quite understand the question? It’s my understanding that I was disputing the notion that inner alignment should count as a “shift in arguments” for AI risk. I claimed that it was a refinement of the traditional arguments; more specifically, we decomposed the value alignment problem into two levels. I’m quite confused about what I’m missing here.
Thanks for all the references! I don’t currently have much time to read all of it right now so I can’t really engage with the specific arguments for the rejection of using utility functions/studying recursive self-improvement.
I essentially agree with most of what you wrote. There is maybe a slight disagreement in how you framed (not what you meant) how research focus shifted since 2014.
I see Superintelligence as essentially saying “hey, there is pb A. And even if we solve A, then we might also have B. And given C and D, there might be E.” Now that the field is more mature and we have many more researchers getting paid to work on these problems, the arguments have become much more goal focused. Now people are saying “I’m going to make progress on sub-problem X, by publishing a paper on Y. And working on Z is not cost-effective, so I’m not going to work on it given humanity’s current time constraints.”
These approaches are often grouped as “long-term problems-focused” and “making tractable progress now focused”. In the first group you have Yudkowsky 2010, Bostrom 2014, MIRI’s current research and maybe CAIS. In the second one you have current CHAI/FHI/OpenAI/DeepMind/Ought papers.
Your original framing can be interpreted as “after proving some mathematical theorems, people rejected the main arguments of Superintelligence and now most of the community agrees that working on X, Y and Z is tractable but A, B and C are more controversial”.
I think a more nuanced and precise framing would be: “In Superintelligence Bostrom exhaustively lays out the risks associated with advanced AI. A short portion of the book is dedicated to the problems people are working on right now. Indeed, people stopped working on the other problems (largest portion of the book) because 1) there hasn’t been really productive work on them, 2) some rebuttals have been written online giving convincing arguments that those pbs are not tractable anyway, and 3) there are now well-funded research organizations with incentives to make tangible progress on those pbs.”
In your last framing, you presented precise papers/rebuttals (thanks again!) for 2), and I think rebuttals are a great reason to stop working on a pb, but I think they’re not the only reason and not the real reason people stopped working on those pbs. To be fair, I think 1) can be explained by many more factors than “it’s theoretically impossible to make progress on those pbs”. It can be that the research mindset required to work on these pbs is less socially/intellectually validating or requires much more theoretical approaches, so will be off-putting/tiresome to most recent grads that enter the field. I also think that AI Safety is now much more intertwined with evidence-based approaches such as Effective Altruism than it was in 2014, which explains 3), so people start presenting their research as “partial solutions to the pb. of AI Safety” or “research agenda”.
To be clear, I’m not criticizing the current shift in research. I think it’s productive for the field, both in the short term and long term. To give a bit more personal context, I started getting interested in AI Safety after reading Bostrom and have always been more interested in the “finding problems” approach. I went to FHI to work on AI Safety because I was super interested in finding new pbs related to the treacherous turn. It’s now almost taboo to say that we’re working on pbs that are sub-optimally minimizing AI risk, but the real reason that pushed me to think about those pbs was because they were both important and interesting. The pb. with the current “shift in framing” is that it’s making it socially unacceptable for people to think/work on more long-term pbs where there is more variance in research productivity.
Sorry about that. I thought there was some link to our discussion about utility functions but I misunderstood.
EDIT: I also wanted to mention that the number of pages in a book doesn’t account for how important the author thinks the pb. is (Bostrom even comments on this in the postface of his book). Again, the book is mostly about saying “here are all the pbs”, not “these are the tractable pbs we should start working on, and we should dedicate research resources proportionally to the amount of pages I talk about it in the book”.