[ETA: It’s unfortunate I used the word “optimism” in my comment, since my primary disagreement is whether the traditional sources of AI risk are compelling. I’m pessimistic in a sense, since I think by default our future civilization’s values will be quite different from mine in important ways.]
My opinion is that AI is likely to be an important technology whose effects will largely determine the shape of our future civilization and the outlook for humanity. And given that its impact will be so large, AI will also largely determine whether our values survive or go extinct. That said, it’s difficult to understand the threat AI poses to our values without a specific threat model. I appreciate the effort to find specific ways that AI can go wrong, but I currently think:
We are probably not close enough to powerful AI to have a good understanding of the primary dynamics of an AI takeoff, and therefore of what type of work will help our values survive one.
If our values do go extinct, it will probably happen in some unavoidable manner unrelated to the typical sources of AI risk. In other words, general value drift and game-theoretic incentives will likely do more to destroy the value of the long-term future than technical AI errors.
The argument that continuous takeoff makes AI safe seems robust to most specific items on your list, though I can see several ways that the argument fails.
If AI does go wrong in one of the ways you have identified, it seems difficult to predict which one (though we can do our best to guess). It seems even harder to do productive work, since I’m skeptical of very short timelines.
Historically, our models of AI development have been notoriously poor. Ask someone from 10 years ago what they thought AI would look like today, and it seems unlikely they would have predicted deep learning in a way that would have been useful for making it safer. I suspect that unless powerful AI arrives very soon, it will be very hard to do specific technical work now that makes it safer.
It’s unfortunate I used the word “optimism” in my comment, since my primary disagreement is whether the traditional sources of AI risk are compelling.
May I beseech you to be more careful about using “optimism” and words like it in the future? I’m really worried about strategy researchers and decision makers getting the wrong impression from AI safety researchers about how hard the overall AI risk problem is. For some reason I keep seeing people say that they’re “optimistic” (or other words to that effect) when they mean optimistic about some sub-problem of AI risk rather than AI risk as a whole, without making that clear. In many cases it’s pretty predictable that people outside technical AI safety research (or even inside, as in this case) will misinterpret that as optimism about AI risk as a whole.
Note that our inability to know ahead of time what alignment work would be useful increases, rather than decreases, the absolute level of risk; thus, it increases rather than decreases the risk metrics (e.g. the probability of humans being wiped out) that FHI estimated.
It could still be that the absolute level of risk is low, even after taking this into account. I concede that estimating risks like these is very difficult.
The argument that continuous takeoff makes AI safe seems robust to most specific items on your list, though I can see several ways that the argument fails.
I feel like this depends on a whole bunch of contingent facts regarding our ability to accurately diagnose and correct what could be very pernicious problems such as deceptive alignment amidst what seems quite likely to be a very quickly changing and highly competitive world.
It seems even harder to do productive work, since I’m skeptical of very short timelines.
Why does being skeptical of very short timelines preclude our ability to do productive work on AI safety? Surely there are things we can be doing now to gain insight, build research/organizational capacity, etc. that will at least help somewhat, no? (And it seems to me like “probably helps somewhat” is enough when it comes to existential risk.)
I feel like this depends on a whole bunch of contingent facts regarding our ability to accurately diagnose and correct what could be very pernicious problems such as deceptive alignment amidst what seems quite likely to be a very quickly changing and highly competitive world.
I agree, though I tend to think the costs of failing to catch deception will be high enough that any major team will be willing to pay the cost of catching it. If some team of researchers doesn’t put in the effort, a disaster would likely occur at a sub-x-risk level, and this would set a precedent for safety standards.
In general, I think humans tend to be very risk averse when it comes to new technologies, though there are notable exceptions (such as during wartime).
Why does being skeptical of very short timelines preclude our ability to do productive work on AI safety?
A full solution to AI safety will necessarily be contingent on the architectures used to build AIs. If we don’t understand a whole lot about those architectures, this limits our ability to do concrete work. I don’t find this argument entirely compelling, though, because:
It seems reasonably likely that AGI will be built using more-or-less the deep learning paradigm, perhaps with a few additional insights, and therefore productive work can be done now, and
We can still start institutional work, and develop important theoretical insights.
But even given these qualifications, I estimate that the vast majority of the productive work to make AIs safe will be done once the AI systems are actually being built, rather than before. It follows that most work during this pre-AGI period might miss important details and be less effective than we think.
And it seems to me like “probably helps somewhat” is enough when it comes to existential risk
I agree, which is why I spend a lot of my time reading and writing posts on Lesswrong about AI risk.
It follows that most work during this pre-AGI period might miss important details and be less effective than we think.
Do you think AI alignment researchers have not taken this into consideration already? For example, I’m pretty sure I’ve read arguments from Paul Christiano for why he is working on his approach even though we don’t know how AGI will be built. MIRI people have made such arguments too, I think.
I’m not claiming any sort of knock-down argument. I understand that individual researchers often have very thoughtful reasons for thinking that their approach will work. I just take the heuristic seriously that it is very difficult to predict the future, or to change the course of history in a predictable way. My understanding of past predictions of the future is that they have been more-or-less horrible, and so skepticism of any particular line of research is pretty much always warranted.
In case you think AI alignment researchers are unusually good at predicting the future, and you would put them in a different reference class, I will point out that the type of AI risk stuff people on Lesswrong talk about now differs in meaningful ways from the stuff that was talked about here five or ten years ago.
To demonstrate, a common assumption was that, in the absence of advanced AI architecture designs, we could minimally assume that an AI would maximize a utility function, since a utility function is a useful abstraction that seems robust to architectural changes in our underlying AI designs or to future insights. The last few years have seen many people here either rejecting this argument or finding it vacuous or underspecified. (I’m not taking a hard position; I’m merely pointing out that this shift has occurred.)
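For concreteness, the abstraction in question is, roughly, the expected-utility-maximizer model; this is my gloss rather than anything from the original discussion:

$$a^{*} \;=\; \arg\max_{a \in \mathcal{A}} \; \mathbb{E}_{s \sim P(\cdot \mid a)}\big[\,U(s)\,\big]$$

that is, whatever its architecture, the agent is modeled as choosing the action that maximizes the expected value of some fixed utility function $U$ over outcomes $s$.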
People also assumed that, in the absence of advanced AI architecture designs, we could assume that an AI’s first priority would be to increase its own intelligence, prompting researchers to study stable recursive self-improvement. Again, the last few years have seen people here rejecting this argument, or concluding that it’s not a priority for research. (Once again, I’m not here to argue whether this specific shift was entirely justified.)
I suspect that even very reasonable-sounding arguments of the type, “Well, we might not know what AI will look like, but minimally we can assume X, and X is a tractable line of research,” will turn out to be suspect in the end. That’s not to say that some of these arguments won’t be correct. Perhaps, if we’re very careful, we can find out which ones are correct. I just have a strong heuristic of assuming future cluelessness.
When you say “the last few years has seen many people here” for your 2nd/3rd paragraph, do you have any posts / authors in mind to illustrate?
I agree that there has been a shift in what people write about because the field grew (as Daniel Filan pointed out). However, I don’t remember reading anyone dismiss either convergent instrumental goals (such as increasing your own intelligence) or utility functions as useful abstractions for thinking about agency.
In your thread with ofer, he asked what the difference was between using loss functions in neural nets and objective functions / utility functions, and I haven’t fully caught your opinion on that.
When you say “the last few years has seen many people here” for your 2nd/3rd paragraph, do you have any posts / authors in mind to illustrate?
For the utility of talking about utility functions, see this rebuttal of an argument that justified the use of utility functions by appealing to the VNM utility theorem, and a few more posts expanding the discussion. The CAIS paper argues that we shouldn’t model future AI as having a monolithic long-term utility function. But it’s by no means a settled debate.
For the rejection of stable self-improvement as a research priority, Paul Christiano wrote a post in 2014 where he argued that stable recursive self-improvement will be solved as a special case of reasoning under uncertainty. And again, the CAIS model proposes that technological progress will feed into itself (not unlike what already happens), rather than a monolithic agent improving itself.
I get the impression that very few people outside of MIRI work on stable recursive self-improvement, though this might be because they think it’s not their comparative advantage.
I agree that there has been a shift in what people write about because the field grew (as Daniel Filan pointed out). However, I don’t remember reading anyone dismiss either convergent instrumental goals (such as increasing your own intelligence) or utility functions as useful abstractions for thinking about agency.
There’s a difference between accepting something as a theoretical problem and accepting that it’s a tractable research priority. I was arguing that the type of work we do right now might not be useful for future researchers; I wasn’t trying to say that these problems don’t exist. Rather, it’s not clear that productive work can be done on them right now. My evidence was that the way we think about these problems has changed over the years. Of course, you could say that the reason the research focus shifted is that we made progress, but I’d be skeptical of that hypothesis.
In your thread with ofer, he asked what the difference was between using loss functions in neural nets and objective functions / utility functions, and I haven’t fully caught your opinion on that.
I don’t quite understand the question? It’s my understanding that I was disputing the notion that inner alignment should count as a “shift in arguments” for AI risk. I claimed that it was a refinement of the traditional arguments; more specifically, we decomposed the value alignment problem into two levels. I’m quite confused about what I’m missing here.
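As an aside for readers, here is a minimal sketch of the distinction between a training loss and a utility function that the decomposition above trades on; this is my own toy illustration (with hypothetical stand-in data and a trivial linear model), not something from the thread:

```python
import numpy as np

# Outer objective: the loss function the designers write down and the
# training process explicitly minimizes (here, mean squared error on a
# toy linear model).
def outer_loss(weights, inputs, targets):
    predictions = inputs @ weights
    return np.mean((predictions - targets) ** 2)

def train(inputs, targets, steps=500, lr=0.1):
    """Plain gradient descent on the outer loss."""
    weights = np.zeros(inputs.shape[1])
    for _ in range(steps):
        grad = 2 * inputs.T @ (inputs @ weights - targets) / len(targets)
        weights -= lr * grad
    return weights

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
w = train(X, y)
print("outer loss after training:", outer_loss(w, X, y))

# The two levels of the decomposition: the outer question is whether this
# loss captures what we actually want; the inner question is whether the
# trained artifact itself pursues an objective matching the loss it was
# trained on (a linear model has no inner objective to speak of, but a
# sufficiently capable learned optimizer might).
```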
Thanks for all the references! I don’t have much time to read all of them right now, so I can’t really engage with the specific arguments for rejecting the use of utility functions or the study of recursive self-improvement.
I essentially agree with most of what you wrote. There is maybe a slight disagreement about how you framed (not what you meant) the shift in research focus since 2014.
I see Superintelligence as essentially saying “hey, there is problem A. And even if we solve A, then we might also have B. And given C and D, there might be E.” Now that the field is more mature and we have many more researchers getting paid to work on these problems, the arguments have become much more goal-focused. Now people are saying “I’m going to make progress on sub-problem X by publishing a paper on Y. And working on Z is not cost-effective, so I’m not going to work on it given humanity’s current time constraints.”
These approaches are often grouped as “focused on long-term problems” and “focused on making tractable progress now”. In the first group you have Yudkowsky 2010, Bostrom 2014, MIRI’s current research, and maybe CAIS. In the second you have current CHAI/FHI/OpenAI/DeepMind/Ought papers.
Your original framing can be interpreted as “after proving some mathematical theorems, people rejected the main arguments of Superintelligence, and now most of the community agrees that working on X, Y, and Z is tractable while A, B, and C are more controversial”.
I think a more nuanced and precise framing would be: “In Superintelligence, Bostrom exhaustively lays out the risks associated with advanced AI. A short portion of the book is dedicated to the problems we are working on right now. Indeed, people stopped working on the other problems (the largest portion of the book) because 1) there hasn’t been much productive work on them, 2) some rebuttals have been written online giving convincing arguments that those problems are not tractable anyway, and 3) there are now well-funded research organizations with incentives to make tangible progress on those problems.”
In your last framing, you presented precise papers/rebuttals (thanks again!) for 2), and I think rebuttals are a great reason to stop working on a problem, but I think they’re not the only reason, and not the real reason, people stopped working on those problems. To be fair, I think 1) can be explained by many more factors than “it’s theoretically impossible to make progress on those problems”. It could be that the research mindset required to work on them is less socially/intellectually validating, or requires much more theoretical approaches, and so will be off-putting/tiresome to most recent grads entering the field. I also think that AI Safety is now much more intertwined with evidence-based approaches such as Effective Altruism than it was in 2014, which explains 3), so people have started presenting their research as “partial solutions to the problem of AI Safety” or as a “research agenda”.
To be clear, I’m not criticizing the current shift in research. I think it’s productive for the field, both in the short term and the long term. To give a bit more personal context, I started getting interested in AI Safety after reading Bostrom and have always been more interested in the “finding problems” approach. I went to FHI to work on AI Safety because I was super interested in finding new problems related to the treacherous turn. It’s now almost taboo to say that we’re working on problems that are sub-optimal for minimizing AI risk, but the real reason that pushed me to think about those problems was that they were both important and interesting. The problem with the current “shift in framing” is that it’s making it socially unacceptable for people to think about or work on longer-term problems where there is more variance in research productivity.
I don’t quite understand the question?
Sorry about that. I thought there was some link to our discussion about utility functions but I misunderstood.
EDIT: I also wanted to mention that the number of pages in a book doesn’t reflect how important the author thinks a problem is (Bostrom even comments on this in the postface of his book). Again, the book is mostly about saying “here are all the problems”, not “these are the tractable problems we should start working on, and we should dedicate research resources proportionally to the number of pages I spend on each in the book”.
I feel like you are drawing the wrong conclusion from the shift in arguments that has occurred. I would argue that what look like wrong ideas that ended up not contributing to future research could actually have been quite necessary for progressing the field’s understanding as a whole. That is, maybe we really needed to engage with utility functions first before we could start breaking down that assumption—or maybe optimization daemons were a necessary step towards understanding mesa-optimization. Thus, I don’t think the shift in arguments at all justifies the conclusion that prior work wasn’t very helpful, as the prior work could have been necessary to achieve that very shift.
I think this justification for doing research now is valid. However, I think that as the systems developed further, researchers would be forced to shift their arguments for risk anyway, since the concrete ways the systems go wrong would become readily apparent. It’s possible that by then it would be “too late”, as the problems of safety would just be too hard and researchers would wish they had made conceptual progress sooner (though I’m pretty skeptical of this).