We asked researchers to estimate the probability of five AI risk scenarios, conditional on an existential catastrophe due to AI having occurred. There was also a catch-all “other scenarios” option.
Most of this community’s discussion about existential risk from AI focuses on scenarios involving one or more powerful, misaligned AI systems that take control of the future. This kind of concern is articulated most prominently in “Superintelligence” and “What failure looks like”, corresponding to three scenarios in our survey (the “Superintelligence” scenario, part 1 and part 2 of “What failure looks like”). The median respondent’s total (conditional) probability on these three scenarios was 50%, suggesting that this kind of concern about AI risk is still prevalent, but far from the only kind of risk that researchers are concerned about today.
The consensus among alignment researchers is that if AGI were developed right now, it would almost certainly be a negative. We simply don’t yet know how one would ensure a superintelligence was benevolent, even theoretically. The argument is more convincing if you agree with that assessment, because then the only ways to get benevolent AI are to delay the creation of AGI until we do have that understanding or to hope that the understanding arrives in time.
The argument also becomes more convincing if you agree with the assessment that advancements toward AGI aren’t going to be driven mostly by Moore’s law and are instead going to be concentrated in a few top research and development companies (DeepMind, Facebook AI labs, etc.). That’s my opinion and it’s also one I think is quite reasonable. Moore’s law is slowing down. It’s impossible for someone like me to predict how exactly AGI will be developed, but when I look at the advancements in AGI-adjacent capabilities research in the last ten years, it seems like the big wins have come from research and willingness to spend by the big players, not from increased GPU power. It’s not like we know of some algorithm right now that would give us AGI if we just had 3 OOMs more compute. The exception to that would maybe be full brain emulation, which obviously comes with reduced risk.
This isn’t true. [ETA: I linked the wrong survey before.]
I don’t see anything in the linked survey about a consensus view on total existential risk probability from AGI. The survey asked researchers to compare between different existential catastrophe scenarios, not about their total x-risk probability, and surely not about the probability of x-risk if AGI were developed now without further alignment research.
Maybe Carl meant to link this one
You’re right, my link was wrong, that one is a fine link.
You’re right, I linked the wrong survey!
We also don’t know what an actual superintelligence would look like; it could be that the lack of alignment understanding is an inevitable consequence of our capabilities understanding not being there yet.
For the nuclear analogy, you couldn’t design a safe nuclear power plant before you understood what nuclear fission and radioactivity were in the first place. As another example, InstructGPT is arguably a more “aligned” version of GPT-3, but it seems unlikely that anyone could have figured out how to better align language models before language models were invented.
Could you say more about this hypothesis? To me, it feels likely that you can get crazy capabilities from a black box that you don’t understand and so whose behavior/properties you can’t verify to be acceptable. It’s not like once we build a deceptive model we will know what deceptive computation looks like and how to disincentivize it (which is one way your nuclear analogy could translate).
It’s possible, also, that this is about takeoff speeds, and that you think it’s plausible that e.g. we can disincentivize deception by punishing the negative consequences it entails (if FOOM, we can’t, since we’d be dead).
Or maybe once our understanding of intelligent computation in general improves, it will also give us the tools for better identifying deceptive computation.
E.g. language models are already “deceptive” in a sense: asked something that it has no clue about, InstructGPT will happily come up with confident-sounding nonsense. When I shared that, multiple people pointed out that its answers sound like those of a student who’s taking an exam and is asked to write an essay about a topic they know nothing about, but who tries to fake their way through anyway (that is, they are trying to deceive the examiner). Thus, even if you are doing pure capabilities research and just want your AI system to give people accurate answers, it is already the case that you can see a system like InstructGPT “trying to deceive” people. If you are building a question-answering system, you want to build one that people can trust to give accurate answers rather than impressive-sounding bullshit, so you already have the incentive, as a capabilities researcher, to work on identifying and stopping such “deceptive” computations.
This means that the existence of InstructGPT gives you both 1) a concrete financial incentive to do research on identifying and stopping deceptive computation and 2) a real system that actually carries out something like deceptive computation, which you can experiment on and whose behavior you can make use of in trying to understand the phenomenon better. That second point is something that wouldn’t have been the case before our capabilities got to this point. And it might allow us to figure out something we wouldn’t have thought of before we had a system with this capability level to tinker with.
[ETA: I’m not that sure of the below argument]
Thanks for the example, but it still seems to me that this sort of thing won’t work for advanced AI. If you are familiar with the ELK report, you should be able to see why. [Spoiler below]
Even if you manage to learn the properties of what looks like deception to humans, and instill those properties into a loss function, then it seems like you are still more likely to get a system that tells you what humans think the truth is, avoiding what humans would be able to notice as deception, rather than telling you what the truth actually seems to be (given what it knows). The reason is that, as AI develops, programs that are capable of the former thing have constant complexity, but programs that are capable of the latter thing have complexity that grows with the complexity of the AI’s models of the world, and so you should expect that the former is favored by SGD. See this part of the ELK document for a more detailed description of this failure mode.
What sort of thing? I didn’t mean to propose any particular strategy for dealing with deception, I just meant to say that now OpenAI has 1) a reason to figure out deception and 2) a concrete instance of it that they can reason about and experiment with and which might help them better understand exactly what’s going on with it.
More generally, the whole possibility that I was describing was that it might be impossible for us to currently figure out the right strategy since we are missing some crucial piece of understanding. If I could give you an example of some particularly plausible-sounding strategy, then that strategy wouldn’t have been impossible to figure out with our current understanding, and I’d be undermining my whole argument. :-)
Rather, my example was meant to demonstrate that it has already happened that:
1) Progress in capabilities research gives us a new concrete example of how e.g. deception manifests in practice, which can be used to develop our understanding of it and develop new ideas for dealing with it.
2) Capabilities research reaches a point where even capabilities researchers have a natural reason to care about alignment, reducing the difference between “capabilities research” and “alignment research”.
Thus, our understanding and awareness of deception is likely to improve as we get closer to AGI, and by that time we will have already learned a lot about how deception manifests in simpler systems and how to deal with it, and maybe some of that will suggest principles that generalize to more powerful systems as well (even if a lot of it won’t).
It’s not that I’d put a particularly high probability on InstructGPT by itself leading to any important insights about either deception in particular or alignment in general. I-GPT is just an instance of something that seems likely to help us understand deception a little bit better. And given that, it seems reasonable to expect that further capabilities development will also give us small insights into various alignment-related questions, and maybe all those small insights will combine to give us the answers we need.
I mean to argue against your meta-strategy, which relies on obtaining relevant understanding about deception or alignment as we get larger models and see how they work. I agree that we will obtain some understanding, but it seems like we shouldn’t expect that understanding to be very close to sufficient for making AI go well (see my previous argument), and hence this is not a very promising meta-strategy.
I read your previous comment as suggesting that the improved understanding would mainly be used for pursuing a specific strategy for dealing with deception, namely “to learn the properties of what looks like deception to humans, and instill those properties into a loss function”. And it seemed to me that the problem you raised was specific to that particular strategy for dealing with deception, as opposed to something else that we might come up with?
I notice you didn’t mention EleutherAI.
Is that true? I thought that I had read Yudkowsky estimating that the probability of an AGI being unfriendly was 30% and that he was working to bring that 30% to 0%. If alignment researchers are convinced that this is more like 90+%, I agree that the argument becomes much more convincing.
I agree that these two questions are the cruxes in our positions.
That’s not Yudkowsky’s current position. https://www.lesswrong.com/posts/j9Q8bRmwCgXRYAgcJ/miri-announces-new-death-with-dignity-strategy describes the current view and in the comments, you see the views of other people at MIRI.
Yudkowsky is at 99+% that AGI right now would kill humanity.
April Fools!
Also, look at his bet with Bryan Caplan. He’s not joking.
And, also, Jesus, Everyone! Gradient Descent, is just, like, a deadly architecture. When I think about current architectures, they make Azathoth look smart and cuddly. There’s nothing friendly in there, even if we can get cool stuff out right now.
I don’t even know anymore what it is like to not see it this way. Does anyone have a good defense that current ML techniques can be stopped from having a deadly range of action?
Probably not; Eliezer addressed this in Q6 of the post, and while it’s a little ambiguous, I think Eliezer’s interactions with people who overwhelmingly took it seriously basically prove that it was serious; see in particular this interaction.
(But can we not downvote everyone into oblivion just for drawing the obvious conclusion without checking?)
I first heard Eliezer describe “dying with dignity” as a strategy in October 2021. I’m pretty sure he really means it.
I am not sure if he’s given another number explicitly, but I’m almost positive that Yudkowsky does not believe that. The probability that an AGI will end up being aligned “by default” is epsilon. Maybe he said at one point that there was a 30% chance that AGI will be what destroys the world if it’s developed, given alignment efforts, but that doesn’t sound to me like him either.
You should read the most recent post he made on the subject; it’s extraordinarily pessimistic about our future. He mentions multiple times that he thinks the probability of success here needs to be measured in log-odds. He very sarcastically uses April Fools at the end as a sort of ambiguity shield, but I don’t think anybody believes he isn’t being serious.
I’m not convinced that the odds mentioned in that post are meant to be taken literally, given it being an April Fools post, as opposed to just metaphorically and pointed in a direction.
He does also mention in that post that in the past he thought the odds were 50%, so perhaps I’m just remembering an old post from sometime between the 50% days and the epsilon days.
The most optimistic view I’ve heard recently is Vanessa Kosoy claiming a 30% chance of pulling it off. Not sure where consensus would be, but I read MIRI as ‘almost certain doom’. And I can’t speak for Eliezer, but if he ever thought that there was any hope that AGI might be aligned ‘by chance’, that thought is well concealed in everything he’s written for the last 15 years.
What he did once think was that it might be possible, with heroic effort, to solve the alignment problem.
There is no reason why my personal opinion should matter to you, but it is: “We are fucked beyond hope. There is no way out. The only question is when.”
I don’t know what his earliest writing may have said, but his writing in the past few years has definitely not assigned anywhere near as high a probability as 70% to friendly AI.
Even if he had, and it was true, do you think a 30% chance of killing every human in existence (and possibly all life in the future universe) is in any way a sane risk to take? Is it even sane at 1%?
I personally don’t think advancing a course of action that has even an estimated 1% chance of permanent extinction is sane. While I have been interested in artificial intelligence for decades and even started my PhD study in the field, I left it long ago and have quite deliberately not attempted to advance it in any way. If I could plausibly hinder further research, I would.
Even alignment research seems akin to theorizing a complicated way of poking a sleeping dragon-god prophesied to eat the world, in such a manner that it will wake up friendly instead. Rather than just not poking it at all and making sure that nobody else does either, regardless of how tempting the wealth in its hoard might be.
Even many of the comparatively good outcomes in which superintelligent AI faithfully serves human goals seem likely to be terrible in practice.
It’s worth it to poke the dragon with a stick if you have only a 28% chance of making it destroy the world while the person who’s planning to poke it tomorrow has a 30% chance. If we can prevent those people in a different way then great, but I’m not convinced that we can.
It doesn’t help at all in the case where the research you’re doing makes it significantly more likely that they will be equipped with stronger sticks and have greater confidence in poking the dragon tomorrow.