Lots I disagree with here, so let’s go through the list.
There are no pivotal weak acts.
Strong disagree.
EY and I don’t seem to agree that “nuke every semiconductor fab” is a weakly pivotal act (since I think AI is hardware-limited and he thinks it is awaiting a clever algorithm). But I think even “build nanobots that melt every GPU” could be built using an AI that is aligned in the “less than 50% chance of murdering us all” sense. For example, we could simulate a bunch of human-level scientists trying to build nanobots and also checking each-other’s work.
On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions.
Nope. I think that you could build a useful AI (e.g. the hive of scientists) without doing any out-of-distribution stuff.
there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment
I am significantly more optimistic about explainable AI than EY.
There is no analogous truth about there being a simple core of alignment
Corrigibility is anti-natural to consequentialist reasoning
Roll to disbelief. Cooperation is a natural equilibrium in many games.
you can’t rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about
Sure you can. Just train an AI that “wants” to be honest. This probably means training an AI with the objective function “accurately predict reality” and then using it to do other things (like make paperclips) rather than training it with an objective function “make paperclips”.
Coordination schemes between superintelligences are not things that humans can participate in
I don’t think this is as relevant as EY does. Even if it’s true that unaugmented humans are basically irrelevant to an economy of superintelligent AIs, that doesn’t mean we can’t have a future where augmented or tool-AI assisted humans can have meaningful influence.
Any system of sufficiently intelligent agents can probably behave as a single agent, even if you imagine you’re playing them against each other
I believe there is an intermediate level of AI between “utterly useless” and “immediately solves the acausal trading problem and begins coordinating perfectly against humans”. This window may be rather wide.
What makes an air conditioner ‘magic’ from the perspective of say the thirteenth century, is that even if you correctly show them the design of the air conditioner in advance, they won’t be able to understand from seeing that design why the air comes out cold
I’m virtually certain I could explain to Aristotle or DaVinci how an air-conditioner works.
There’s a pattern that’s played out quite often, over all the times the Earth has spun around the Sun, in which some bright-eyed young scientist, young engineer, young entrepreneur, proceeds in full bright-eyed optimism to challenge some problem that turns out to be really quite difficult. Very often the cynical old veterans of the field try to warn them about this, and the bright-eyed youngsters don’t listen, because, like, who wants to hear about all that stuff, they want to go solve the problem!
There’s also a pattern where the venerable scientist is proven wrong by the young scientist too foolish to know what they are doing is impossible.
This situation you see when you look around you is not what a surviving world looks like
Currently Metaculus estimates 55% chance for “Will there be a positive transition to a world with radically smarter-than-human artificial intelligence?”. Admitted I would like this to be higher, but at the minimum this is what a world that “might survive” looks like. I have no particular reason to trust EY vs Metaculus.
I suspect EY and I both agree that if you take existing Reinforcement Learning Architectures, write down the best utility function humans can think of, and then turn the dial up to 11, bad things will happen. EY seems to believe this is a huge problem because of his belief that “there is no weak pivotal act”. I think this should be taken as a strong warning to not do that. Rather than scaling architectures that are inherently dangerous, we should focus on making use of architectures that are naturally safe. For example, EY and I both agree that GPT-N is likely to be safe. EY simply disagrees with the claim that it might be useful.
EY and I probably also agree that Facebook/Baidu do not have the world’s best interest at heart (and are not taking alignment seriously enough or at all). Hence it is important that people who care about Alignment gain a decisive lead over these efforts. To me, this logically means that people interested in Alignment should be doing more capabilities research. To EY, this means that alignment focused institutions need to be using more secrecy. I’m not utterly opposed to keeping pure-capabilities advancements secret, but if there is a significant overlap between capabilities and alignment, then we need to be publishing the alignment-relevant bits so that we can cooperate (and hopefully so that Facebook can incorporate them too).
And for completeness, here’s a bunch of specific claims by EY I agree with
AGI will not be upper-bounded by human ability or human learning speed. Things much smarter than human would be able to learn from less evidence than humans require
I think the people who thought this stopped thinking this after move 37. I hope.
A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure
Strongly agree.
Losing a conflict with a high-powered cognitive system looks at least as deadly as “everybody on the face of the Earth suddenly falls over dead within the same second”
Strongly agree.
We need to get alignment right on the ‘first critical try’
Strongly agree.
We can’t just “decide not to build AGI”
Strongly agree
Running AGIs doing something pivotal are not passively safe
Agree. But I don’t think this means they are totally unworkable either.
Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn’t kill you
Agree. But I think the lesson here is “don’t use powerful AIs until you are sure they are aligned”.
Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level
Agree somewhat. But I don’t rule out that “cooperative” or “interesting” is a natural attractor.
Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously
Agree, conditional on our definition of fast. I think that within a year of training our first “smart human” AI, we can simulate “100 smart humans” using a similar compute budget. I don’t think Foom takes us from “human level AI” to “smarter than all humans AI” in a few minutes simply be rewriting code.
outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction
Agree. This is why I am skeptical of utility-functions in general as a method for aligning AI.
Human raters make systematic errors—regular, compactly describable, predictable errors
Duh.
The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI.
I am really not very optimistic about CEV.
A powerful AI searches parts of the option space we don’t, and we can’t foresee all its options.
Yes.
This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents
Agree. But I don’t think you need to make an AI that imitates humans in order to make an AI that is useful. For example, Codex allows me to write code significantly (2-5x) faster, despite frequently making dumb mistakes.
The AI does not think like you do
Yes.
AI-boxing can only work on relatively weak AGIs; the human operators are not secure systems.
Mostly agree. I think there exist architectures of AI that can be boxed.
You cannot just pay $5 million apiece to a bunch of legible geniuses from other fields and expect to get great alignment work out of them.
I think the best approach to funding AI safety is something like Fast Grants where we focus more on quantity than on “quality” since it is nearly impossible to identify who will succeed in advance.
For example, we could simulate a bunch of human-level scientists trying to build nanobots and also checking each-other’s work.
That is not passively safe, and therefore not weak. For now forget the inner workings of the idea: at the end of the process you get a design for nanobots that you have to build and deploy in order to do the pivotal act. So you are giving a system built by your AI the ability to act in the real world. So if you have not fully solved the alignment problem for this AI, you can’t be sure that the nanobot design is safe unless you are capable enough to understand the nanobots yourself without relying on explanations from the scientists.
And even if we look into the inner details of the idea: presumably each individual scientist-simulation is not aligned (if they are, then for that you need to have solved the alignment problem beforehand). So you have a bunch of unaligned human-level agents who want to escape, who can communicate among themselves (at the very least they need to be able to share the nanobot designs with each other for criticism).
You’d need to be extremely paranoid and scrutinize each communication between the scientist-simulations to prevent them from coordinating against you and bypassing the review system. Which means having actual humans between the scientists, which even if it works must slow things down so much that the simulated scientists probably can’t even design the nanobots on time.
Nope. I think that you could build a useful AI (e.g. the hive of scientists) without doing any out-of-distribution stuff.
I guess this is true, but only because the individual scientist AI that you train is only human-level (so the training is safe), and then you amplify it to superhuman level with many copies. If you train a powerful AI directly then there must be such a distributional shift (unless you just don’t care about making the training safe, in which case you die during the training).
Roll to disbelief. Cooperation is a natural equilibrium in many games.
Cooperation and corrigibility are very different things. Arguably, corrigibility is being indifferent with operators defecting against you. It’s forcing the agent to behave like CooperateBot with the operators, even when the operators visibly want to destroy it. This strategy does not arise as a natural equilibrium in multi-agent games.
Sure you can. Just train an AI that “wants” to be honest. This probably means training an AI with the objective function “accurately predict reality”
If this we knew how to do this then it would indeed solve point 31 for this specific AI and actually be pretty useful. But the reason we have ELK as an unsolved problem going around is precisely that we don’t know any way of doing that.
How do you know that an AI trained to accurately predict reality actually does that, instead of “accurately predict reality if it’s less than 99% sure it can take over the world, and take over the world otherwise”. If you have to rely on behavioral inspection and can’t directly read the AI’s mind, then your only chance of distinguishing between the two is misleading the AI into thinking that it can take over the world and observing it as it attempts to do so, which doesn’t scale as the AI becomes more powerful.
I’m virtually certain I could explain to Aristotle or DaVinci how an air-conditioner works.
Yes, but this is not the point. The point is that if you just show them the design, they would not by themselves understand or predict beforehand that cold air will come out. You’d have to also provide them with an explanation of thermodynamics and how the air conditioner exploits its laws. And I’m quite confident that you could also convince Aristotle or DaVinci that the air conditioner works by concentrating and releasing phlogiston, and therefore the air will come out hot.
I think I mostly agree with you on the other points.
EY and I don’t seem to agree that “nuke every semiconductor fab” is a weakly pivotal act (since I think AI is hardware-limited and he thinks it is awaiting a clever algorithm).
Note that the difficulty in “nuke every semiconductor fab” is in “acquire the nukes and use them”, not in “googling the address of semiconductor fabs”. It seems to me like nuclear nonproliferation is one of the few things that actually has international collaboration with teeth, such that doing this on your own is extremely challenging, and convincing institutions that already have nuclear weapons to use them on semiconductor fabs also seems extremely challenging. [And if you could convince them to do that, can’t you convince them to smash the fabs with hammers, or detain the people with relevant experience on some beautiful tropical island instead of murdering them and thousands of innocent bystanders?]
“We could simulate a bunch of human-level scientists trying to build nanobots.” This idea seems far-fetched:
If it was easy to create nanotechnology by just hiring a bunch of human-level scientists, we could just do that directly, without using AI at all.
Perhaps we could simulate thousands and thousands of human-level intelligences (although of course these would not be remotely human-like intelligences; they would be part of a deeply alien AI system) at accelerated speeds. But this seems like it would probably be more hardware-intensive than just turning up the dial and running a single superintelligence. In other words, this proposal seems to have a very high “alignment tax”. And even after paying that hefty tax, I’d still be worried about alignment problems if I was simulating thousands of alien intelligences at super-speed!
Besides all the hardware you’d need, wouldn’t this be very complicated to implement on the software side, with not much overlap with today’s AI designs?
Has anyone done a serious analysis of how much semiconductor capacity could be destroyed using things like cruise missiles + nationalizing and shutting down supercomputers? I would be interested to know if this is truly a path towards disabling like 90% of the world’s useful-to-AI-research compute, or if the number is much smaller because there is too much random GPU capacity out there in the wild even when you commandeer TSMC fabs and AWS datacenters.
Lots I disagree with here, so let’s go through the list.
Strong disagree.
EY and I don’t seem to agree that “nuke every semiconductor fab” is a weakly pivotal act (since I think AI is hardware-limited and he thinks it is awaiting a clever algorithm). But I think even “build nanobots that melt every GPU” could be built using an AI that is aligned in the “less than 50% chance of murdering us all” sense. For example, we could simulate a bunch of human-level scientists trying to build nanobots and also checking each-other’s work.
Nope. I think that you could build a useful AI (e.g. the hive of scientists) without doing any out-of-distribution stuff.
I am significantly more optimistic about explainable AI than EY.
I do not consider this at all obvious.
Roll to disbelief. Cooperation is a natural equilibrium in many games.
Sure you can. Just train an AI that “wants” to be honest. This probably means training an AI with the objective function “accurately predict reality” and then using it to do other things (like make paperclips) rather than training it with an objective function “make paperclips”.
I don’t think this is as relevant as EY does. Even if it’s true that unaugmented humans are basically irrelevant to an economy of superintelligent AIs, that doesn’t mean we can’t have a future where augmented or tool-AI assisted humans can have meaningful influence.
I believe there is an intermediate level of AI between “utterly useless” and “immediately solves the acausal trading problem and begins coordinating perfectly against humans”. This window may be rather wide.
I’m virtually certain I could explain to Aristotle or DaVinci how an air-conditioner works.
There’s also a pattern where the venerable scientist is proven wrong by the young scientist too foolish to know what they are doing is impossible.
There is at least one plan.
Currently Metaculus estimates 55% chance for “Will there be a positive transition to a world with radically smarter-than-human artificial intelligence?”. Admitted I would like this to be higher, but at the minimum this is what a world that “might survive” looks like. I have no particular reason to trust EY vs Metaculus.
I suspect EY and I both agree that if you take existing Reinforcement Learning Architectures, write down the best utility function humans can think of, and then turn the dial up to 11, bad things will happen. EY seems to believe this is a huge problem because of his belief that “there is no weak pivotal act”. I think this should be taken as a strong warning to not do that. Rather than scaling architectures that are inherently dangerous, we should focus on making use of architectures that are naturally safe. For example, EY and I both agree that GPT-N is likely to be safe. EY simply disagrees with the claim that it might be useful.
EY and I probably also agree that Facebook/Baidu do not have the world’s best interest at heart (and are not taking alignment seriously enough or at all). Hence it is important that people who care about Alignment gain a decisive lead over these efforts. To me, this logically means that people interested in Alignment should be doing more capabilities research. To EY, this means that alignment focused institutions need to be using more secrecy. I’m not utterly opposed to keeping pure-capabilities advancements secret, but if there is a significant overlap between capabilities and alignment, then we need to be publishing the alignment-relevant bits so that we can cooperate (and hopefully so that Facebook can incorporate them too).
And for completeness, here’s a bunch of specific claims by EY I agree with
I think the people who thought this stopped thinking this after move 37. I hope.
Strongly agree.
Strongly agree.
Strongly agree.
Strongly agree
Agree. But I don’t think this means they are totally unworkable either.
Agree. But I think the lesson here is “don’t use powerful AIs until you are sure they are aligned”.
Agree somewhat. But I don’t rule out that “cooperative” or “interesting” is a natural attractor.
Agree, conditional on our definition of fast. I think that within a year of training our first “smart human” AI, we can simulate “100 smart humans” using a similar compute budget. I don’t think Foom takes us from “human level AI” to “smarter than all humans AI” in a few minutes simply be rewriting code.
Agree. This is why I am skeptical of utility-functions in general as a method for aligning AI.
Duh.
I am really not very optimistic about CEV.
Yes.
Agree. But I don’t think you need to make an AI that imitates humans in order to make an AI that is useful. For example, Codex allows me to write code significantly (2-5x) faster, despite frequently making dumb mistakes.
Yes.
Mostly agree. I think there exist architectures of AI that can be boxed.
I think the best approach to funding AI safety is something like Fast Grants where we focus more on quantity than on “quality” since it is nearly impossible to identify who will succeed in advance.
That is not passively safe, and therefore not weak. For now forget the inner workings of the idea: at the end of the process you get a design for nanobots that you have to build and deploy in order to do the pivotal act. So you are giving a system built by your AI the ability to act in the real world. So if you have not fully solved the alignment problem for this AI, you can’t be sure that the nanobot design is safe unless you are capable enough to understand the nanobots yourself without relying on explanations from the scientists.
And even if we look into the inner details of the idea: presumably each individual scientist-simulation is not aligned (if they are, then for that you need to have solved the alignment problem beforehand). So you have a bunch of unaligned human-level agents who want to escape, who can communicate among themselves (at the very least they need to be able to share the nanobot designs with each other for criticism).
You’d need to be extremely paranoid and scrutinize each communication between the scientist-simulations to prevent them from coordinating against you and bypassing the review system. Which means having actual humans between the scientists, which even if it works must slow things down so much that the simulated scientists probably can’t even design the nanobots on time.
I guess this is true, but only because the individual scientist AI that you train is only human-level (so the training is safe), and then you amplify it to superhuman level with many copies. If you train a powerful AI directly then there must be such a distributional shift (unless you just don’t care about making the training safe, in which case you die during the training).
Cooperation and corrigibility are very different things. Arguably, corrigibility is being indifferent with operators defecting against you. It’s forcing the agent to behave like CooperateBot with the operators, even when the operators visibly want to destroy it. This strategy does not arise as a natural equilibrium in multi-agent games.
If this we knew how to do this then it would indeed solve point 31 for this specific AI and actually be pretty useful. But the reason we have ELK as an unsolved problem going around is precisely that we don’t know any way of doing that.
How do you know that an AI trained to accurately predict reality actually does that, instead of “accurately predict reality if it’s less than 99% sure it can take over the world, and take over the world otherwise”. If you have to rely on behavioral inspection and can’t directly read the AI’s mind, then your only chance of distinguishing between the two is misleading the AI into thinking that it can take over the world and observing it as it attempts to do so, which doesn’t scale as the AI becomes more powerful.
Yes, but this is not the point. The point is that if you just show them the design, they would not by themselves understand or predict beforehand that cold air will come out. You’d have to also provide them with an explanation of thermodynamics and how the air conditioner exploits its laws. And I’m quite confident that you could also convince Aristotle or DaVinci that the air conditioner works by concentrating and releasing phlogiston, and therefore the air will come out hot.
I think I mostly agree with you on the other points.
Note that the difficulty in “nuke every semiconductor fab” is in “acquire the nukes and use them”, not in “googling the address of semiconductor fabs”. It seems to me like nuclear nonproliferation is one of the few things that actually has international collaboration with teeth, such that doing this on your own is extremely challenging, and convincing institutions that already have nuclear weapons to use them on semiconductor fabs also seems extremely challenging. [And if you could convince them to do that, can’t you convince them to smash the fabs with hammers, or detain the people with relevant experience on some beautiful tropical island instead of murdering them and thousands of innocent bystanders?]
I think there might be a terminology mistake here—pivotal acts are actions that will make a large positive difference a billion years later.
This comment makes many distinct points, so I’m confused why it currently has −13 agreement karma. Do people really disagree with all of these points?
“We could simulate a bunch of human-level scientists trying to build nanobots.”
This idea seems far-fetched:
If it was easy to create nanotechnology by just hiring a bunch of human-level scientists, we could just do that directly, without using AI at all.
Perhaps we could simulate thousands and thousands of human-level intelligences (although of course these would not be remotely human-like intelligences; they would be part of a deeply alien AI system) at accelerated speeds. But this seems like it would probably be more hardware-intensive than just turning up the dial and running a single superintelligence. In other words, this proposal seems to have a very high “alignment tax”. And even after paying that hefty tax, I’d still be worried about alignment problems if I was simulating thousands of alien intelligences at super-speed!
Besides all the hardware you’d need, wouldn’t this be very complicated to implement on the software side, with not much overlap with today’s AI designs?
Has anyone done a serious analysis of how much semiconductor capacity could be destroyed using things like cruise missiles + nationalizing and shutting down supercomputers? I would be interested to know if this is truly a path towards disabling like 90% of the world’s useful-to-AI-research compute, or if the number is much smaller because there is too much random GPU capacity out there in the wild even when you commandeer TSMC fabs and AWS datacenters.