I’ll explain the issues I have with Eliezer Yudkowsky’s position in a nutshell:
1. Alignment is almost certainly easier than Yudkowsky or the extremely pessimistic people think. In particular, alignment is progressing far more than the extremely pessimistic models predicted.
2. I don’t think that slowing AI down instead of accelerating alignment is the best choice, primarily because I think we should mostly try to improve our chances on the current path rather than overturn it.
3. Given that I am an optimist on AI safety, I don’t really agree with Eliezer’s suggestions on how AI should be dealt with.
No. 1 would convince me more if we were seeing good alignment in the existing, still subhuman models. I honestly think there are multiple problems with alignment as a concept, but I also expect that there would be a significant difficulty jump when dealing with superhuman AI (for example, RLHF becomes entirely useless).
No. 2 I don’t quite understand—we care about the relative speed of the two, wouldn’t anything that says “let’s move people from capabilities to alignment research” like the moratorium asks do exactly what you say? You can’t arbitrarily speed up one field without affecting the other; human resources are limited, and you won’t get much mileage out of suddenly funding thousands of CS graduates to work on alignment while the veterans all advance capabilities. There are trade-offs at work. You need to actively rebalance your resources to avoid alignment just playing catch-up. It’s kind of an essential part of the work anyway; imagine NASA having hundreds of engineers developing gigantic rocket engines and a team of like four guys working on control systems. You can’t go to the Moon with raw power alone.
No. 3 depends heavily on what we think the consequences of misaligned AGI are. How dangerous Eliezer’s proposal is also depends on how much countries want to develop AGI. Consider e.g. biological weapons, where it’s probably fairly easy to get a consensus on “let’s just not do that” once everyone has realised how expensive, complex and ultimately still likely to blow up in your face they are. Conversely, if alignment is easy, there’s probably no reason why anyone would put such an agreement in place; but there needs to be some evidence of alignment being easy, and we need it soon. We can’t wait for the point where if we’re wrong the AI destroys the world to find out. That’s not called a plan, that’s just called being lucky, if it goes well.
No. 1 would convince me more if we were seeing good alignment in the existing, still subhuman models. I honestly think there are multiple problems with alignment as a concept, but I also expect that there would be a significant difficulty jump when dealing with superhuman AI (for example, RLHF becomes entirely useless).
The good news is we have better techniques than RLHF, which as you note is not particularly useful as an alignment technique.
On alignment not making sense as a concept, I agree somewhat. In the case of an AI and a human, I think that alignment is sensible, but as you scale up, it increasingly devolves into nonsense until you just force your own values.
No. 2 I don’t quite understand—we care about the relative speed of the two, wouldn’t anything that says “let’s move people from capabilities to alignment research” like the moratorium asks do exactly what you say?
Not exactly, though what I’m envisioning is that you can use a finetuned AI to do alignment research, and while there are capabilities externalities, they may be necessary depending on how much feedback we need in order to solve the alignment problem.
Also, I think part of the disagreement is that we are coming from different starting points on how much progress we have made on alignment.
No. 3 depends heavily on what we think the consequences of misaligned AGI are.
This is important, but there are other considerations.
For example, the most important thing to think about with Eliezer’s plan is which ethics/morals you use.
Next, you need to consider the consequences of both aligned and misaligned AGIs. And I suspect they net out to much smaller consequences for AGI once you sum up the positives and negatives, assuming a consequentialist ethical system.
The problem is that Eliezer’s treaty would basically imply that GPU production is enough to start a war. That is a more severe consequence than almost any treaty ever enacted, and it has very negative impacts under a consequentialist ethical system.
evidence of alignment being easy, and we need it soon. We can’t wait for the point where if we’re wrong the AI destroys the world to find out. That’s not called a plan, that’s just called being lucky, if it goes well.
I think this is another point of disagreement. While I wouldn’t like to test the success without dignity hypothesis, also known as luck, I do think there’s a non-trivial probability of that happening, compared to other alignment people who think the chance is effectively epsilon.
Under international law, counterfeiting another nation’s currency is considered an act of war and you can “legally” go to war to stop it… if you can bomb a printing press, is it ridiculous to say you can’t have a treaty that says you can bomb a GPU foundry?
(The two most recent cases of a government actually counterfeiting another nation’s currency were Nazi Germany during World War II which made counterfeit British pounds as part of its military strategy, and the “supernote” US dollar produced by North Korea.)
And in the end no one bombed North Korea, because saying something is an act of war doesn’t imply automatic war anyway, it’s subtler than that. Honestly in the hypothetical “no GPUs” world you’d probably have all the major States agreeing it’s a danger to them and begrudgingly cooperating on those lines, and the occasional pathetic attempt by some rogue actor with nothing to lose might be nipped in the bud via sanctions or threats. The big question really is how detectable such attempts would be compared to developing e.g. bacteriological weapons. But if tomorrow we found out that North Korea is developing Super Smallpox and plans to release it, what would we do? We are already in a similar world, we just don’t think much about it because we’ve gotten used to this being the precarious equilibrium we exist in.
Next, you need to consider the consequences of both aligned and misaligned AGIs. And I suspect they net out to much smaller consequences for AGI once you sum up the positives and negatives, assuming a consequentialist ethical system.
I find this sort of argument kinda nonsensical. Like, yes, it’s useful to conceptualise goods and harms as positives and negatives you balance, but in practice you can’t literally put numbers on them and run the sums, especially not with so many uncertainties at stake. It’s always possible to fudge the numbers and decide that some values are unimportant and some are super important and lo and behold, the calculation turns in your favour! In the end it’s no better than deontology or simply saying “I think this is good”; there is no point trying to vest it with a semblance of objectivity that just isn’t there. I am a consequentialist and I think that overall AGI is on the net probably bad for humanity, and I include also some possible outcomes from aligned AGI in there.
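To make that concrete, here is a minimal toy sketch in Python; every probability and utility in it is a number I just made up, which is exactly the problem:

```python
# Toy expected-value comparison for "build AGI" vs "don't build AGI".
# Every probability and utility below is a made-up subjective guess;
# the point is that nudging one of them flips the conclusion.

def expected_value(outcomes):
    """Sum of probability-weighted utilities."""
    return sum(p * u for p, u in outcomes)

# One set of subjective guesses: (probability, utility)
build_agi = [
    (0.7, +100),   # aligned AGI, large benefits
    (0.3, -1000),  # misaligned AGI, catastrophe
]
dont_build = [
    (1.0, -10),    # forgo the benefits, keep the status quo
]

print(expected_value(build_agi), expected_value(dont_build))
# -230.0 vs -10.0 -> "don't build" wins

# Nudge one subjective number (P(misalignment): 0.3 -> 0.05) and it flips:
build_agi = [(0.95, +100), (0.05, -1000)]
print(expected_value(build_agi), expected_value(dont_build))
# 45.0 vs -10.0 -> "build" wins
```

Whoever picks the inputs picks the verdict.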
I do think there’s a non-trivial probability of that happening, compared to other alignment people who think the chance is effectively epsilon.
I don’t think it’s that improbable either, I just think it’s irresponsible either way when so much is at stake. I think the biggest possible points of failure of the doom argument are:
we just aren’t able to build AGI any time soon (but in that case the whole affair turns out to be much ado about nothing), or
we are able to build AGI, but then AGI can’t really push past to ASI. This might be purely a matter of chance, or the result of us using approaches that merely “copy” human intelligence but aren’t able to transcend it (for example, if becoming superintelligent would require being trained on text written by superintelligent entities).
So, sure, we may luck out, though that leaves us “only” with already plenty disruptive human-level AGI. Regardless, this makes the world potentially a much more unstable powder keg. Even without going specifically down the road EY mentions, I think nuclear and MAD analogies do apply because the power in play is just that great (in fact I am writing a post on this; it will go up tomorrow if I can finish it).
It’s always possible to fudge the numbers and decide that some values are unimportant and some are super important and lo and behold, the calculation turns in your favour! In the end it’s no better than deontology or simply saying “I think this is good”; there is no point trying to vest it with a semblance of objectivity that just isn’t there.
Is this not simply the fallacy of gray?
As the saying goes, it’s easy to lie with statistics, but even easier to lie without them. Certainly you can fudge the numbers to make the result say anything, but if you show your work, the fudging gets more obvious.
I agree that laying out your thinking at least forces you to specifically elucidate your values. That way people can criticise the precise assumptions they disagree with, and you can’t easily back out of them. I don’t think the “lying with statistics” saying applies in its original meaning because really this is entirely about subjective terminal values. “Because I like it this way” is essentially what it boils down to no matter how you slice it.
In the end it’s no better than deontology or simply saying “I think this is good”; there is no point trying to vest it with a semblance of objectivity that just isn’t there.
You’re right that it isn’t an objective calculation, and it evidently rests on subjective assumptions, so I’ll agree that we really shouldn’t be treating it as though it were one.
I don’t think it’s that improbable either, I just think it’s irresponsible either way when so much is at stake.
I agree that testing that hypothesis is dangerously irresponsible, given the stakes involved. That’s why I still support alignment work.
If success without dignity happens, I think it will be due to some of the following factors:
1. Alignment turns out to be really easy by default; that is, naive ideas like RLHF just work, or value learning turns out to be almost trivial (see the sketch after this list).
2. Corrigibility is really easy or trivial to do, such that alignment isn’t relevant, because humans can redirect an AI’s goals easily. In particular, it’s easy to get AIs to respect a shutdown order.
3. We can’t make AGI, or it’s too hard to progress from AGI to ASI.
These are the major factors I view as likely in a success-without-dignity case, i.e. one where we survive AGI/ASI via luck.
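For concreteness, here is a minimal sketch of the kind of “naive idea” I mean in point 1: the Bradley–Terry preference loss at the core of RLHF reward modelling. The model, shapes, and names are illustrative assumptions, not any particular lab’s implementation:

```python
# Minimal sketch of a reward model trained with the Bradley-Terry
# preference loss used in RLHF. Everything here (model size, the use of
# random tensors as stand-ins for response embeddings) is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a response embedding to a scalar score."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(rm, chosen, rejected):
    """Push the reward of the human-preferred response above the
    reward of the rejected one (pairwise logistic / Bradley-Terry loss)."""
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

# Hypothetical usage: random vectors stand in for embedded responses.
rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
chosen = torch.randn(8, 128)    # embeddings of preferred responses
rejected = torch.randn(8, 128)  # embeddings of dispreferred responses

loss = preference_loss(rm, chosen, rejected)
loss.backward()
opt.step()
```

If something this simple, plus fine-tuning against the learned reward, were enough to instil the values we care about, that would be the “easy by default” world.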
I find 1 unlikely, 2 almost impossible (or rather, it would imply partial alignment, in which case you at least managed to instil Asimov’s Second Law of Robotics into your AGI above all else), and 3 the most likely, but also unstable (what if your 10^8 instances of AGI engineers suddenly achieve a breakthrough after 20 years of work?). So this doesn’t seem particularly satisfying to me.
Responding to your #1, do you think we’re on track to handle the cluster of AGI Ruin scenarios pointed at in 16-19? I feel we are not making any progress here other than towards verifying some properties in 17.
16: outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction.
17: on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they’re there, rather than just observable outer ones you can run a loss function over.
18: There’s no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is ‘aligned’.
19: there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment.