Can someone point me to discussion of the chance that an AGI catastrophe happens but doesn’t kill all humans? As a silly example that may still be pointing at something real, say someone builds the genie and asks it to end all strife on earth; it takes the obvious route of disintegrating the planet (no strife on earth ever again!), but now it stops (maxed out utility function, so no incentive to also turn the lightcone into computronium), and the survivors up in Elon’s Mars base take safety seriously next time?
If you can get an AGI to destroy all life on Earth but then stop and leave Mars alone, you’ve cracked AGI alignment; you can get AGIs to do big powerful things without side effects.
How would having an AGI that has a 50% chance of obliterating the lightcone, a 40% chance of obliterating just Earth, and a 10% chance of correctly producing 1,000,000 paperclips without casualties solve alignment?
Taking over the lightcone is the default behavior. If you can create an AGI which doesn’t do this, you’ve already figured out how to put some constraint on its activities. Notably, not destroying the lightcone implies that the AGI doesn’t create other AGIs which go off and destroy the lightcone.
When you say “create an AGI which doesn’t do this”, do you mean one that has about 0% probability of doing it, or one that has less than 100% probability of doing it?
Edit: my impression was that the point of alignment was producing an AGI that has a high probability of good outcomes and a low probability of bad outcomes. Creating an AGI that simply has a low probability of destroying the universe seems trivial: take a hypothetical AGI before it has produced any output, flip a coin, and if it’s tails, destroy it. Voilà, the probability of destroying the universe is now at most 50%. How can you even have a device that is guaranteed to destroy the universe if, at an early stage, it can be stopped by a sufficiently paranoid developer or a solar flare?
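To make the arithmetic of the coin flip explicit, here is a minimal sketch; the 99% base rate is an invented number, purely for illustration.

```python
import random

# Toy sketch of the coin-flip argument above. p_destroy is an invented,
# purely illustrative figure for how often the AGI destroys the universe
# once it is allowed to act.
p_destroy = 0.99

def one_trial() -> bool:
    """True if the universe is destroyed in this trial."""
    if random.random() < 0.5:           # tails: the developer destroys the AGI first
        return False
    return random.random() < p_destroy  # heads: the AGI acts with its usual odds

trials = 100_000
print(sum(one_trial() for _ in range(trials)) / trials)  # ~0.5 * p_destroy, i.e. at most 50%
```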
I don’t see how your scenario addresses the statement “Taking over the lightcone is the default behavior”. Yes, it’s obvious that you can build an AGI and then destroy it before you turn it on. You can also choose to just not build one at all with no coin flip. There’s also the objection that if you destroy it before you turn it on, have you really created an AGI, or just something that potentially might have been an AGI?
It also doesn’t stop other people from building one. If theirs destroys all human value in the future lightcone by default, then you still have just as big a problem.
I don’t see why all possible ways for an AGI to critically fail to do what we built it for must involve taking over the lightcone.
So let’s also blow up the Earth. By that definition, alignment would be solved.
Presumably a Bayesian reasoner using expected value would never reach max utility, because there would always be a non-zero probability that the goal hasn’t been achieved, and the course of action which increases its success estimate from 99.9999% to 99.99999999% probably involves turning part of the universe into computronium.
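One toy way to see this concretely (the probabilities are invented for illustration):

```python
# Toy sketch of why an expected-utility maximiser never "maxes out": as long as
# its estimate of P(goal achieved) is below 1, any plan that nudges that estimate
# upward scores higher than stopping, and the resources the plan consumes don't
# appear anywhere in the utility function.
GOAL_UTILITY = 1.0

def expected_utility(p_goal_achieved: float) -> float:
    return p_goal_achieved * GOAL_UTILITY

stop_now = expected_utility(0.999999)                      # "the goal is almost certainly done"
verify_with_more_compute = expected_utility(0.9999999999)  # double-check, using more of the universe

print(verify_with_more_compute > stop_now)  # True: the tiny gain still wins
```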
No one has yet solved “and then stop” for AGI, even though this should be easier than a generic stop button, which in turn should be easier than full corrigibility. (Also, I don’t think we know how to refer to things in the world in a way that gets an AI to care about the thing itself rather than about observations of it or its representation of it.)
This 12-minute Robert Miles video is a good introduction to the basic argument for why “stopping at destroying Earth, and not proceeding to convert the universe into computronium” is implausible.
An excellent video. But in my traditional role as “guy who is not embarrassed to ask stupid questions”, I have a nitpick:
I didn’t instantly see why the expected utility satisficer would turn itself into a maximiser.
After a few minutes of thinking I’ve got as far as:
Out of all the packets I could send, sending something that is a copy of me (with a tiny hack to make it a maximiser) onto another computer will create an agent which (since I am awesome) will almost certainly acquire a vast number of stamps. (See the toy sketch below.)
Which runs into a personal-identity-type problem:
*I’m* an agent. If my goal is to acquire a fairly nice girlfriend, then the plan “create a copy of me that wants the best possible girlfriend” doesn’t really help me achieve my goal. In fact it might be a bit of a distraction.
I’m pretty sure there’s a real problem here, but the argument in the video doesn’t slam the door on it for me. Can anyone come up with something better?
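For concreteness, here’s a toy sketch of the satisficer argument as I understood it from the video (all numbers invented): the satisficing criterion only asks whether a plan’s expected stamp count clears the threshold, so by itself it does nothing to rule out the “unleash a maximiser copy” plan.

```python
# Toy sketch with invented numbers: a satisficer only checks that a plan's
# *expected* stamp count clears its threshold, so "send out a maximiser copy
# of myself" passes the check just as easily as the modest plan does; nothing
# in the criterion penalises the extreme option.
THRESHOLD = 100  # "at least 100 stamps in expectation is good enough"

plans = {
    "order 100 stamps online": 100.0,            # expected stamps
    "send out a maximiser copy of myself": 1e9,  # also expected stamps
}

acceptable = [name for name, expected in plans.items() if expected >= THRESHOLD]
print(acceptable)  # both plans satisfice; the tie-break between them is left unspecified
```

(Whether the satisficer actually prefers the second plan is exactly the personal-identity-style question raised above.)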
Exactly what I was looking for, thanks!