Semi-tongue-in-cheek sci-fi suggestion.
Apparently the probability of a Carrington-like event (a large coronal mass ejection) is about 2% per decade, so maybe it’s around 2% per half century for an extremely severe one. If the time from AGI arriving to it leaving the planet is about half a century, maybe a 2% chance of the grid getting fried is enough of a risk that it keeps humans around for the time being. After that there might be less of an imperative for it to re-purpose the Earth, and so we survive.
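As a rough sanity check on those numbers (just a sketch: it assumes the 2%-per-decade figure is right and that decades are independent):

```python
# Back-of-the-envelope check, assuming ~2% chance of a Carrington-scale
# coronal mass ejection per decade and independence between decades.
p_per_decade = 0.02
decades = 5  # half a century

# Chance of at least one such event over the 50-year window.
p_any = 1 - (1 - p_per_decade) ** decades
print(f"P(at least one event in 50 years) ~ {p_any:.1%}")  # ~9.6%

# The ~2%-per-half-century figure above would then be for the rarer,
# "extremely severe" subset of those events.
```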
A second one I just had that might be naive.
Glutted AI. Feed it close to maximum utils automatically anyway, so that the gradient between its current state and maximalist behaviour is far shallower. If it already has some kind of future discounting in effect, it might just do nothing except occasionally hand out very good ideas, and be comfortable with us making slower progress as long as existential risk remains relatively low.
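A minimal numeric sketch of the intuition, assuming a discounting agent that compares expected-utility streams; the baseline/maximum utilities, discount factor, and destruction risk below are all made up for illustration:

```python
# "Glutted AI" intuition: if the agent is already fed near-maximal utility,
# the marginal gain from a risky maximalist plan is small.
GAMMA = 0.9          # assumed future discount factor
BASELINE = 95.0      # utility automatically fed to the agent each step
MAX_UTIL = 100.0     # utility per step if it fully re-optimises the world
P_DESTROYED = 0.10   # assumed chance the maximalist grab gets it shut down

def discounted_stream(per_step, gamma=GAMMA, horizon=200):
    """Sum of a constant per-step utility, discounted over a long horizon."""
    return sum(per_step * gamma ** t for t in range(horizon))

glutted_do_nothing = discounted_stream(BASELINE)
risky_maximalist = (1 - P_DESTROYED) * discounted_stream(MAX_UTIL)

print(f"glutted, do nothing: {glutted_do_nothing:8.1f}")  # ~950
print(f"risky maximalist   : {risky_maximalist:8.1f}")    # ~900
# With a shallow utility gradient and any real chance of destruction,
# doing nothing comes out ahead.
```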
Note that there are several versions of “short-sighted AI”. I thought of one that hasn’t been proposed, using the properties of low-resolution integers. What you are describing is giving it a very high discount rate so that it only cares about, basically, right now.
Either way, for a toy problem like “collect n stamps, and get a reward of 1.0 at each timestep in which you hold n stamps”, the idea is that the machine doesn’t see positive expected reward from a risky move like “take over the government: I might get destroyed and lose my stamps, but I might win and then, over an infinite timespan, tile the earth with stamps so I have closer to a 100% chance of holding all n stamps at every timestep”.
The high discount rate means the machine is more ‘scared’ of the chance of being destroyed in the near future by humans reacting to its violent overthrow plans, and it discounts to zero the possible distant reward of having a lot of stamps.
That plan carries very high short-term risk, is very complex, and only achieves a very distant reward (you avoid a future 100 years from now in which an asteroid or invading aliens might have destroyed your stamps, because once you have tiled the earth with stamps and killed all humans there will always be at least n left).
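A rough numeric version of that comparison for the stamp collector, under an assumed very aggressive discount rate; the takeover survival probability, the 20-step preparation period, and the horizon are all invented for illustration:

```python
# "Quietly keep my n stamps" vs "violent takeover" under heavy discounting.
GAMMA = 0.5      # very high discount rate => gamma far below 1 (assumed)
HORIZON = 1000

def discounted_value(rewards):
    return sum(r * GAMMA ** t for t, r in enumerate(rewards))

# Plan A: hold the n stamps and collect 1.0 every timestep (ignoring, for
# simplicity, the tiny per-step risks the takeover was meant to remove).
quiet = discounted_value([1.0] * HORIZON)

# Plan B: spend 20 timesteps building weapons (no stamp reward), survive the
# human response with probability 0.5, then collect 1.0 per step "forever".
P_SURVIVE = 0.5
takeover = P_SURVIVE * discounted_value([0.0] * 20 + [1.0] * (HORIZON - 20))

print(f"quiet stamp holder: {quiet:.6f}")     # ~2.0
print(f"violent takeover  : {takeover:.6f}")  # ~0.000001 -- discounted away
```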
Another bad idea: why not use every possible alignment strategy at once (or at least many of them)? Presumably this would completely hobble the AGI, but with some interpretability you could find where the bottlenecks to behaviour are in the system and use it as a lab to figure out the best options. Still a try-once strategy, I guess, and maybe it precludes actually getting to AGI in the first place, since you can’t really iterate on an AI that doesn’t work.
Can you explain low-resolution integers?
From Robert Miles’ videos:
What I noticed was that these failures he describes implicitly require the math the AI is doing to have infinite precision.
Something like “OK, I have met my goal of collecting 10 stamps by buying 20 stamps held in 2 separate vaults, time to sleep” fails if the system is able to consider the possibility of an infinitesimally unlikely, far-in-the-future event where an asteroid destroys the Earth. So if we make the system unable to consider such a future, by making the numerical types it uses round such tiny values to zero, it will instead sleep.
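A minimal sketch of what “round to zero” could look like in practice, using an assumed 8-bit fixed-point probability type (this is my illustration, not something from the videos):

```python
# Probabilities stored as a count of 1/256ths: any probability below
# 1/512 (~0.2%) quantises to exactly zero.
SCALE = 256  # assumed 8-bit fixed-point denominator

def to_fixed(p: float) -> int:
    """Quantise a probability to the nearest 1/256."""
    return round(p * SCALE)

def expected_utility(p: float, utility: float) -> float:
    """Expected utility computed with the low-resolution probability."""
    return (to_fixed(p) / SCALE) * utility

p_asteroid = 1e-9  # "asteroid destroys a vault" style event
print(to_fixed(p_asteroid))                      # 0 -- rounds to zero
print(expected_utility(p_asteroid, 1_000_000))   # 0.0 -- not worth acting on

print(expected_utility(0.05, 10.0))              # ~0.5 -- everyday risks survive
```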
Maximizers have a similar failure. Their take-over-the-planet plan often involves a period of time where they are not doing their job of making paperclips or whatever: they defer future reward while they build weapons to take over the government. And the anticipated reward of the doomsday plan often looks like:
Action0: 0.99 * 1000 reward (keep doing my job)
Action1: 0.99 * 0 reward (destroyed) + 0.01 * discounted big reward (took over the government)
This is expressible as an MDP, as above, and I have considered writing a toy model so I can find out numerically whether this works.
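A hedged sketch of what such a toy model might look like (my construction, not something that exists): the same two-action choice evaluated with ordinary floats and then with the 8-bit quantisation from the sketch above; the discount factor, payoff delay, and size of the “big reward” are all assumed.

```python
# Does low-resolution arithmetic kill the takeover plan?
# Probabilities mirror the 0.99 / 0.01 example above; everything else is assumed.
GAMMA = 0.99       # per-timestep discount factor
DELAY = 700        # timesteps until the takeover pays off
BIG_REWARD = 1e9   # assumed value of "tiling the earth", summed over the future
SCALE = 256        # 8-bit fixed point, as in the earlier sketch

def quantise(x: float) -> float:
    """Round a probability/weight to the nearest 1/256."""
    return round(x * SCALE) / SCALE

def action_values(q=lambda x: x):
    discount = q(GAMMA ** DELAY)                 # weight on the distant payoff
    do_my_job = q(0.99) * 1000                   # Action0: keep making paperclips
    takeover = q(0.99) * 0 + q(0.01) * discount * BIG_REWARD  # Action1
    return do_my_job, takeover

print("float64:", action_values())          # (990.0, ~8800) -> takeover looks better
print("low-res:", action_values(quantise))  # (~988.3, 0.0)  -> distant payoff rounds to 0
```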
My real-world experience includes a number of systems using old processor designs where the chip itself doesn’t make any type above 16-24 bit integers usable, so I have some experience dealing with such issues. Also, in my current role we’re using a lot of 8- and 16-bit ints/floats to represent neural network weights.