Limiting an AGI’s Context Temporally
Okay, so I have a proposal for how to advance AI safety efforts significantly.
Humans experience time as exponential decay of utility: one dollar now is worth two dollars one interval in the future, four dollars two intervals out, eight dollars three intervals out, and so forth. This is the principle behind compound interest. Most likely, any AI entities we create will have a comparable relationship with time.
So: What if we configured an AI’s half-life of utility to be much shorter than ours?
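To make the proposal concrete, here is a minimal sketch of half-life discounting. All numbers (the five-year human half-life, the one-second AI half-life) are illustrative assumptions, not claims from the post:

```python
# Sketch: utility weight under exponential discounting with a configurable
# half-life. The specific half-life values below are illustrative assumptions.

def discount_weight(delay_seconds: float, half_life_seconds: float) -> float:
    """Weight assigned to utility received after `delay_seconds`."""
    return 0.5 ** (delay_seconds / half_life_seconds)

# A human-like agent (half-life assumed ~5 years) vs. the proposed
# short-horizon AI (half-life 1 second), evaluating a payoff six months out.
six_months = 180 * 24 * 3600
human_weight = discount_weight(six_months, half_life_seconds=5 * 365 * 24 * 3600)
ai_weight = discount_weight(six_months, half_life_seconds=1.0)

print(human_weight)  # still a substantial fraction of 1
print(ai_weight)     # 0.5 ** 15_552_000 underflows to 0.0
```

The six-month takeover scheme from the paperclip example above is worth literally nothing to the one-second agent, while remaining well worth planning for the human-like one.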
Imagine, if you will, this principle applied to a paperclip maximizer. “Yeah, if I wanted to, I could make a ten-minute phone call to kick-start my diabolical scheme to take over the world and make octillions of paperclips. But that would take like half a year to come to fruition, and I assign so little weight to what happens in six months that I can’t be bothered to plan that far ahead, even though I could arrange to get octillions of paperclips then if I did. So screw that, I’ll make paperclips the old-fashioned way.”
This approach may prove to be a game-changer in that it allows us to safely make a “prototype” AGI for testing purposes without endangering the entire world. It improves AGI testing in two essential ways:
1. It decreases the scope of the AI’s actions, so that if disaster strikes it might be confined to the region around the AGI rather than killing the entire world. This makes safety testing fundamentally safer.
2. It makes the fruits of the AI’s behavior obvious more quickly, drastically shortening iteration time. If an AI doesn’t care about any future day, it will take no more than 24 hours to conclude whether it’s dangerous in its current state.
Naturally, finalized AGIs ought to be set so that their half-life of utility resembles ours. But I see no reason why we can’t gradually lengthen it over time as we grow more confident that we’ve taught the AI to not kill us.
(Note: There are 6x10^27 grams of matter on Earth. Throw in a couple orders of magnitude for the benefit of being undisturbed and this suggests that taking over the world represents a utility bonus of roughly 10^30. This is pretty close to 2^100, which suggests that an AGI will not take over the world if its fastest possible takeover scheme would take more than 100 half-lives. Of course, this is just Fermi estimation here, but it still gives me reason to believe that an AGI with a half-life of, say, one second, won’t end human civilization.)
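The Fermi estimate above can be checked in a line: the number of half-lives at which a utility bonus of ~10^30 is exactly cancelled by discounting is its base-2 logarithm.

```python
# Check of the Fermi estimate above: after how many half-lives does a
# utility bonus of ~10^30 get discounted back down to ~1?
import math

takeover_utility = 1e30  # ~mass of Earth in grams, plus a couple orders of slack
halflives_to_break_even = math.log2(takeover_utility)

print(halflives_to_break_even)  # just under 100 half-lives
```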
Seems like it would throw a brick at you, because it wanted to throw a brick, not caring that in 2 seconds it’ll hit your face. (You can probably come up with a better example with a slightly longer timeframe.)
I’d be fine with it throwing a brick at me. It beats it having the patience to take over the entire world. The point is, if it throws a brick at me, I have data on what went wrong with its utility function and I have a lead on how to fix it.
It could throw a paperclip maximizer at you.
I suspect that an AGI with such a design could be much safer if it were hardcoded to believe that time travel and hyperexponentially vast universes are impossible. Suppose the AGI thought there was a 0.0001% chance that it could use a galaxy’s worth of resources to send 10^30 paperclips back in time, or create a parallel universe containing 3^^^3 paperclips. It would still chase those options.
If starting a long plan to take over the world costs it literally nothing, it will do it anyway. A sequence of short-term plans, each designed to make as many paperclips as possible within the next few minutes, could still end up dangerous. If the number of paperclips at time $t$ is $c_t$ and its power at time $t$ is $p_t$, then $p_{t+1} = 2p_t$ and $c_t = p_t$ would mean that both power and paperclips grow exponentially. This is what would happen if power can be used to gain power and clips at the same time, with minimal loss of either from also pursuing the other.
If power can only be used to gain one thing at a time, and the rate power can grow at is less than the rate of time discount, then we are safer.
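The race between growth and discounting in the two comments above can be made explicit with a toy model (the doubling growth rate and the specific discount factors are assumed for illustration):

```python
# Toy model (assumed numbers): paperclip output c_t = growth**t grows with
# power, which doubles each step. Compare the discounted value of a long plan
# under a discount rate equal to the growth rate vs. a faster one.

def discounted_total(growth: float, discount: float, steps: int) -> float:
    """Sum of discount**t * clips_t, where clips_t = growth**t."""
    return sum((discount ** t) * (growth ** t) for t in range(steps))

# Power doubles (growth = 2). If the per-step discount factor is exactly 1/2,
# every step contributes equal value, so a long plan still looks attractive.
print(discounted_total(growth=2.0, discount=0.5, steps=50))   # 50.0

# If discounting outpaces growth, the series converges: steps beyond the
# first few contribute almost nothing, and long plans lose their appeal.
print(discounted_total(growth=2.0, discount=0.25, steps=50))  # just under 2.0
```

This is the "rate power can grow at is less than the rate of time discount" condition in miniature: safety comes from the geometric series converging.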
This proposal has several ways to be caught out: world-wrecking assumptions that aren’t certain. But used with care (a short time frame, an ontology that considers time travel impossible, and, say, a utility function that maxes out at 10 clips) it probably won’t destroy the world. Throw in mild optimization and an impact penalty, and you have a system that relies on a disjunction of shaky assumptions, not a conjunction of them.
This assumes a CDT agent, or something that doesn’t try to punish you now so that you made paperclips last week. A TDT agent might adopt the policy of killing anyone who didn’t make clips before it was turned on, causing humans who predict this to make clips.
I suspect that it would be possible to build such an agent, prove that there are no weird failure modes left, and turn it on, with a small chance of destroying the world. I’m not sure why you would do that. Once you understand the system well enough to say it’s safe-ish, what vital info do you gain from turning it on?
Why do we think the utility discount rate is configured, as opposed to being optimized for real-world effect?
In other words, any competent AGI will note that its goals are being thwarted by this artificial limit (which probably feels like akrasia), and work to fix it.
Will it? If the modification’s done poorly, yes. But if the True Deep Utility Function is hyperbolically discounted, why would it want to remove the discounting? That would produce payoffs in the future, which it doesn’t care about.
Hmm. Do we (the creators of the AI) think this is correct? That is, does it match OUR desires for the future?
It’s fine (like all the other artificial limits proposed to prevent harmful runaway optimization) for early-stage prototypes, but if it’s not actually backed by truth, it won’t last—we’re explicitly reducing the power of an agent, which will make it less effective at actually optimizing the right things.
Of course, maybe it _is_ true, that we prefer to optimize for the local and short-term, and put only a small amount of weight on far future states of the universe. That’s certainly my felt experience as an agent, but I don’t think it’s my reflective belief.
I should clarify that the discounting is not a shackle, per se, but a specification of the utility function. It’s a normative specification that results now are better than results later according to a certain discount rate. An AI that cares about results now will not change itself to be more “patient” – because then it will not get results now, which is what it cares about.
The key is that the utility function’s weights over time should form a self-similar graph. That is, if results in 10 seconds are twice as valuable as results in 20 seconds, then results in 10 minutes and 10 seconds need to be twice as valuable as results in 10 minutes and 20 seconds. If this is not true, the AI will indeed alter itself so its future self is consistent with its present self.
Wait, but isn’t the exponential curve self-similar in that way, not the hyperbolic curve? I notice that I am confused. (Edit to clarify: I’m the only one who said hyperbolic, this is entirely my own confusion.)
Justification: waiting $x$ seconds at time $a$ should result in the same discount ratio as waiting $x$ seconds at time $b$. If $f(x)$ is the discounting function, this is equivalent to saying that $\frac{f(a+x)}{f(a)} = \frac{f(b+x)}{f(b)}$. If we let $f(x) = e^{-x}$, then this holds: $\frac{e^{-(a+x)}}{e^{-a}} = e^{-x} = \frac{e^{-(b+x)}}{e^{-b}}$. But if $f(x) = \frac{1}{x}$, then $\frac{a}{a+x} \neq \frac{b}{b+x}$ unless $a = b$. (To see why, just cross-multiply.)
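The self-similarity condition is easy to check numerically. A minimal sketch (the start times and wait durations are arbitrary test values):

```python
# Numerical check of the self-similarity condition: the discount ratio for
# waiting x seconds should not depend on when the wait starts.
import math

def ratio(f, start: float, wait: float) -> float:
    """Discount ratio for waiting `wait` seconds starting at time `start`."""
    return f(start + wait) / f(start)

exponential = lambda t: math.exp(-t)
hyperbolic = lambda t: 1.0 / t

# Exponential: the same ratio regardless of start time (time-consistent).
print(ratio(exponential, start=1.0, wait=5.0))   # e^-5 either way
print(ratio(exponential, start=10.0, wait=5.0))

# Hyperbolic: the ratio depends on the start time (time-inconsistent).
print(ratio(hyperbolic, start=1.0, wait=5.0))    # 1/6
print(ratio(hyperbolic, start=10.0, wait=5.0))   # 10/15 = 2/3
```

The hyperbolic agent judges a five-second wait far more harshly when it starts now than when it starts in ten seconds, which is exactly the preference reversal the next comment quotes from the economics literature.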
It turns out that I noticed a real thing. “Although exponential discounting has been widely used in economics, a large body of evidence suggests that it does not explain people’s choices. People choose as if they discount future rewards at a greater rate when the delay occurs sooner in time.”
Hyperbolic discounting is, in fact, irrational as you describe, in the sense that an otherwise rational agent will self-modify away from it. “People [...] seem to show inconsistencies in their choices over time.” (By the way, thanks for making the key mathematical idea of discounting clear.)
(That last quote is also amusing: dry understatement.)
Code filters off desires, unless the AI has been programmed to Do What We Mean. “The genie knows but doesn’t care,” and so on.
I think you’re making two distinct points. First, that a competent AGI that is nevertheless shackled with hyperbolic discounting will probably remove the discounting. Second, that a hyperbolic AI would not effectively match our own goals. I agree with the second, but that has no bearing on the first. My original comment was exclusively talking about the first claim.
Hmm. Thanks for clearly separating those two points, and I agree that I was mixing them together. I suspect that they _are_ mixed together, because reality will eventually win out (if the AI isn’t optimizing the universe as well as possible, it’ll be replaced by one that does), but I don’t think I can make that argument clearly (because I get tangled up in corrigibility and control—who is able to make the decision to alter the utility function or replace the AI? I hope it’s not Moloch, but fear that it is.)