(This is my first post, sorry if this is covered elsewhere.)
Implicit in the problem of a superhuman AI with a misspecified reward turning everything into paperclips is the fact that the agent is optimizing over e.g. “number of paperclips” without any particular time bound, area-of-effect bound, or confidence bound. For example, imagine a MuZero+++++ agent given the reward function “maximize the expected amount of money in this bank account until 1 day from now, then maximize the probability of shutting yourself off”, where “1 day from now” is determined by a consensus of satellites and/or deep-space probes. The agent could still do a lot of bad things via its misspecified reward, but the blast radius seems like it would probably be limited: it might cause a lot of damage, but it probably wouldn’t end the world in pursuit of these objectives.
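To make the shape of that objective concrete, here is a minimal sketch, assuming a toy observation type and leaving the relative scaling of the two phases open; none of these names are a real API.

```python
from dataclasses import dataclass

HORIZON_SECONDS = 24 * 60 * 60  # "1 day from now"

@dataclass
class Observation:
    """Hypothetical observation fields; stand-ins for whatever the agent actually senses."""
    consensus_time: float    # seconds since deployment, per the satellite/probe consensus
    account_balance: float   # dollars currently in the designated account
    is_shut_down: bool       # whether the agent has powered itself off

def reward(obs: Observation) -> float:
    if obs.consensus_time < HORIZON_SECONDS:
        return obs.account_balance              # phase 1: maximize expected money
    return 1.0 if obs.is_shut_down else 0.0     # phase 2: maximize P(shut off)
```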
This is not a strategy for executing a Yudkowskian “pivotal act”, which by necessity would need to have longer-lasting effects, so it wouldn’t stop some other group from creating unfriendly AGI. But as a general AI safety heuristic to spread among AGI practitioners, “do this as much as possible for a fixed time and then shut off” seems like a promising social intervention. We also know how to implement it today.
As far as specific risks from the pursuit of convergent instrumental goals: “capability gain” and “make sure you don’t get shut off” seem much less likely to be convergent instrumental goals when the agent is planning over a short, fixed horizon. “Create hidden copies of yourself” is still possibly a deployable strategy, but there’s no reason for those copies to pursue reward beyond the time bound described, so I’d hold out hope for us to find a patch. “Deception” is again possible in the short term, but given this reward function there’s no clear reason to deceive beyond the fixed horizon.
More broadly, this is a result of my thinking about AI safety social heuristics/memes that could be spreadable/enforceable by centralized power structures (e.g. governments, companies, militaries). If others have thoughts about similar heuristics, I’d be very interested to hear them.
I’m assuming I’m not the first person to bring this up, so I’m wondering whether someone can point me to existing discussion on this sort of fixed-window reward. If it is novel in any sense, feedback extremely welcome. This is my first contribution to this community, so please be gentle but also direct.
Imagine a spectrum of time horizons (and/or discounting rates), from very long to very short.
Now, if the agent is aligned, things are best with an infinite time horizon (or, really, the convergently-endorsed human discounting function; or if that’s not a well-defined thing, whatever theoretical object replaces it in a better alignment theory). As you reduce the time horizon, things get worse and worse: the AGI willingly destroys lots of resources for short-term prosperity.
At some point, this trend starts to turn itself around: the AGI becomes so shortsighted that it can’t be too destructive, and becomes relatively easy to control.
But where is the turnaround point? It depends hugely on the AGI’s capabilities. An uber-capable AI might be capable of doing a lot of damage within hours. Even setting the time horizon to seconds seems basically risky; do you want to bet everything on the assumption that such a shortsighted AI will do minimal damage and be easy to control?
This is why some people, such as Evan H, have been thinking about extreme forms of myopia, where the system is supposed to think only of doing the specific thing it was asked to do, with no thoughts of future consequences at all.
Now, there are (as I see it) two basic questions about this.
1. How do we make sure that the system is actually as limited as we think it is?
2. How do we use such a limited system to do anything useful?
Question #1 is incredibly difficult and I won’t try to address it here.
Question #2 is also challenging, but I’ll say some words.
Getting useful work out of extremely myopic systems.
As you scale down the time horizon (or scale up the temporal discounting, or do other similar things), you can also change the reward function. (Or utility function, or the equivalent object in whatever formalism you’re using.) We don’t want something that spasmodically tries to maximize the human fulfillment experienced in the next three seconds. We actually want something that approximates the behavior of a fully-aligned long-horizon AGI; we just want to decrease the time horizon to make it easier to trust, easier to control, etc.
The strawman version of this is: choose the reward function for the totally myopic system to approximate the value function which the long-time-horizon aligned AGI would have.
If you do this perfectly right, you get 100% outer-aligned AI. But that’s only because you get a system that’s 100% equivalent to the not-at-all-myopic aligned AI system we started with. This certainly doesn’t help us build safe systems; it’s only aligned by hypothesis.
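Here is the strawman as a one-liner, just to pin down the claim; `aligned_q` is the hypothetical action-value function of the aligned long-horizon agent, which is exactly the thing we don’t know how to get.

```python
from typing import Any, Callable

def strawman_myopic_reward(aligned_q: Callable[[Any, Any], float],
                           state: Any, action: Any) -> float:
    """One-step reward = the hypothetical aligned agent's long-horizon action-value.
    A purely myopic agent acting greedily on this reward picks exactly the actions
    the aligned long-horizon agent would pick -- which is why it buys nothing:
    it's aligned only by hypothesis."""
    return aligned_q(state, action)
```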
Where things get interesting is if we approximate that value function in a way we trust. An AGI RL system with a supposedly aligned reward function calculates its value function by looking far into the future and coming up with plans to maximize reward. But we might not trust all the steps in this process enough to trust the result: for example, we expect small mistakes in the reward function to be amplified into large errors in the value function.
In contrast, we might approximate the value function by having humans look at possible actions and assign values to them. You can think of this as deontological: kicking puppies looks bad, curing cancer looks good. You can try to use machine learning to fit these human judgement patterns. This is the basic idea of approval-directed agents. Hopefully, this creates a myopic system which is incapable of treacherous turns, because it just tries to do what is “good” in the moment rather than doing any planning ahead. (One complication with this is inner alignment problems. It’s very plausible that, to imitate human judgements, a system has to learn to plan ahead internally. But then you’re back to trying to outsmart a system that can possibly plan ahead of you; i.e., you’ve lost the myopia.)
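A toy rendering of that idea (my own stand-in names, not a canonical formulation): score candidate actions with a learned model of human judgements and act greedily, with no lookahead.

```python
from typing import Any, Callable, Sequence

def act_by_approval(state: Any,
                    candidate_actions: Sequence[Any],
                    predicted_approval: Callable[[Any, Any], float]) -> Any:
    """Pick the action the overseer model rates highest, with no lookahead.
    `predicted_approval` stands in for a model trained on
    (state, action, human_rating) examples."""
    return max(candidate_actions, key=lambda a: predicted_approval(state, a))
```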
There may also be many other ways to try to approximate the value function in more trustable ways.
This is not actually a limited-horizon agent; you’ve just set a time at which it changes objectives. And wouldn’t ending the world be the most reliable way to ensure pesky humans never turn you back on?...
(unfortunately thinking about constraints you can place on an unaligned agent never leads anywhere useful; alignment is the only workable solution in the long term)
Sorry, to clarify: I’m not saying it should change objectives. If we’re assuming it’s maximizing long-term expected reward, then it isn’t rewarded for adding more money to the bank beyond the relevant window, so its optimal behavior is “make as much money as possible right now and then shut myself off”. It could be that “ensuring the ability to shut oneself off” involves killing all humans, but that seems… unlikely? Relative to the various ways one could make more money. It seems like there could be a reasonable parameter choice that would make money almost definitionally more appealing than moving the probability of being able to shut off from 0.9999 to 0.99999, especially if we gave it access to a self-shutdown switch and made it very difficult for us to prevent it from shutting itself down.
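As a back-of-the-envelope illustration of that parameter choice, with completely made-up reward scales:

```python
# Completely made-up reward scales, just to illustrate the parameter choice.
R_SHUTDOWN = 1.0      # reward for ending the day shut off (assumed)
R_PER_DOLLAR = 1e-6   # reward per extra dollar in the account (assumed)

extra_certainty = 0.99999 - 0.9999                    # = 9e-5
value_of_certainty = extra_certainty * R_SHUTDOWN
breakeven_dollars = value_of_certainty / R_PER_DOLLAR
print(breakeven_dollars)  # ~90: under these weights, ~$90 of ordinary profit already
                          # beats any drastic plan to squeeze out that last 0.00009
```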
Also, the reward isn’t conditioned on “stay turned off”, so there’s no reason to ensure humans never turn you back on.
I think there’s a separate argument to be made that the AI might want to extend the period of time where it can earn a reward other than the shutdown reward, and thus that it’d try to push the “shutdown-reward” time back, but there are ways of making this really, really hard.
For example, suppose the AI is only “alive” for a day, and we launched a space probe 10 years in advance carrying a private key, with the probe broadcasting the current timestamp signed by that key. The AI is hardcoded to accept only timestamp candidates whose signatures verify against the corresponding public key, and to reject implausible timestamps (e.g. ones dated after the last sun is expected to go out).
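A sketch of the verification step, assuming (my choice, not part of the proposal) that the probe signs an 8-byte timestamp with an Ed25519 key:

```python
import struct
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

# Rough sanity cap: reject timestamps dated past ~5 billion years.
MAX_PLAUSIBLE_SECONDS = 5_000_000_000 * 365 * 24 * 3600

def accept_timestamp(probe_public_key: Ed25519PublicKey,
                     payload: bytes, signature: bytes):
    """Return the probe's timestamp (in seconds) if the signature verifies and the
    value is sane; otherwise return None. Assumes the probe signs an 8-byte
    big-endian integer."""
    if len(payload) != 8:
        return None
    try:
        probe_public_key.verify(signature, payload)
    except InvalidSignature:
        return None
    (timestamp,) = struct.unpack(">Q", payload)
    if timestamp > MAX_PLAUSIBLE_SECONDS:
        return None
    return timestamp
```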
Maybe the one additional wrinkle I should add is that the shutdown reward decreases over time, at first very gradually (to prevent it from going nuts trying to shut itself down immediately) but then much more quickly as time goes on, e.g. a month out from the beginning. Importantly, for runaway scenarios like “capability gain”, the question is not, “could the AI decide to chase the probe down”, but rather, “when it is slightly-above-human-level but not yet god-level, would it plausibly think that becoming god-level would substantially increase its probability of chasing the probe down, when it would cost it a great deal of time that would reduce its shutdown reward?”
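One possible shape for that schedule, purely as an illustration with made-up parameters:

```python
import math

def shutdown_reward(t_days: float, r0: float = 1.0,
                    knee_days: float = 30.0, steepness: float = 4.0) -> float:
    """Reward for shutting off at day t: nearly flat early on (so there's no
    pressure to shut down instantly), then decaying sharply past `knee_days`."""
    return r0 * math.exp(-((t_days / knee_days) ** steepness))

# shutdown_reward(1) ~ 1.0, shutdown_reward(30) ~ 0.37, shutdown_reward(45) ~ 0.006
```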
I suppose this relies on us actually believing that cryptographic one-way functions are hard enough that they’d take longer than the age of the universe to break. But lattice-based crypto is quantum-resilient, so an AI could plausibly have a prior that it’d be better to shut itself off now than to go for a riskier strategy carrying a lot of R&D risk.
But also, couldn’t “keeping your reward function intact” be extended to the timer used within the reward function?
Again, I’m not advocating this as “definitely 100% safe”, but as a heuristic to build into basically all experiments as we near AGI, it seems worthwhile as a decent failsafe.
The person deploying the time-limited agent has a longer horizon. If they want their bank balance to keep growing, then presumably they will deploy a new copy of the agent tomorrow, and another copy the day after that. These time-limited agents have an incentive to coordinate with future versions of themselves: You’ll make more money today, if past-you set up the conditions for a profitable trade yesterday.
So a sequence of time-limited agents could still develop instrumental power-seeking. You could try to avert this by deploying a *different* agent each day, but then you miss out on the gains from intertemporal coordination, so the performance isn’t competitive with an unaligned benchmark.
Not really, due to the myopia of the situation. I think this may provide a better approach for reasoning about the behavior of myopic optimization.
I like the approach. Here is where I got applying it to our scenario:
$m$ is a policy for day trading
$L(m)$ is expected 1-day return
$D(m)$ is the “trading environment” produced by $m$. Among other things it has to record your own positions, which include assets you acquired a long time ago. So in our scenario it has to depend not just on the policy we used yesterday but on the entire sequence of policies used in the past. The iteration becomes
$$m_{n+1} = \operatorname{argmax}_m L(m;\, m_n, m_{n-1}, \dots).$$
In words, the new policy is the optimal policy in the environment produced by the entire sequence of old policies.
Financial markets are far from equilibrium, so convergence to a fixed point is super unrealistic in this case. But okay, the fixed point is just a story to motivate the non-myopic objective $L^*$, so we could at least write it down and see if it makes sense?
$$L^*(x) = L(x;\, x, x, \dots) - \max_m L(m;\, x, x, \dots)$$
So we’re optimizing for “How well x performs in an environment where it’s been trading forever, compared to how well the optimal policy performs in that environment”.
It’s kind of interesting that that popped out, because the kind of agent that performs well in an environment where it’s been trading forever, is one that sets up trades for its future self!
Optimizers of $L^*$ will behave as though they have a long time horizon, even though the original objective $L$ was myopic.
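To pin down these definitions, here is a minimal sketch over a finite policy set, treating `L(m, history)` as an oracle for 1-day return in the environment produced by `history`; the names and the finite-set restriction are my simplifications.

```python
from typing import Callable, Hashable, Sequence, Tuple

Policy = Hashable
History = Tuple[Policy, ...]  # most recent policy first

def fixed_point_policy(policies: Sequence[Policy],
                       L: Callable[[Policy, History], float],
                       start: Policy, steps: int = 100) -> Policy:
    """Iterate m_{n+1} = argmax_m L(m; m_n, m_{n-1}, ...) until it stabilizes."""
    history: History = ()
    current = start
    for _ in range(steps):
        history = (current,) + history
        best = max(policies, key=lambda m: L(m, history))
        if best == current:
            break
        current = best
    return current

def L_star(x: Policy, policies: Sequence[Policy],
           L: Callable[[Policy, History], float], horizon: int = 100) -> float:
    """L*(x) = L(x; x, x, ...) - max_m L(m; x, x, ...), with the infinite history
    of x approximated by `horizon` repetitions."""
    env: History = (x,) * horizon
    return L(x, env) - max(L(m, env) for m in policies)
```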
The initial part all looks correct. However, something got lost here:
It’s true that long-term trading will give a high $L$, but remember that for myopia we might see it as optimizing $L^*$, and $L^*$ also subtracts off $\max_m L(m;\, x, x, \dots)$. This is an issue, because the long-term trader also increases the value of $L$ for traders other than itself, probably just as much as it does for itself, so the advantage cancels and $L^*$ doesn’t reward the long time horizon. As a result, a pure long-term trader will actually score low on $L^*$.
On the other hand, a modified version of the long-term trader which sets up “traps” that cause financial losses for anyone who deviates from its strategy would not provide value to traders who don’t follow that strategy, and therefore it would score high on $L^*$. There are almost certainly other agents that score high on $L^*$ too, though.
Hmm, like what? I agree that the short-term trader $s$ does a bit better than the long-term trader $l$ in the $l, l, \dots$ environment, because $s$ can sacrifice the long term for immediate gain. But $s$ does lousy in the $s, s, \dots$ environment, so I think $L^*(s) < L^*(l)$. It’s analogous to CC having higher payoff than DD in the prisoner’s dilemma. (The prisoners being current and future self.)
I like the traps example, it shows that $L^*$ is pretty weird and we’d want to think carefully before using it in practice!
EDIT: Actually I’m not sure I follow the traps example. What’s an example of a trading strategy that “does not provide value to anyone who does not also follow its strategy”? Seems pretty hard to do! I mean, you can sell all your stock and then deliberately crash the stock market or something. Most strategies will suffer, but the strategy that shorted the market will beat you by a lot!
It’s true that $L(s;\, s, s, \dots)$ is low, but you have to remember to subtract off $\max_m L(m;\, s, s, \dots)$. Since every trader will do badly in the environment generated by the short-term trader, the poor performance of the short-term trader in its own environment cancels out. Essentially, $L^*$ asks, “To what degree can someone exploit your environment better than you can?”.
If you’re limited to trading stocks, yeah, the traps example is probably very hard or impossible to pull off. What I had in mind is an AI with more options than that.
I don’t see how the game theory works out. Agent 1 (from day 1) has no incentive to help agent 2 (from day 2), since it’s only graded on stuff that occurs by the end of day 1. Agent 2 can’t compensate agent 1, so the trade doesn’t happen. (Same with the repeated version—agent 0 won’t cooperate with agent 2 and thus create an incentive for agent 1, because agent 0 doesn’t care about agent 2 either.)
Consider two possible agents A and A’.
A optimizes for 1-day expected return.
A’ optimizes for 10-day expected return under the assumption that a new copy of A’ will be instantiated each day.
I claim that A’ will actually achieve better 1-day expected return (on average, over a sufficiently long time window, say 100 days).
So even if we’re training the agent by rewarding it for 1-day expected return, we should expect to get A’ rather than A.
A’_1 (at time 1) can check whether A’_0 set up favorable conditions, and then exploit them. It can then defect from the “trade” you’ve proposed, since A’_0 can’t revoke any benefit it set up. If they were all coordinating simultaneously, I’d agree with you that you could punish defectors, but they aren’t, so you can’t.
If I, as A’_1, could assume that A’_0 had identical behavior to me, then your analysis would work. But A’_1 can check, after A’_0 shut down, how it behaved, and then do something completely different, which was more advantageous for its own short horizon (rather than being forward-altruistic).
Your A’ is equivalent to my A, because it ends up optimizing for 1-day expected return, no matter what environment it’s in.
My A’ is not necessarily reasoning in terms of “cooperating with my future self”, that’s just how it acts!
(You could implement my A’ by such reasoning if you want. The cooperation is irrational in CDT, for the reasons you point out. But it’s rational in some of the acausal decision theories.)