Imagine a spectrum of time horizons (and/or discounting rates), from very long to very short.
Now, if the agent is aligned, things are best with an infinite time horizon (or, really, the convergently-endorsed human discounting function; or if that’s not a well-defined thing, whatever theoretical object replaces it in a better alignment theory). As you reduce the time horizon, things get worse and worse: the AGI willingly destroys lots of resources for short-term prosperity.
At some point, this trend starts to turn itself around: the AGI becomes so shortsighted that it can’t be too destructive, and becomes relatively easy to control.
But where is the turnaround point? It depends hugely on the AGI’s capabilities. An uber-capable AI might be capable of doing a lot of damage within hours. Even setting the time horizon to seconds seems basically risky; do you want to bet everything on the assumption that such a shortsighted AI will do minimal damage and be easy to control?
This is why some people, such as Evan H, have been thinking about extreme forms of myopia, where the system is supposed to think only of doing the specific thing it was asked to do, with no thoughts of future consequences at all.
Now, there are (as I see it) two basic questions about this.
1. How do we make sure that the system is actually as limited as we think it is?
2. How do we use such a limited system to do anything useful?
Question #1 is incredibly difficult and I won’t try to address it here.
Question #2 is also challenging, but I’ll say some words.
Getting useful work out of extremely myopic systems.
As you scale down the time horizon (or scale up the temporal discounting, or do other similar things), you can also change the reward function. (Or utility function, or whatever the equivalent thing is in your formalism.) We don’t want something that spasmodically tries to maximize the human fulfillment experienced in the next three seconds. We actually want something that approximates the behavior of a fully-aligned long-horizon AGI. We just want to decrease the time horizon to make it easier to trust, easier to control, etc.
The strawman version of this is: choose the reward function for the totally myopic system to approximate the value function which the long-time-horizon aligned AGI would have.
If you do this perfectly right, you get 100% outer-aligned AI. But that’s only because you get a system that’s 100% equivalent to the not-at-all-myopic aligned AI system we started with. This certainly doesn’t help us build safe systems; it’s only aligned by hypothesis.
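To make the strawman concrete, here is a minimal sketch on a toy problem. Everything in it (the four-state chain, the rewards, the discount of 0.9, the function names) is a made-up assumption for illustration, not something from any real system. It compares three agents: a long-horizon planner, a myopic agent that greedily maximizes the immediate reward, and a myopic agent whose one-step "reward" is the long-horizon action value.

```python
from functools import lru_cache

# Hypothetical 4-state chain MDP; all names and numbers are illustrative.
GAMMA = 0.9
N_STATES = 4
ACTIONS = (0, 1)  # 0 = stay, 1 = move right


def step(state, action):
    """Deterministic transition: return (next_state, reward)."""
    if action == 1 and state < N_STATES - 1:
        nxt = state + 1
        # Moving costs 1 now, but arriving at the last state pays 10.
        reward = 10.0 if nxt == N_STATES - 1 else -1.0
        return nxt, reward
    return state, 0.0  # staying is free and pays nothing


@lru_cache(maxsize=None)
def lookahead_value(state, depth):
    """Best discounted return achievable in `depth` more steps."""
    if depth == 0:
        return 0.0
    return max(
        r + GAMMA * lookahead_value(s2, depth - 1)
        for s2, r in (step(state, a) for a in ACTIONS)
    )


def plan(state, depth=10):
    """Long-horizon agent: pick the first action of the best plan."""
    def score(action):
        s2, r = step(state, action)
        return r + GAMMA * lookahead_value(s2, depth - 1)
    return max(ACTIONS, key=score)


def q_values(n_iters=200):
    """Long-horizon action values Q(s, a), computed by value iteration."""
    v = [0.0] * N_STATES
    for _ in range(n_iters):
        v = [
            max(r + GAMMA * v[s2] for s2, r in (step(s, a) for a in ACTIONS))
            for s in range(N_STATES)
        ]
    return {
        (s, a): step(s, a)[1] + GAMMA * v[step(s, a)[0]]
        for s in range(N_STATES)
        for a in ACTIONS
    }


Q = q_values()

# Myopic agent with the raw reward: only looks at the next step's payoff.
naive_myopic_policy = [
    max(ACTIONS, key=lambda a: step(s, a)[1]) for s in range(N_STATES)
]
# Myopic agent whose one-step "reward" is the long-horizon value Q(s, a).
myopic_on_q_policy = [
    max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)
]
# Non-myopic planner, for comparison.
planner_policy = [plan(s) for s in range(N_STATES)]
```

In this toy, the naive myopic agent refuses to pay the up-front cost and stays put in the early states, while the myopic agent that scores actions by Q behaves exactly like the planner. That is the equivalence described above: all of the long-term reasoning has simply been moved into the function the myopic agent maximizes.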
Where things get interesting is if we approximate that value function in a way we trust. An AGI RL system with a supposedly aligned reward function calculates its value function by looking far into the future and coming up with plans to maximize reward. But we might not trust all the steps in this process enough to trust the result. For example, we think small mistakes in the reward function tend to be amplified to large errors in the value function.
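One simple way to see how that amplification can happen, under assumptions I am adding here (a fixed policy $\pi$, a discount factor $\gamma < 1$, and a learned reward $\hat{r}$ that is off from the true reward $r$ by at most $\varepsilon$ everywhere), is the standard discounted-sum bound:

```latex
% Assumptions (not from the post): fixed policy \pi, discount \gamma \in [0,1),
% and |\hat{r}(s,a) - r(s,a)| \le \varepsilon for all s, a.
\[
\bigl| \hat{V}^{\pi}(s) - V^{\pi}(s) \bigr|
  = \Bigl| \mathbb{E}_{\pi}\Bigl[ \sum_{t=0}^{\infty} \gamma^{t}
      \bigl( \hat{r}(s_t, a_t) - r(s_t, a_t) \bigr) \,\Big|\, s_0 = s \Bigr] \Bigr|
  \le \sum_{t=0}^{\infty} \gamma^{t} \varepsilon
  = \frac{\varepsilon}{1 - \gamma}.
\]
```

The bound is attained when the error always has the same sign, so with $\gamma = 0.99$ a per-step reward error of $\varepsilon$ can show up as a value error of $100\,\varepsilon$. This only covers a fixed policy; an agent that optimizes against the mistaken reward can plausibly do worse, since it will seek out exactly the places where the error is in its favor.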
In contrast, we might approximate the value function by having humans look at possible actions and assign values to them. You can think of this as deontological: kicking puppies looks bad, curing cancer looks good. You can try to use machine learning to fit these human judgement patterns. This is the basic idea of approval-directed agents. Hopefully, this creates a myopic system which is incapable of treacherous turns, because it just tries to do what is “good” in the moment rather than doing any planning ahead. (One complication with this is inner alignment. It’s very plausible that to imitate human judgements, a system has to learn to plan ahead internally. But then you’re back to trying to outsmart a system that can possibly plan ahead of you; i.e., you’ve lost the myopia.)
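As a rough sketch of the shape of this idea, the toy below fits a tiny linear model to human ratings of actions and then picks whichever candidate action has the highest predicted approval, with no lookahead at all. The (harm, benefit) features, the ratings, the candidate actions, and all function names are hypothetical; a real approval-directed system would be far more involved, and nothing here addresses the inner alignment worry just mentioned.

```python
# Hypothetical human ratings: each action is described by (harm, benefit)
# features and given an approval score by a human overseer.
RATED_ACTIONS = [
    ((1.0, 0.0), -1.0),  # e.g. kicking a puppy: looks bad
    ((0.0, 1.0), 1.0),   # e.g. curing a disease: looks good
    ((0.0, 0.0), 0.0),   # doing nothing
    ((1.0, 1.0), 0.0),   # mixed
]


def fit_approval_model(data, lr=0.1, steps=2000):
    """Fit linear weights to the ratings by plain gradient descent on MSE."""
    w = [0.0, 0.0]
    n = len(data)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def predicted_approval(w, features):
    """Model's guess at how a human would rate an action with these features."""
    return sum(wi * xi for wi, xi in zip(w, features))


def choose_action(w, candidates):
    """Myopic choice: take the action with the highest predicted approval."""
    return max(candidates, key=lambda name: predicted_approval(w, candidates[name]))


WEIGHTS = fit_approval_model(RATED_ACTIONS)

# Hypothetical candidate actions the agent is choosing between right now.
CANDIDATES = {
    "harmful_shortcut": (0.8, 0.3),
    "helpful": (0.1, 0.7),
    "idle": (0.0, 0.0),
}
CHOSEN = choose_action(WEIGHTS, CANDIDATES)
```

The agent never simulates consequences; any long-term wisdom has to come from the human ratings the model was fit to. Whether a learned model powerful enough to imitate those ratings stays this simple internally is exactly the open question.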
There may also be many other ways to try to approximate the value function in more trustable ways.