Continually-adjusted discounted preferences
A putative new idea for AI control; index here.
This is one of the more minor suggestions, just a small tweak to help solve a specific issue.
Discounting time
The issue is the strange behaviour that agents with discount rates have in respect to time.
Quickly, what probability would you put on time travel being possible?
I hope, as good Bayesians, you didn’t answer 0% (those who did should look here). Let’s assume, for argument’s sake, that you answered 0.1%.
Now assume that you have a discount rate of 10% per year (many putative agent designs use discount rates for convergence or to ensure short time-horizons, where the rates can be 90% per second or even more extreme). By the end of 70 years, the utility will be discounted to roughly 0.1%. Thus, from then on (plus or minus a few years), the highest expected value action for you is to search for ways of travelling back in time, and do all your stuff then.
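To make the arithmetic concrete, here is a tiny illustrative calculation, using only the numbers assumed above (a 10% per year discount and a 0.1% credence in time travel):

```python
# Back-of-the-envelope check of the argument above (illustrative numbers only:
# 10%/year discount, 0.1% credence that time travel is possible).
discount_per_year = 0.9          # keep 90% of the utility weight each year
p_time_travel = 0.001            # credence in time travel being possible

years = 70
future_weight = discount_per_year ** years   # weight on utility earned 70 years out
print(future_weight)             # ~0.0006, i.e. below the 0.1% credence in time travel

# Acting normally in 70 years is worth roughly U * future_weight.
# Finding a time machine and acting "now" (or earlier) is worth roughly
# U * p_time_travel, with essentially no discount applied.
# Once future_weight < p_time_travel, hunting for time travel dominates.
print(future_weight < p_time_travel)         # True
```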
This is perfectly time-consistent: given these premises, you’d want the “you in a century” to search frantically for a time-machine, as the expected utility increase they could achieve by any other means is tiny.
If you were incautious enough to have discount rates that go back into the past as well as the future, then you’d already be searching frantically for a time-machine, for the tiniest chance of going back to the big bang and having an impact there...
Continual corrigibility
We want the agents we design to apply the discount rate looking to the future only, not towards the past. To do so, we can apply corrigibility (see also here). This allows us to change an agent’s utility function, rewarding it (in utility) for any cost involved in the change.
The logical thing to do is to correct the agent’s utility function to something that doesn’t have such an extreme value in the past. At the moment of applying corrigibility, cut off the agent’s utility at the present moment, and replace the past values with something much smaller. You could just set it to zero (though as a mathematician my first instinct was to make it slope symmetrically down towards the past as it does towards the future, making the present the most important time ever!).
This correction could be applied regularly, maybe even continuously, removing the incentive to search desperately for ways to try and affect the past.
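As a rough sketch of what this continual adjustment might look like (the function name and conventions here are just illustrative, not a fixed design), the discount weight could be recomputed from the current moment each time the correction is applied, with the past either zeroed out or sloping symmetrically away from the present:

```python
# A minimal sketch, under assumed conventions: utility at time t is weighted by
# gamma**(t - now) looking forward, and the past weights are overwritten each
# time the correction is applied.
def adjusted_weight(t, now, gamma=0.9, past_mode="zero"):
    """Discount weight on utility at time t, as seen from time `now`."""
    if t >= now:
        return gamma ** (t - now)        # ordinary forward-looking discount
    if past_mode == "zero":
        return 0.0                       # the past simply doesn't count
    # symmetric option: the past slopes down just like the future,
    # making the present the most heavily weighted moment
    return gamma ** (now - t)

# Re-applying the correction at each step keeps `now` up to date, so the agent
# never sees an enormous undiscounted (or anti-discounted) past to aim for.
for now in [0, 1, 2]:
    print(now, [round(adjusted_weight(t, now), 3) for t in range(-2, 3)])
```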
Note that this is not a perfect cure: an AI could create subagents that research time-travel and come back to the present day to increase its current (though not future) utility, bringing extra resources. One way of reducing this risk could be to have a “maximal utility” (a bound on how high utility can go at any given point) that sharply reduces the possible impact of time-travelling subagents. This bound could be raised going into the future, to allow the AI more freedom to increase its utility.
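One hypothetical way such a bound could be arranged (the cap level and its growth schedule below are purely made-up parameters, not part of the proposal itself):

```python
# Hypothetical sketch: cap the utility the agent can credit at any time t,
# with the cap growing beyond the present so later gains are less constrained.
# The base cap and growth rate are assumptions chosen only for illustration.
def capped_utility(raw_utility, t, now, base_cap=100.0, growth=1.05):
    """Clip utility at time t to a cap that rises for t beyond `now`."""
    cap = base_cap * (growth ** max(0, t - now))   # flat cap up to the present
    return min(raw_utility, cap)

# A time-travelling subagent bringing huge resources back to the present can
# add at most `base_cap`, while ordinary future gains face a looser cap.
print(capped_utility(10_000.0, t=0, now=0))    # 100.0  (present: tightly capped)
print(capped_utility(10_000.0, t=50, now=0))   # ~1146.7 (future: looser cap)
```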
A more specific approach to dealing with subagents will be presented soon.
A more general method?
This is just a use of corrigibility to solve a specific problem, but it’s very possible that there are other problems corrigibility could be applied to with similar success: anything where the form of the utility function made sense at one point, but became a drag at a later date.