I think this method works well as an extra layer of precaution to go along with another measure of reduced impact. On its own, it has a few issues, some of which you cover.
First of all, I’d replace the utility function with a reward function, specifically one that provides rewards for past achievements. Why? Well, in general, utility functions give too much of an incentive to keep control of the future. “Create a subagent and turn yourself off” is my general critique of these kinds of methods; if the subagent is powerful enough, the best policy for the agent could be to create it and then turn itself off, even for T=1 or some similarly low value.
Having a reward function on past achievements precludes that, and it also means the agent is not incentivised to continue past T; indeed part of the definition of the reward could be that it stops at T.
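To make the contrast concrete, here’s a minimal sketch of what I mean (my own illustration, with made-up names, not a formalism from the post): the return is computed only over achievements already realised by the horizon, so anything the agent sets in motion for after T is worthless to it.

```python
# Sketch only: reward is paid solely for achievements time-stamped at or before
# the horizon T, so outcomes that only materialise after T contribute nothing.

def episode_return(achievement_log, T):
    """Sum rewards for achievements recorded no later than the horizon T.

    achievement_log: list of (timestamp, reward) pairs logged during the run.
    T: the fixed horizon; anything after it is ignored by construction.
    """
    return sum(reward for timestamp, reward in achievement_log if timestamp <= T)

# A powerful subagent that only pays off after the horizon adds nothing,
# so "build a successor and shut yourself down early" stops being a winning policy.
log = [(3, 1.0), (7, 2.0), (12, 5.0)]      # the 5.0 only arrives after T
print(episode_return(log, T=10))           # -> 3.0
```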
When using human judgements, normally the risk is that the AI is incentivised to fool us. Here, however, the AI is on a time budget, so it might find it easier to be “honest” than to put in the time and effort to fool us. Another approach is to use indifference so that it doesn’t care about the human decision ahead of time.
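For the indifference option, one standard construction (sketched here in my own notation, assuming the human decision is a single binary event D made at time t_d) adds a compensatory term at the moment of the decision, so that the agent’s expected reward is the same whichever way the decision goes:

```latex
% Sketch of a utility-indifference correction; D, t_d and h_{t_d} are my own
% notation, not anything defined in the post.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Let $D \in \{\text{approve}, \text{veto}\}$ be the human decision at time $t_d$,
and $h_{t_d}$ the history up to that point. Define the corrected reward
\[
  R' \;=\; R \;+\; \mathbf{1}[D = \text{veto}]
  \Big( \mathbb{E}[R \mid D = \text{approve}, h_{t_d}]
      - \mathbb{E}[R \mid D = \text{veto}, h_{t_d}] \Big).
\]
Then $\mathbb{E}[R' \mid D = \text{approve}, h_{t_d}]
    = \mathbb{E}[R' \mid D = \text{veto}, h_{t_d}]$,
so ahead of the decision the agent gains nothing in expectation by
manipulating which way it goes.
\end{document}
```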
General relativity doesn’t seem much of an issue. Just pick a reference frame (say one centred on the AI at time t=0 and with relative velocity 0 to the AI) and define “T” as the t=T hyperplane. Because of the (reverse) triangle inequality, any path the AI takes to reach this hyperplane will give it τ≤T proper time in which to act. If we worry about wormholes and such, we could even define T to be the minimum of the reference frame time (t) and the AI’s proper time (τ), to be really sure that the AI doesn’t get too much time to think.
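Spelling out the inequality (a sketch in flat spacetime, which is the simplifying assumption behind the frame-picking argument; curvature and wormholes are what the extra min(t, τ) clause is for):

```latex
% Proper time along any timelike path to the t = T hyperplane, in the chosen frame.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
For any timelike worldline the AI follows from its starting event to the
$t = T$ hyperplane, the proper time it experiences is
\[
  \tau \;=\; \int_0^T \sqrt{1 - v(t)^2/c^2}\;\mathrm{d}t \;\le\; \int_0^T \mathrm{d}t \;=\; T,
\]
with equality only if it stays at rest in that frame. This is the reverse
triangle inequality for the Minkowski metric: detours through spacetime
\emph{lose} proper time rather than gain it, so the AI never gets more than
$T$ of subjective time before hitting the deadline hyperplane.
\end{document}
```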