There are types of agent (a guided missile, for example) that you want to be willing, even under some circumstances eager, to die for you. An AI-guided missile should have no preference about when or whether it's fired, but once it has been fired, it should definitely want to survive long enough to hit an enemy target and explode. So you need the act of firing it to switch it from not caring at all about when it's fired to wanting to survive just long enough to hit an enemy target. Can you alter your preference training schedule to produce a similar effect?
Or, for a more Alignment-Forum-like example: a corrigible AI should have no preference about when it's asked to shut down, but should have a strong preference for shutting down as fast as is safe and practicable once it has been asked to do so. Again, this requires that it be indifferent to the trigger, but motivated once the trigger is received.
I assume this would require training on trajectories where the pre-trigger length varied and the AI had no preference over that variation, and also on trajectories where the post-trigger length varied and the loss function had strong opinions about that variation (and, for the guided-missile example, strong opinions about whether the trajectory ended with the missile being shot down or with it hitting and destroying a target).
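To make the intended structure concrete, here is a minimal sketch of what such a per-trajectory reward might look like. Everything here (the `Trajectory` fields, `POST_TRIGGER_PENALTY`, `HIT_BONUS`, the `reward` function) is an illustrative assumption of mine, not part of the post's proposal: the point is just that the reward contains no term involving pre-trigger length (indifference to the trigger) but penalizes post-trigger length and, for the missile case, rewards hitting the target.

```python
# Hypothetical sketch: a per-trajectory reward that is indifferent to
# pre-trigger length but strongly prefers short post-trigger behaviour.
# All names are illustrative assumptions, not any existing library's API.

from dataclasses import dataclass

@dataclass
class Trajectory:
    pre_trigger_steps: int    # steps before the trigger (firing / shutdown request)
    post_trigger_steps: int   # steps after the trigger
    hit_target: bool = False  # missile example: did the trajectory end in a hit?

POST_TRIGGER_PENALTY = 1.0    # cost per step taken after the trigger
HIT_BONUS = 100.0             # large bonus for destroying the target

def reward(traj: Trajectory) -> float:
    """Ignore pre-trigger length entirely (so training exerts no pressure to
    hasten or delay the trigger), penalize every post-trigger step, and reward
    ending the trajectory with a hit."""
    r = 0.0
    # Deliberately no term involving traj.pre_trigger_steps: indifference.
    r -= POST_TRIGGER_PENALTY * traj.post_trigger_steps
    if traj.hit_target:
        r += HIT_BONUS
    return r

if __name__ == "__main__":
    # Same reward regardless of when the trigger arrived:
    print(reward(Trajectory(pre_trigger_steps=5,  post_trigger_steps=3, hit_target=True)))
    print(reward(Trajectory(pre_trigger_steps=50, post_trigger_steps=3, hit_target=True)))
```

The training data would then mix trajectories with widely varying pre-trigger lengths (so the agent learns the trigger time is not something it controls or cares about) and widely varying post-trigger lengths (so it learns to act quickly once triggered).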
Yes, the proposal is compatible with agents (e.g. AI-guided missiles) wanting to avoid non-shutdown incapacitation. See this section of the post on the broader project.