It looks a bit to me like your Timestep Dominance Principle forbids the agent from selecting any trajectory which loses utility at a particular timestep in exchange for greater utility at a later timestep, regardless of whether the trajectory in question actually has anything to do with manipulating the shutdown button? After all, conditional on the shutdown button being pressed at any point after the local utility loss but before the expected gain, such a decision would give lower sum-total utility within those conditional trajectories than one which doesn’t make the sacrifice.
That doesn’t seem like behavior we really want; depending on how closely together the “timesteps” are spaced, it could even wreck the agent’s capabilities entirely, in the sense of no longer being able to optimize within button-not-pressed trajectories.
(It also doesn’t seem to me a very natural form for a utility function to take, assigning utility not just to terminal states, but to intermediate states as well, and then summing across the entire trajectory; humans don’t appear to behave this way when making plans, for example. If I considered the possibility of dying at every instant between now and going to the store, and permitted myself only to take actions which Pareto-improve the outcome set after every death-instant, I don’t think I’d end up going to the store, or doing much of anything at all!)
It looks a bit to me like your Timestep Dominance Principle forbids the agent from selecting any trajectory which loses utility at a particular timestep in exchange for greater utility at a later timestep
That’s not quite right. If we’re comparing two lotteries, one of which gives lower expected utility than the other conditional on shutdown at some timestep and greater expected utility than the other conditional on shutdown at some other timestep, then neither of these lotteries timestep-dominates the other. And then the Timestep Dominance Principle doesn’t apply, because it’s a conditional rather than a biconditional. The Timestep Dominance Principle just says: if X timestep-dominates Y, then the agent strictly prefers X to Y. It doesn’t say anything about cases where neither X nor Y timestep-dominates the other. For all we’ve said so far, the agent could have any preference relation between such lotteries.
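To make the point concrete, here’s a minimal sketch of the comparison being described. The representation is my own simplification, not anything from the post: each lottery is reduced to its expected utilities conditional on shutdown at each timestep, and one lottery timestep-dominates another when it does at least as well under every such conditional and strictly better under at least one.

```python
def timestep_dominates(x, y):
    """True iff lottery x does at least as well as y conditional on
    shutdown at every timestep, and strictly better at some timestep.

    x, y: lists where entry t is E[utility | shutdown at timestep t].
    (A hypothetical encoding chosen for illustration.)
    """
    assert len(x) == len(y)
    return (all(xi >= yi for xi, yi in zip(x, y))
            and any(xi > yi for xi, yi in zip(x, y)))


# The case from the comment: X sacrifices utility at timestep 1 for a
# larger gain at timestep 2, Y makes no sacrifice.
x = [5.0, 3.0, 9.0]   # E[utility | shutdown at t] for X
y = [5.0, 4.0, 6.0]   # E[utility | shutdown at t] for Y

print(timestep_dominates(x, y))  # False: X is worse given shutdown at t=1
print(timestep_dominates(y, x))  # False: Y is worse given shutdown at t=2
```

Since neither lottery timestep-dominates the other, the principle as stated imposes no preference between them; it only bites when the dominance relation actually holds in one direction.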
That said, your line of questioning is a good one, because there almost certainly are lotteries X and Y such that (1) neither of X and Y timestep-dominates the other, and yet (2) we want the agent to strictly prefer X to Y. If that’s the case, then we’ll want to train the agent to satisfy other principles besides Timestep Dominance. And there’s still some figuring out to be done here: what should these other principles be? Can we find principles that lead agents to pursue goals competently without those principles causing trouble elsewhere? I don’t know, but I’m working on it.
It also doesn’t seem to me a very natural form for a utility function to take, assigning utility not just to terminal states, but to intermediate states as well, and then summing across the entire trajectory
Can you say a bit more about this? Humans don’t reason by Timestep Dominance, but they don’t do explicit expected-utility-maximization (EUM) calculations either, and yet EUM-representability is commonly considered a natural form for preferences to take.