It looks a bit to me like your Timestep Dominance Principle forbids the agent from selecting any trajectory which loses utility at a particular timestep in exchange for greater utility at a later timestep.
That’s not quite right. If we’re comparing two lotteries, one of which gives lower expected utility than the other conditional on shutdown at some timestep and greater expected utility than the other conditional on shutdown at some other timestep, then neither of these lotteries timestep dominates the other. In that case the Timestep Dominance Principle doesn’t apply, because it’s a conditional rather than a biconditional. The Principle just says: if X timestep dominates Y, then the agent strictly prefers X to Y. It says nothing about cases where neither X nor Y timestep dominates the other. For all we’ve said so far, the agent could have any preference between such lotteries.
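To make that concrete, here’s a minimal sketch in Python (my own illustration, not anything official), which represents each lottery simply by its expected utility conditional on shutdown at each timestep, with made-up numbers:

```python
# Toy sketch: represent each lottery by a list of expected utilities,
# where entry t is the lottery's expected utility conditional on shutdown
# at timestep t. (A simplification: the real objects are lotteries over
# trajectories, not bare lists of numbers.)

def timestep_dominates(x, y):
    """True iff lottery x timestep-dominates lottery y: x's conditional
    expected utility is at least y's at every shutdown timestep, and
    strictly greater at at least one."""
    return all(xu >= yu for xu, yu in zip(x, y)) and any(
        xu > yu for xu, yu in zip(x, y)
    )

# X is better conditional on shutdown at timesteps 1 and 2; Y is better
# conditional on shutdown at timestep 0. Neither timestep-dominates the
# other, so the Timestep Dominance Principle is silent on how to rank them.
X = [2.0, 10.0, 10.0]
Y = [5.0, 4.0, 4.0]
assert not timestep_dominates(X, Y)
assert not timestep_dominates(Y, X)

# Z matches Y at every timestep and does strictly better at timestep 2,
# so Z timestep-dominates Y and the Principle requires preferring Z to Y.
Z = [5.0, 4.0, 6.0]
assert timestep_dominates(Z, Y)
```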
That said, your line of questioning is a good one, because there almost certainly are lotteries X and Y such that (1) neither of X and Y timestep dominates the other, and yet (2) we want the agent to strictly prefer X to Y. If that’s the case, then we’ll want to train the agent to satisfy other principles besides Timestep Dominance. And there’s still some figuring out to be done here: what should these other principles be? Can we find principles that lead agents to pursue goals competently without causing trouble elsewhere? I don’t know, but I’m working on it.
It also doesn’t seem to me a very natural form for a utility function to take, assigning utility not just to terminal states, but to intermediate states as well, and then summing across the entire trajectory.
Can you say a bit more about this? Humans don’t reason by Timestep Dominance, but they don’t do explicit EUM calculations either, and yet EUM-representability is commonly considered a natural form for preferences to take.