He gave a mathematical definition, and the meaning of the expectation operator was pretty clear in context (you can go read the thread I linked in the OP). It was not consistent with your conception.
First of all, the quick definition I gave in that thread was just meant to be a shortened version of the full definition, which I've written an entire post on and linked to in that thread, and which is pretty explicit about only looking at the objective function and not the training process.
Second, I think even the definition I gave in that thread is consistent with only looking at the objective function. The definition I gave there just says to ask whether all policies that maximize reward in expectation are aligned, where that expectation is meant to be taken over all data, not just the training data (that is, over the actual underlying MDP). Under that definition, only the objective function matters, not the training process or training data, since we're only looking at the actual optimal policy in the limit of perfect training and infinite data. As a result, my definition is quite different from yours, including in categorizing deception as always an inner alignment failure (see this comment).
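To make that concrete, here's a minimal sketch of the definition as I intend it (the notation $M$, $R$, and $\tau$ is mine for illustration here, not taken verbatim from the linked post). Let $M$ be the actual underlying MDP with reward function $R$, and write $\tau \sim (\pi, M)$ for trajectories generated by running policy $\pi$ in $M$. Then:

$$R \text{ is outer aligned} \;\iff\; \text{every } \pi^{*} \in \operatorname*{arg\,max}_{\pi} \, \mathbb{E}_{\tau \sim (\pi, M)}\!\big[R(\tau)\big] \text{ is aligned.}$$

Since the $\operatorname{arg\,max}$ is taken over the true MDP rather than over any finite training distribution, the training process and training data never enter into the definition.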