I tend to think of this through the lens of the AIXI model: what assumptions does it make, and what does it predict? First, one assumes that the environment is an unknown element of the class of computable probability distributions (those induced by probabilistic Turing machines). Then the universal distribution is a highly compelling choice, because it dominates this class while also staying inside it. Unfortunately the computability level does worsen when we consider optimal action based on this belief distribution. Next we must express some coherent preference ordering over action/percept histories, which can be represented as a utility function by the VNM theorem. Hutter further assumed that the utility could be expressed as a reward signal, which is a kind of locality condition, but I don’t think it is necessary for the model to be useful. The utility representation is convenient because it allows us to write down a clean specification of AIXI’s behavior, relating its well-specified belief distribution and utility function to action choice. It is true that, setting aside the reward representation, choosing an arbitrary utility function can justify any action sequence for AIXI (I haven’t seen this proven, but it seems trivial because AIXI assigns positive probability to any finite history prefix). In a way, though, this misses the point: the mathematical machinery we’ve built up allows us to translate conclusions about AIXI’s preference ordering into conclusions about its sequential action choices, and vice versa, through the intermediary step of constraining its utility function.
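For concreteness, here is a sketch of the specification I have in mind, written in the standard expectimax form. The notation is my own assumption rather than anything fixed above: $\mathcal{M}$ is the environment class, $w_\nu > 0$ are prior weights, $\xi$ is the universal mixture, $a_k$ and $e_k$ are the action and percept at step $k$, $m$ is a finite horizon, and $u$ is a utility function on complete histories. The finite horizon is a simplification; other horizon or discounting choices are possible.

```latex
% Universal mixture and the dominance property (for every \nu in \mathcal{M}):
\xi(e_{1:n} \,\|\, a_{1:n})
  \;=\; \sum_{\nu \in \mathcal{M}} w_\nu \, \nu(e_{1:n} \,\|\, a_{1:n})
  \;\ge\; w_\nu \, \nu(e_{1:n} \,\|\, a_{1:n})

% Action choice at time t, given the history a_1 e_1 \ldots a_{t-1} e_{t-1}:
a_t \;=\; \arg\max_{a_t} \sum_{e_t} \max_{a_{t+1}} \sum_{e_{t+1}} \cdots
          \max_{a_m} \sum_{e_m}
          u(a_1 e_1 \ldots a_m e_m) \; \xi(e_{1:m} \,\|\, a_{1:m})

% Hutter's reward representation as a special case,
% where the reward r_k is read off from the percept e_k:
u(a_1 e_1 \ldots a_m e_m) \;=\; \sum_{k=1}^{m} r_k
```

The first line is what "dominates" means here: the mixture never assigns less than a fixed positive fraction of the probability that any single environment in the class assigns. The second line is the sense in which beliefs ($\xi$) and preferences ($u$) jointly determine action, and the third shows that the reward signal only restricts the form of $u$ without changing the rest of the construction.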