I remember reading the EJT post and leaving some comments there. The basic conclusions I arrived at are:
The transitivity property is actually important and necessary: one can construct money-pump-like situations if it isn’t satisfied. See this comment.
If we keep transitivity but not completeness, and follow a strategy of not making choices inconsistent with our previous choices, as EJT suggests, then we no longer have a single consistent utility function. However, it looks like the behaviour can still be roughly described as “picking a utility function at random, and then acting according to that utility function”. See this comment.
In my current thinking about non-coherent agents, the main toy example I like to think about is the agent that maximizes some combination of the entropy of its actions and their expected utility. That is, the probability of taking an action a is proportional to exp(βE[U|a]). By tuning β we can affect whether the agent cares more about entropy or utility. This closely resembles RLHF-finetuned language models: they’re trained both to achieve a high rating and to not have too great a relative entropy with respect to the prior implied by pretraining.
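As a quick illustration (my own sketch, not from the post), here is that toy agent in code. The softmax form π(a) ∝ exp(βE[U|a]) is exactly the policy that maximizes E[U] + (1/β)·H(π), so β interpolates between a uniform maximum-entropy policy (β → 0) and a pure utility maximizer (β → ∞):

```python
import numpy as np

def soft_optimal_policy(expected_utility, beta):
    """Action distribution pi(a) proportional to exp(beta * E[U|a]).

    beta -> 0 recovers the uniform (maximum-entropy) policy;
    beta -> infinity approaches the argmax (pure utility maximizer).
    """
    logits = beta * np.asarray(expected_utility, dtype=float)
    logits -= logits.max()  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()      # normalize into a probability distribution

# Three actions with expected utilities 1, 2, 4 (illustrative numbers).
utilities = [1.0, 2.0, 4.0]

print(soft_optimal_policy(utilities, beta=0.0))   # uniform: cares only about entropy
print(soft_optimal_policy(utilities, beta=1.0))   # trades off entropy and utility
print(soft_optimal_policy(utilities, beta=50.0))  # near-argmax: cares only about utility
```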
Note that if the distribution of utility under the prior is heavy-tailed, you can get arbitrarily high expected utility at arbitrarily low relative-entropy cost, so the optimal policy is undefined. In the case of goal misspecification, optimization with a KL penalty may therefore be unsafe, or get no better utility than the prior.
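A numerical sketch of this effect (my own construction, with a uniform prior, a Pareto sample standing in for a heavy-tailed utility distribution, and an "ε-tilt" policy that moves a little probability mass onto the best outcome): with heavy tails, the tilted policy buys a large utility gain for a tiny KL cost, whereas with light tails the same KL budget buys almost nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
eps = 1e-3  # extra probability mass moved onto the single best outcome

def gain_per_kl(utilities, eps):
    """Tilt the uniform prior p by moving eps mass to the argmax outcome;
    return (expected-utility gain over p, KL(pi || p)) for the tilted policy pi."""
    n = len(utilities)
    p = np.full(n, 1.0 / n)
    pi = p * (1 - eps)
    pi[np.argmax(utilities)] += eps
    gain = pi @ utilities - p @ utilities
    kl = np.sum(pi * np.log(pi / p))
    return gain, kl

light = rng.normal(size=n)        # light-tailed utilities: max grows like sqrt(log n)
heavy = rng.pareto(1.1, size=n)   # heavy-tailed utilities: max grows polynomially in n

for name, u in [("normal", light), ("pareto", heavy)]:
    gain, kl = gain_per_kl(u, eps)
    print(f"{name}: utility gain {gain:.4f} at KL cost {kl:.5f}")
```

The KL cost of the tilt is roughly the same in both cases, but the heavy-tailed gain is orders of magnitude larger; letting n grow makes the gain-per-KL ratio diverge, which is the sense in which the KL-penalized optimum is undefined.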