agent that maximizes some combination of the entropy of its actions and their expected utility, i.e. the probability of taking an action a is proportional to exp(βE[U|a]), where β ≥ 0 sets how strongly expected utility is weighted against entropy.
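As a minimal sketch of the rule P(a) ∝ exp(βE[U|a]), the snippet below turns a list of expected utilities into action probabilities. The function name `boltzmann_policy` and the example utilities are hypothetical, chosen only for illustration; subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the resulting probabilities.

```python
import math

def boltzmann_policy(expected_utilities, beta):
    """Return P(a) proportional to exp(beta * E[U|a]) for each action a.

    `expected_utilities` is a list of E[U|a] values (illustrative input);
    `beta` >= 0 is the weight on utility relative to entropy.
    """
    scaled = [beta * u for u in expected_utilities]
    shift = max(scaled)  # subtract the max so exp() cannot overflow
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)  # the normalization factor
    return [w / total for w in weights]

# Hypothetical expected utilities for three actions.
utilities = [1.0, 2.0, 4.0]
for beta in (0.0, 1.0, 10.0):
    probs = boltzmann_policy(utilities, beta)
    print(beta, [round(p, 3) for p in probs])
```

With β = 0 the policy is uniform (maximum entropy); as β grows, probability concentrates on the action with the highest expected utility.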
Note that if the distribution of utility under the prior is heavy-tailed, expected utility can be made arbitrarily large while the relative entropy (KL divergence from the prior) is kept arbitrarily small, so the optimal policy is undefined. In the case of goal misspecification, optimization with a KL penalty may therefore be unsafe or get no better utility than the prior.
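The heavy-tail effect can be illustrated with a closed-form toy calculation. The sketch below assumes, purely for illustration, that utility under the prior follows a Pareto distribution with scale 1 and shape α > 1 (so the prior mean is finite), and considers a policy that mixes the prior with its own tail above a threshold t. The function name `tail_mixture` and the mixing weight ε = t^(-1/2) are hypothetical choices, not taken from the text.

```python
import math

def tail_mixture(t, alpha=2.0):
    """Expected utility and KL divergence of a tail-mixture policy.

    Assumption (illustrative only): U ~ Pareto(scale=1, shape=alpha) under
    the prior, with alpha > 1 and threshold t > 1.
    Policy: pi = (1 - eps) * prior + eps * prior(. | U > t), eps = t ** -0.5.
    Returns (E_pi[U], KL(pi || prior)).
    """
    eps = t ** -0.5
    q = t ** -alpha                      # prior probability that U > t
    prior_mean = alpha / (alpha - 1.0)   # E[U] under the prior
    tail_mean = prior_mean * t           # E[U | U > t] under the prior
    expected_utility = (1.0 - eps) * prior_mean + eps * tail_mean

    # The density ratio pi/prior is constant on each side of the threshold:
    # (1 - eps) + eps / q above t, and (1 - eps) below t.
    ratio_tail = (1.0 - eps) + eps / q
    mass_tail = q * ratio_tail           # probability of U > t under pi
    mass_body = (1.0 - q) * (1.0 - eps)  # probability of U <= t under pi
    kl = mass_tail * math.log(ratio_tail) + mass_body * math.log1p(-eps)
    return expected_utility, kl

for t in (1e2, 1e4, 1e8):
    utility, kl = tail_mixture(t)
    print(f"t={t:.0e}  E[U]={utility:.1f}  KL={kl:.4f}")
```

Under these assumptions, raising the threshold t increases expected utility without bound while the KL divergence from the prior shrinks toward zero, so no finite KL penalty picks out a best policy.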