I remember reading the EJT post and leaving some comments there. The basic conclusions I arrived at are:
The transitivity property is actually important and necessary: one can construct money-pump-like situations if it isn’t satisfied. See this comment.
If we keep transitivity but not completeness, and follow a strategy of not making choices inconsistent with our previous choices, as EJT suggests, then we no longer have a single consistent utility function. However, it looks like the behaviour can still be roughly described as “picking a utility function at random, and then acting according to that utility function”. See this comment.
In my current thinking about non-coherent agents, the main toy example I like to think about is the agent that maximizes some combination of the entropy of its actions and their expected utility. That is, the probability of taking an action a is proportional to exp(βE[U|a]). By tuning β we can affect whether the agent cares more about entropy or utility. This closely resembles RLHF-finetuned language models: they’re trained both to achieve a high rating and to not have too great a relative entropy with respect to the prior implied by pretraining.
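As a quick illustration (my own sketch, not from the post), here is that toy agent in code. The softmax form π(a) ∝ exp(βE[U|a]) is exactly the policy that maximizes E[U] + (1/β)·H(π), so β interpolates between a uniform maximum-entropy policy (β → 0) and a pure utility maximizer (β → ∞):

```python
import numpy as np

def soft_optimal_policy(expected_utility, beta):
    """Action distribution pi(a) proportional to exp(beta * E[U|a]).

    beta -> 0 recovers the uniform (maximum-entropy) policy;
    beta -> infinity approaches the argmax (pure utility maximizer).
    """
    logits = beta * np.asarray(expected_utility, dtype=float)
    logits -= logits.max()  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()      # normalize into a probability distribution

# Three actions with expected utilities 1, 2, 4 (illustrative numbers).
utilities = [1.0, 2.0, 4.0]

print(soft_optimal_policy(utilities, beta=0.0))   # uniform: cares only about entropy
print(soft_optimal_policy(utilities, beta=1.0))   # trades off entropy and utility
print(soft_optimal_policy(utilities, beta=50.0))  # near-argmax: cares only about utility
```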
Note that if the distribution of utility under the prior is heavy-tailed, you can get arbitrarily high expected utility at arbitrarily low relative-entropy cost, so the optimal policy is undefined. In the case of goal misspecification, optimization with a KL penalty may therefore be unsafe, or get no better utility than the prior.
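A numerical sketch of this effect (my own construction, with a uniform prior, a Pareto sample standing in for a heavy-tailed utility distribution, and an "ε-tilt" policy that moves a little probability mass onto the best outcome): with heavy tails, the tilted policy buys a large utility gain for a tiny KL cost, whereas with light tails the same KL budget buys almost nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
eps = 1e-3  # extra probability mass moved onto the single best outcome

def gain_per_kl(utilities, eps):
    """Tilt the uniform prior p by moving eps mass to the argmax outcome;
    return (expected-utility gain over p, KL(pi || p)) for the tilted policy pi."""
    n = len(utilities)
    p = np.full(n, 1.0 / n)
    pi = p * (1 - eps)
    pi[np.argmax(utilities)] += eps
    gain = pi @ utilities - p @ utilities
    kl = np.sum(pi * np.log(pi / p))
    return gain, kl

light = rng.normal(size=n)        # light-tailed utilities: max grows like sqrt(log n)
heavy = rng.pareto(1.1, size=n)   # heavy-tailed utilities: max grows polynomially in n

for name, u in [("normal", light), ("pareto", heavy)]:
    gain, kl = gain_per_kl(u, eps)
    print(f"{name}: utility gain {gain:.4f} at KL cost {kl:.5f}")
```

The KL cost of the tilt is roughly the same in both cases, but the heavy-tailed gain is orders of magnitude larger; letting n grow makes the gain-per-KL ratio diverge, which is the sense in which the KL-penalized optimum is undefined.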