Jacob G-W comments on Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format

Jacob G-W Mar 17, 2025, 3:53 AM
7 points
7
I think the way you use utility monster is not how it is normally used. It’s normally used to mean an agent that “receives much more utility from each unit of a resource that it consumes than anyone else does” (https://en.wikipedia.org/wiki/Utility_monster).
- Roland Pihlakas Mar 18, 2025, 2:39 AM
  6 points
  0
  Parent
  I renamed the phenomenon to “runaway optimiser”. I hope this label illustrates the inappropriately unbounded and single-minded nature of the failure modes we observed. How does that sound to you? Does it capture the essence of the phenomena described in the post?
  - Jacob G-W Mar 18, 2025, 3:05 AM
    2 points
    0
    Parent
    Better, thanks!
- Roland Pihlakas Mar 17, 2025, 4:41 PM
  1 point
  −3
  Parent
  Thank you for pointing that out! I agree, there are a couple of nuances. Our perspective can be treated as a generalisation of the original utility monster scenario, although I do not consider it the first such generalisation: think of the examples in Bostrom’s book.
  
  1) In our case, the dilemma is not “agent versus others”, but instead “one objective versus other objectives”. One objective seems to get more internal/subjective utility from consumption than another objective. Thus the agent focuses on a single objective only.
  2) Consideration of homeostatic objectives introduces a new aspect to the utility monster problem: the behaviour of the original utility monster looks unaligned to begin with, not just dominating. It is unnatural for a being to benefit from indefinite consumption. It looks like the original utility monster has an eating disorder! It enjoys eating apples so much that it does not care about the consequences for its future (“other”) self. This means that even the utility monster may actually suffer from “too much consumption”, but it does not recognise this and therefore consumes indefinitely. Alternatively, just as a paperclip maximiser does not produce the paperclips for itself, if the utility monster is an agent, then somebody else suffers from homeostasis violations while the agent is being “helpful” in an unaligned and naive way. Technically, this can be seen as a variation of the multi-objective problem: active avoidance of overconsumption could be treated as an “other” objective, while consumption is the dominating and inaccurately linear “primary” objective with non-diminishing utility.
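
To make the contrast in point 2 concrete, here is a minimal sketch of a linear, non-diminishing utility next to a homeostatic one. The function names, the quadratic penalty, the setpoint of 5 and the cap of 20 units are all illustrative assumptions, not taken from the post or its benchmarks.

```python
# Illustrative sketch only: the names, setpoint, and numbers below are
# assumptions made for this example, not part of the original benchmarks.

def linear_utility(consumed: float) -> float:
    """Non-diminishing utility: every extra unit is worth the same."""
    return consumed


def homeostatic_utility(consumed: float, setpoint: float = 5.0) -> float:
    """Inverted-U utility: peaks at the setpoint, with both under- and
    overconsumption penalised quadratically."""
    return -(consumed - setpoint) ** 2


def best_amount(utility, max_units: int = 20) -> int:
    """Return the whole-unit consumption in 0..max_units that maximises utility."""
    return max(range(max_units + 1), key=utility)


print(best_amount(linear_utility))       # consumes as much as it is allowed to
print(best_amount(homeostatic_utility))  # stops at the setpoint
```

Under the linear utility the optimum is always whatever the cap allows, so the agent only stops when the environment stops it. Under the homeostatic utility the optimum sits at the setpoint, however much more is available.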
  
  In conclusion, our perspective is a generalisation: whether the first objective is for the agent’s own benefit and the other objective is for the benefit of others is left unspecified in our case. Likewise, violating homeostasis can be a scenario where an unaligned agent gets a lot of internal/subjective “utility” from making you excessively happy or from overfeeding you, while you are the one who suffers from overwhelm or overconsumption.
  
  I hope that clears things up. I am also curious: would you like to suggest an alternative short name for the phenomena described in this post?