Agents that explicitly represent their utility function are potentially vulnerable to sign flips.
What sorts of AI designs could not be made to pursue a flipped utility function via a perturbation in one spot? One quick guess: an AI that represents its utility function in several places and uses all of those representations for error correction, only pursuing the error-corrected utility function.
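A minimal sketch of that guess, assuming a linear utility over outcome features and redundant copies of the weight vector (all names and the voting scheme here are illustrative, not from the comment above):

```python
import numpy as np

# Hypothetical setup: the utility function is a weight vector stored in
# three independent copies. Before acting, the agent reconstructs each
# weight by element-wise majority vote (the median), so a sign flip in
# any single copy cannot change the utility it actually pursues.

def error_corrected_weights(copies: np.ndarray) -> np.ndarray:
    """Element-wise median across the redundant copies."""
    return np.median(copies, axis=0)

def utility(weights: np.ndarray, outcome: np.ndarray) -> float:
    """A simple linear utility over outcome features (illustrative)."""
    return float(weights @ outcome)

# Three redundant copies of the same weight vector.
true_weights = np.array([1.0, 2.0, -0.5])
copies = np.tile(true_weights, (3, 1))

# A perturbation "in one spot": flip the sign of one entire copy.
copies[1] *= -1.0

corrected = error_corrected_weights(copies)
outcome = np.array([0.3, 0.1, 0.6])

assert np.allclose(corrected, true_weights)
print(utility(corrected, outcome))  # matches the unperturbed utility
```

With three copies this tolerates corruption of any one of them; flipping two copies at once would of course defeat the vote, so the guarantee is only against a single-spot perturbation.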
Just a phrasing/terminology nitpick: I think this applies to agents with externally imposed utility functions. If an agent has a "natural" or "intrinsic" utility function which it publishes explicitly (and does not accept updates to that explicit form), I think the risk of representation bugs does not arise.