A few points here:
We don’t have the option to “trick our reward systems forever”—e.g. because becoming a heroin addict tends to be self-destructive. If [guaranteed 80-year continuous heroin high followed by painless death] were an option, many people would take it (though not all).
The divergence between stated preferences and revealed preferences is exactly what we’d expect to see in worlds where we’re constantly “tricking our reward system” in small ways: our revealed preferences are not what we think they “actually should be”.
We tend to classify as “large” only those ways of tricking our reward systems that are highly self-destructive. It’s not surprising that we tend to observe few of these, since evolution tends to frown upon highly self-destructive behaviour.
Again, I’d ask for an example of a world plausibly reachable through an evolutionary process where we don’t have the kind of coherence and consistency you’re talking about.
Being completely orthogonal to evolution clearly isn’t plausible, since we wouldn’t be here. (I note that when I don’t care about x, I sacrifice x to get what I do care about; I don’t take actions that are neutral with respect to x. So beings whose motivations were fully orthogonal to fitness would have sacrificed fitness freely, and wouldn’t have persisted.)
Being not-entirely-in-line with evolution, and not-entirely-in-line with our stated preferences is exactly what we observe.