I often get the impression that people weigh, e.g., doing shard theory alignment strategies under the shard theory alignment picture, versus inner/outer research under the inner/outer alignment picture, versus...
And insofar as this impression is correct, this is a mistake. There is only one way alignment is.
If inner/outer is altogether the more faithful picture of those dynamics:

- relatively coherent, singular mesa-objectives form in agents, albeit not necessarily always search-based
- more fragility of value and more difficulty in getting the mesa-objective just right, with little to nothing in the way of "consolation prizes" for slight mistakes in value loading
- possibly low path-dependence in the update process

then we have to solve alignment in that world.
If shard theory is altogether the more faithful picture, then we live under those dynamics instead:

- agents learn contextual distributions of values around e.g. "help people" or "acquire coins", some of which cohere and equilibrate into the agent's endorsed preferences and eventual utility function
- something like values handshakes and inner game theory occurs within the AI
- we can focus on getting a range of values endorsed, and thereby acquire value by being "at the bargaining table": some human-compatible values represent themselves in the final utility function
- which implies meaningful success and survival from "partial alignment"

And under these dynamics, inner and outer alignment are anti-natural, hard problems.
Or maybe neither of these pictures is correct, and alignment is some other way.
But either way, there is one way alignment is. And whatever way that is, that is the anvil against which we hammer the AI's cognition with loss updates. When choosing a research agenda, you aren't also choosing a background set of alignment dynamics.