At the same time, better Selection Theorems directly tackle the core conceptual problems of alignment and agency; I expect sufficiently-good Selection Theorems would get us most of the way to solving the hardest parts of alignment.
The former statement makes sense, but can you elaborate on the latter statement? I suppose I could imagine selection theorems revealing that we really do get alignment by default, but I don’t see how they quickly lead to solutions to AI alignment if there is a problem to solve.
The biggest piece (IMO) would be figuring out key properties of human values. If we look at e.g. your sequence on value learning, the main takeaway of the section on ambitious value learning is “we would need more assumptions”. (I would also argue we need different assumptions, because some of the currently standard assumptions are wrong—like utility functions.)
That’s one thing selection theorems offer: a well-grounded basis for new assumptions in ambitious value learning. (And, as an added bonus, directly bringing selection into the picture means we also have an angle for characterizing how much precision to expect from any approximations.) I consider this the current main bottleneck to progress on outer alignment: we don’t even understand what kind-of-thing we’re trying to align AI with.
(Side-note: this is also the main value I think the Natural Abstraction Hypothesis offers: it directly tackles the Pointers Problem, and tells us what the “input variables” for human values are.)
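To make concrete the kind of assumption at stake in ambitious value learning, here is a minimal toy sketch (my own illustration, not something from the discussion above) of the common “human as noisy utility maximizer” model: assume choices follow a Boltzmann distribution over a fixed utility function, then fit that function by maximum likelihood. Everything in the snippet is an assumption of the sort being questioned: the existence of a single utility function over options, the fixed rationality parameter beta, and the particular noise model.

```python
# Toy sketch (illustration only): the standard "utility function + Boltzmann
# rationality" assumption behind ambitious value learning. We observe which of
# several options a human picks, assume P(choice) is proportional to
# exp(beta * u(option)), and fit the utilities by maximum likelihood.

import numpy as np
from scipy.optimize import minimize

options = ["apple", "cake", "salad"]
# Hypothetical observed choices (indices into `options`).
observed_choices = [1, 1, 0, 1, 2, 1, 0, 1]
beta = 1.0  # assumed "rationality" temperature, itself an extra assumption

def neg_log_likelihood(utilities):
    # Boltzmann choice model: softmax over utilities.
    logits = beta * utilities
    logits = logits - logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -sum(log_probs[c] for c in observed_choices)

result = minimize(neg_log_likelihood, x0=np.zeros(len(options)))
# Utilities are only identified up to an additive constant, so center them.
inferred = result.x - result.x.mean()
for name, u in zip(options, inferred):
    print(f"u({name}) ≈ {u:+.2f}")
```

Selection theorems would, ideally, tell us which model of this general kind (if any) is actually justified for humans, and with how much precision.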
Taking a different angle: if we’re concerned about malign inner agents, then selection theorems would potentially offer both (1) tools for characterizing selection pressures under which agents are likely to arise (and what goals/world models those agents are likely to have), and (2) ways to look for inner agents by looking directly at the internals of the trained systems. I consider our inability to do (2) in any robust, generalizable way to be the current main bottleneck to progress on inner alignment: we don’t even understand what kind-of-thing we’re supposed to look for.
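For concreteness on (2), here is the sort of ad-hoc internals inspection that is possible today (my own toy sketch, with synthetic data standing in for a real network’s activations): train a linear probe to detect a property we pick by hand. The gap pointed at above is upstream of this: we lack a principled characterization of which property would correspond to “an inner agent with such-and-such goal”, or any guarantee that such a thing shows up as a decodable feature at all.

```python
# Toy illustration (mine, not from the comment above): current ad-hoc
# internals inspection via a linear probe trained to predict a hand-chosen
# property from hidden activations. The data here is synthetic; in practice
# `activations` would come from a trained network.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, hidden_dim = 500, 64

# Synthetic "activations" and a hand-chosen binary property to probe for.
activations = rng.normal(size=(n_samples, hidden_dim))
property_labels = (activations[:, :3].sum(axis=1) > 0).astype(int)

probe = LogisticRegression(max_iter=1000).fit(activations, property_labels)
print("probe accuracy:", probe.score(activations, property_labels))
```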