tailcalled comments on A Certain Formalization of Corrigibility Is VNM-Incoherent

tailcalled 20 Nov 2021 15:47 UTC
1 point
AF
Imagine that policies decompose into two components, $π = ρ \otimes σ$. For instance, $ρ$ and $σ$ may be different sets of parameters in a neural network. We can then talk about the effect of one component by considering how it influences the power/injectivity of the features with respect to the other component.
Suppose, for instance, that $ρ$ is such that the policy just ends up twitching in a completely random way. Technically $σ$ still has a lot of effect, in that it chaotically controls the pattern of the twitching, but in terms of the features $f$, $σ$ is basically constant. This is a low-power situation, and if one actually specified what $f$ is, then a TurnTrout-style argument could probably prove that such values of $ρ$ would be avoided for power-seeking reasons. On the other hand, if $ρ$ made the policy act like an optimizer that optimizes a utility function over the features $f$, with the utility function being specified by $σ$, then that would lead to a lot more power/injectivity.
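To make the contrast concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: a 1-D world, a feature $f$ equal to the final position, $σ$ as an integer, and the names `twitch_actions`, `optimizer_actions`, and `reachable_features`. The twitching here is deterministic and cancels exactly, which is a stand-in for "random twitching", not a model of it. The point is only to count how many distinct feature values $σ$ can reach for each choice of $ρ$.

```python
from typing import List, Set

START, STEPS = 4, 8      # hypothetical world: positions 0..8, agent starts at 4
SIGMAS = range(9)        # hypothetical sigma values 0..8


def twitch_actions(sigma: int) -> List[int]:
    """rho = "twitch": sigma's bits pick the order inside each pair of moves.

    The trajectory depends on sigma, but every pair of moves cancels,
    so the final position never does.
    """
    actions: List[int] = []
    for i in range(STEPS // 2):
        first = 1 if (sigma >> i) & 1 else -1
        actions += [first, -first]
    return actions


def optimizer_actions(sigma: int) -> List[int]:
    """rho = "optimize": sigma is read as a target position to move toward."""
    pos, actions = START, []
    for _ in range(STEPS):
        step = (sigma > pos) - (sigma < pos)  # +1, -1, or 0 once at the target
        actions.append(step)
        pos += step
    return actions


def feature(actions: List[int]) -> int:
    """The feature f: final position after all actions."""
    return START + sum(actions)


def policy(rho: str, sigma: int) -> List[int]:
    """The composite policy rho (x) sigma, returned as an action sequence."""
    return twitch_actions(sigma) if rho == "twitch" else optimizer_actions(sigma)


def reachable_features(rho: str) -> Set[int]:
    """Set of feature values that sigma can reach for a fixed rho."""
    return {feature(policy(rho, s)) for s in SIGMAS}


if __name__ == "__main__":
    for rho in ("twitch", "optimize"):
        print(rho, sorted(reachable_features(rho)))
```

In this toy, every $σ$ gives a different twitching trajectory but the same feature value, while under the optimizer the map from $σ$ to $f$ is injective.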
That said, I wonder if there’s a limit to this style of argument. Too much noninjectivity would require crazy interaction effects to fill out the space in a Hilbert-curve-style way, which would presumably be hard to optimize?
- tailcalled 20 Nov 2021 16:23 UTC
  LW: 1 AF: 1
  Actually, upon thinking further, I don’t think this argument works, at least not as it is currently written.