Yes, I think I understand that more powerful optimizers can find more spurious solutions. But the OP seemed to be hypothesizing that you had some way to pick out the spurious from the good solutions, but saying it won’t scale because you have 10^50, not 100, bad solutions for each good one. That’s the part that seems wrong to me.

That part does seem wrong to me. It seems wrong because 10^50 is possibly too small. See my post Seeking Power is Convergently Instrumental in a Broad Class of Environments:
If the agent flips the first pixel, it’s locked into a single trajectory. None of its actions matter anymore.
But if the agent flips the second pixel – this may be suboptimal for a given utility function, but the agent still has lots of choices remaining. In fact, it can still induce (n×n)^(T−1) observation histories. If n=100 and T=50, then that’s (100×100)^49 = 10^196 observation histories. Probably at least one of these yields greater utility than the shutdown-history utility.
And indeed, we can apply the scaling law for instrumental convergence to conclude that for every u-OH, at least 10^196/(10^196 + 1) of its permuted variants (weakly) prefer flipping the second pixel at t=1, over flipping the first pixel at t=1.
10^196/(10^196 + 1).
Choose any atom in the universe. Uniformly randomly select another atom in the universe. It’s about 10^117 times more likely that these atoms are the same, than that a utility function incentivizes “dying” instead of flipping pixel 2 at t=1.
(For objectives over the agent’s full observation history, instrumental convergence strength scales exponentially with the complexity of the underlying environment—the environment in question was extremely simple in this case! For different objective classes, the scaling will be linear, but that’s still going to get you far more than 100:1 difficulty, and I don’t think we should privilege such small numbers.)
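A quick sanity check of the numbers in that excerpt; this is only a sketch, assuming the illustrative n=100, T=50 from above and a rough 10^79 atom count for the observable universe (estimates vary by an order of magnitude, which is why the figure is "about" 10^117):

```python
import math

# Illustrative parameters from the excerpt above (not from a real environment).
n, T = 100, 50

# Observation histories still reachable after flipping the second pixel: (n*n)^(T-1).
histories = (n * n) ** (T - 1)
print(f"histories = 10^{math.log10(histories):.0f}")        # histories = 10^196

# At most 1/(histories + 1) of the permuted variants prefer the "dying" action.
p_dying = 1 / (histories + 1)                               # ~1e-196

# Probability of re-picking one specific atom, assuming ~10^79 atoms in the
# observable universe (a rough, commonly quoted estimate).
p_same_atom = 1e-79

print(f"ratio ~= 10^{math.log10(p_same_atom / p_dying):.0f}")   # ratio ~= 10^117
```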
Your “harmfulness” criteria will always have some false negative rate.
If you incorrectly classify a harmful plan as beneficial one time in a million, then in the former case (10^50 bad plans per good one) you’ll get 10^44 plans that look good but are really harmful for every one that really is good. In the latter case (100 bad plans per good one) you get 10,000 plans that are actually good for each one that is harmful.
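Spelling out that arithmetic; a sketch assuming the one-in-a-million false-negative rate above and that genuinely good plans always pass the filter:

```python
# False-negative rate: harmful plans mistakenly passed as beneficial.
fnr = 1e-6

for bad_per_good in (1e50, 1e2):
    # Harmful plans that survive the filter, per genuinely good plan.
    harmful_that_pass = bad_per_good * fnr
    print(f"{bad_per_good:.0e} bad per good -> {harmful_that_pass:.0e} harmful survivors per good plan")

# 1e+50 bad per good -> 1e+44 harmful survivors per good plan
# 1e+02 bad per good -> 1e-04 harmful survivors per good plan (i.e. ~10,000 good per harmful)
```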
This would imply a fixed upper bound on the number of bits you can produce (for instance, a false negative rate of 1 in 128 implies at most 7 bits). But in practice you can produce many more than 7 bits, by double checking your answer, combining multiple sources of information, etc.
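For concreteness, here is the bits calculation, plus what stacking checks would buy under the (optimistic) assumption that the checks are fully independent; the 1-in-128 rate is the hypothetical figure from the comment above:

```python
import math

# A false-negative rate of 1/128 = 2^-7 means a single check cuts the harmful pool
# by at most a factor of 2^7, i.e. about 7 bits of selection.
fnr = 1 / 128
bits_per_check = -math.log2(fnr)
print(bits_per_check)                               # 7.0

# If k checks were fully independent, their miss rates would multiply,
# so the bits would simply add -- the "many more than 7 bits" claim.
k = 5
print(k * bits_per_check, -math.log2(fnr ** k))     # 35.0  35.0
```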
Combining multiple sources of information, double checking, etc. are ways to decrease error probability, certainly. The problem is that they’re not independent. For highly complex spaces, not only does the number of additional checks you need increase super-linearly, but the number of types of checks you need likely also increases super-linearly.
That’s my intuition, at least.
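A toy numerical sketch of that non-independence point, using made-up rates purely for illustration: if even a small fraction of harmful plans sit in a blind spot shared by every check, the combined miss rate stops falling exponentially, so extra checks of the same type buy far fewer bits than the independent calculation suggests.

```python
# Illustrative numbers (my own, not from the thread).
p_blind = 1e-3          # fraction of harmful plans in a blind spot shared by all checks
miss_blind = 0.9        # per-check miss rate on blind-spot plans
miss_other = 1e-4       # per-check miss rate on ordinary harmful plans

# Overall per-check miss rate is ~1e-3 either way.
single = p_blind * miss_blind + (1 - p_blind) * miss_other

for k in (1, 2, 4, 8):
    independent = single ** k                                       # what independence would predict
    correlated = p_blind * miss_blind**k + (1 - p_blind) * miss_other**k
    print(f"k={k}: independent {independent:.1e}  vs  correlated {correlated:.1e}")

# The correlated miss rate decays roughly like 0.9^k (dominated by the shared blind spot),
# not like (1e-3)^k, so each additional check of the same type helps less and less.
```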