I think the problem is not quite so binary as “good/bad”. It seems to be more a matter of effective vs. ineffective and beneficial vs. harmful.
The problem is that effective plans are more likely to be harmful. We as a species have already done a lot of optimization along a lot of dimensions that are important to us, and the most highly effective plans almost certainly have greater side effects that make things worse along dimensions we aren’t explicitly telling the optimizer to care about.
It’s not so much that there’s a direct link between sparsity of effective plans and likelihood of bad outcomes, as that more complex problems (especially dealing with the real world) seem more likely to have “spurious” solutions that technically meet all the stated requirements, but aren’t what we actually want. The beneficial effective plans become sparse faster than the harmful effective plans, simply because in a more complex space there are more ways to be unexpectedly harmful than good.
Yes, I think I understand that more powerful optimizers can find more spurious solutions. But the OP seemed to be hypothesizing that you had some way to pick out the spurious from the good solutions, but saying it won’t scale because you have 10^50, not 100, bad solutions for each good one. That’s the part that seems wrong to me.
That part does seem wrong to me. It seems wrong because 10^50 is possibly too small. See my post Seeking Power is Convergently Instrumental in a Broad Class of Environments:

If the agent flips the first bit, it’s locked into a single trajectory. None of its actions matter anymore.
But if the agent flips the second bit – this may be suboptimal for a utility function, but the agent still has lots of choices remaining. In fact, it still can induce (n×n)^(T−1) observation histories. If n=100 and T=50, then that’s (100×100)^49 = 10^196 observation histories. Probably at least one of these yields greater utility than the shutdown-history utility.
And indeed, we can apply the scaling law for instrumental convergence to conclude that for every u-OH, at least 10^196 / (10^196 + 1) of its permuted variants (weakly) prefer flipping the second pixel at t=1, over flipping the first pixel at t=1.
10^196 / (10^196 + 1).
Choose any atom in the universe. Uniformly randomly select another atom in the universe. It’s about 10^117 times more likely that these atoms are the same, than that a utility function incentivizes “dying” instead of flipping pixel 2 at t=1.
(For objectives over the agent’s full observation history, instrumental convergence strength scales exponentially with the complexity of the underlying environment—the environment in question was extremely simple in this case! For different objective classes, the scaling will be linear, but that’s still going to get you far more than 100:1 difficulty, and I don’t think we should privilege such small numbers.)
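To make the counting concrete, here is a minimal sketch of the arithmetic in the quoted example, assuming (as the quote does) an n×n observation grid with n=100 and T=50 timesteps; the variable names are illustrative, not from the post:

```python
# Minimal sketch of the counting in the quoted example.
# Assumed parameters from the quote: an n-by-n observation grid, T timesteps.
n, T = 100, 50

# Observation histories still reachable after flipping the second pixel at t = 1.
histories = (n * n) ** (T - 1)
assert histories == 10 ** 196          # (100 * 100)^49 = 10^196

# Scaling-law lower bound: at least histories / (histories + 1) of the permuted
# utility-function variants (weakly) prefer flipping pixel 2 over pixel 1.
# Equivalently, at most 1 / (histories + 1) of them prefer "dying".
prefers_dying_at_most = 1 / (histories + 1)
print(f"{prefers_dying_at_most:.1e}")  # ~1.0e-196
```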
Your “harmfulness” criteria will always have some false negative rate.
If you incorrectly classify a harmful plan as beneficial one time in a million, then in the former case (10^50 harmful plans per beneficial one) you’ll get 10^44 plans that look good but are really harmful for every one that really is good. In the latter case (100 harmful plans per beneficial one) you get 10000 plans that are actually good for each one that is harmful.
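Spelling that arithmetic out (the 10^50 and 100 ratios are the two cases from the earlier comment; the names below are just for illustration):

```python
from fractions import Fraction

# How a fixed false-negative rate interacts with the harmful-to-beneficial ratio.
false_negative_rate = Fraction(1, 10 ** 6)   # harmful plan passes review 1 time in a million

# Former case: 10^50 harmful plans per beneficial one.
harmful_that_pass = 10 ** 50 * false_negative_rate
print(harmful_that_pass)                     # 10^44 harmful-but-passing plans per good plan

# Latter case: 100 harmful plans per beneficial one.
harmful_that_pass = 100 * false_negative_rate
print(1 / harmful_that_pass)                 # 10^4 good plans per harmful plan that slips through
```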
This would imply a fixed upper bound on the number of bits you can produce (for instance, a false negative rate of 1 in 128 implies at most 7 bits). But in practice you can produce many more than 7 bits, by double checking your answer, combining multiple sources of information, etc.
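A toy reading of that bound, assuming each check acts as an independent filter (which the reply below disputes):

```python
from math import log2

# Bits of selection available from a single screening step with a given error rate.
error_rate = 1 / 128
print(log2(1 / error_rate))        # 7.0 bits from one check

# If k checks were fully independent, their error rates would multiply,
# so the bits would simply add: k * 7 bits from k checks.
k = 10
combined_error = error_rate ** k
print(log2(1 / combined_error))    # 70.0 bits in the optimistic, independent case
```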
Combining multiple sources of information, double checking, etc. are ways to decrease error probability, certainly. The problem is that they’re not independent. For highly complex spaces, not only does the number of additional checks you need increase super-linearly, but the number of types of checks you need likely also increases super-linearly.
That’s my intuition, at least.
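As a toy illustration of the non-independence point (the shared blind-spot probability below is an invented parameter, purely for illustration):

```python
# Toy model: every check shares a common blind spot that passes harmful plans
# with probability `shared_blind_spot`, so stacking checks cannot push the
# combined error below that floor.
per_check_error = 1 / 128      # error unique to each individual check
shared_blind_spot = 1e-6       # failure mode common to all checks

for k in (1, 5, 20):
    independent_part = per_check_error ** k
    combined_error = shared_blind_spot + (1 - shared_blind_spot) * independent_part
    print(k, combined_error)   # flattens out near 1e-6 as k grows
```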
beneficial effective plans become sparse faster than the harmful effective plans
The constants matter more than the trend here for the question of whether a good plan for a pivotal act, one that sorts out AI risk in the medium term, can be found. Discrimination of good plans only has to improve enough to clear the threshold needed to search over plans effective enough to solve that problem.