Of course, with this model it’s a bit of a mystery why A gave B a reward function that gives 1 per block, instead of one that gives 1 for the first block and a penalty for additional blocks. Basically, why program B with a utility function so seriously out of whack with what you want, when programming a perfectly aligned one would have been easy?
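Roughly, the contrast I have in mind looks like this (a minimal sketch; the penalty size is just an illustrative choice, not something from the post):

```python
# The reward function A actually gave B: 1 per block, so more blocks is always better.
def reward_per_block(n_blocks):
    return n_blocks

# The "obvious" alternative: 1 for the first block, minus a penalty
# (here 1 per extra block, an assumed value) for every additional one.
def reward_first_block_with_penalty(n_blocks, penalty=1.0):
    return min(n_blocks, 1) - penalty * max(n_blocks - 1, 0)
```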
It’s a trade-off. The example is simple enough that the alignment problem is really easy to see, but that also means it is easy to shrug it off and say “duh, just use the obvious correct utility function for B”.
Perhaps you could follow it up with an example with more complex mechanics (and/or a more complex goal for A) where the bad strategy for B is not so obvious. You then invite the reader to contemplate the difficulty of the alignment problem as the complexity approaches that of the real world.
Maybe the easiest way of generalising this is to program B to put 1 block in the hole, but, because B was trained in a noisy environment, it assigns only a 99.9% probability to a block actually being in the hole even when it observes that it is. Then six blocks in the hole gives higher expected utility, and we get the same behaviour.
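Concretely (a toy calculation on my part, assuming B’s utility is 1 whenever at least one block is really in the hole, and each observed block is independently in the hole with probability 0.999):

```python
# Expected utility for B when utility is 1 if at least one block is
# actually in the hole, and each observed block is only 99.9% likely
# to really be there.
P_REALLY_IN = 0.999

def expected_utility(n_blocks_observed):
    p_none_really_in = (1 - P_REALLY_IN) ** n_blocks_observed
    return 1 - p_none_really_in

print(expected_utility(1))  # 0.999
print(expected_utility(6))  # 1 - 1e-18 (prints 1.0 in floating point), strictly higher
```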
That still involves training it with no negative feedback/error term for excess blocks (a penalty which would overwhelm a mere 0.1% uncertainty).
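To put numbers on that (same toy setup as above, with an assumed penalty of 0.01 per extra block, already far larger than the 0.1% uncertainty gain):

```python
# Adding a penalty per extra block to the expected-utility sketch above.
# Any per-extra-block penalty bigger than roughly 0.001 makes one block optimal.
P_REALLY_IN = 0.999
PENALTY_PER_EXTRA = 0.01  # assumed value

def expected_utility(n_blocks_observed):
    p_at_least_one = 1 - (1 - P_REALLY_IN) ** n_blocks_observed
    return p_at_least_one - PENALTY_PER_EXTRA * (n_blocks_observed - 1)

print(expected_utility(1))  # 0.999
print(expected_utility(6))  # ~0.95, now worse than stopping at one block
```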
This is supposed to be a toy model of excessive simplicity. Do you have suggestions for improving it (for purposes of presenting to others)?
Maybe explain how it works while it is being configured, and then stops working once B gets a better model of the situation / runs more trial-and-error trials?
Ok.
I assume the point of the toy model is to explore corrigibility or other mechanisms that are supposed to kick in once A and B turn out not to be perfectly value-aligned; or maybe just to show why a non-value-aligning approach to A controlling B might not work; or maybe specifically to exhibit a case of a not-perfectly-value-aligned agent manipulating its controller.