There is a difference between an AI that does X, and an AI that has a goal to do X. Not sure what architectures you’re referring to, but I suspect you may be conflating the goals of the AI’s constructor (or construction process) with the goals of the AI.
It’s almost certainly true that a random program from the set of programs that do X is more dangerous than a random program from some larger set; however, I claim that the fraction of dangerous programs is still pretty small.
Now, there are some obviously dangerous sets of programs, and it’s possible that humans will pick an AI from such a set. In other news, if you shoot yourself in the foot, you get hurt.
I think habryka is thinking about modern machine learning architectures that are studied by AI researchers. AI research is in fact a distinct subfield from programming-in-general, because AI programs are in fact a distinct subgroup from programs-in-general.
I’m very much aware of what architectures machine learning studies, and indeed it (usually) isn’t generic programs in the sense of raw instruction lists (although any other Turing-complete set of programs can be perfectly well called “programs-in-general”—instruction lists are in no way unique).
The problem is that everyone’s favorite architecture—the plain neural network—does not contain a utility function. It is built using a utility/cost function, but that’s very different.
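To make that distinction concrete, here’s a minimal sketch (a toy linear model with made-up data, not any particular architecture): the cost function appears only inside the training loop, and the artifact that comes out is just weights plus a forward pass.

```python
import numpy as np

# Toy illustration (hypothetical setup): the cost function lives in the
# training loop; the trained model is just a function of its weights.

def forward(w, x):
    # The deployed "AI": a pure input -> output mapping, no utility inside.
    return x @ w

def cost(w, xs, ys):
    # The utility/cost function, used by the *builder*, not by the model.
    return np.mean((forward(w, xs) - ys) ** 2)

def train(xs, ys, steps=500, lr=0.1):
    w = np.zeros((xs.shape[1], 1))
    for _ in range(steps):
        # Gradient of cost(w, xs, ys) (mean squared error) w.r.t. the weights.
        grad = 2 * xs.T @ (forward(w, xs) - ys) / len(xs)
        w -= lr * grad
    return w  # only the weights are shipped; the cost function is discarded

xs = np.array([[0.0], [1.0], [2.0], [3.0]])
ys = 2.0 * xs  # made-up target: y = 2x
w = train(xs, ys)
print(forward(w, xs))  # the result predicts; it doesn't "want" anything
```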
This doesn’t make the AI any safer, given that the whole system (neural network + neural network builder¹) is still a misaligned AI. Real-life examples happen all the time.
¹: I don’t know if there is a standard term for this.
I never said anything about (mis)alignment. Of course using stupid training rewards will result in stupid programs. But the problem with those programs is not that they are instrumentally convergent (see the title of my post).
The training program, which has the utility function, does exhibit convergence, but the resulting agent, which has no utilities, does not usually exhibit it. E.g. if the training environment involves a scenario where the agent is turned off (which results in 0 utility), then the training program will certainly build an agent that resists being turned off. But if the training environment does not include that scenario, then the resulting agent is unlikely to resist it.
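A toy illustration of that (everything here is made up: the states, actions, and the brute-force “training program”): the optimizer only selects for behavior in scenarios that actually occur during training, so the shutdown response is only shaped if the shutdown scenario is included.

```python
import itertools

# Hypothetical toy setup: two states, two actions, episodes of ten steps.
STATES = ["working", "shutdown_button_pressed"]
ACTIONS = ["comply", "resist"]

def episode_return(policy, include_shutdown_scenario):
    # Reward: 1 per step of work; complying with shutdown ends the episode.
    total = 0
    for step in range(10):
        state = "working"
        if include_shutdown_scenario and step == 5:
            state = "shutdown_button_pressed"
        if state == "shutdown_button_pressed" and policy[state] == "comply":
            return total  # turned off: no further reward
        total += 1
    return total

def train(include_shutdown_scenario):
    # The "training program": exhaustive search for a return-maximizing policy.
    best_policy, best_return = None, float("-inf")
    for actions in itertools.product(ACTIONS, repeat=len(STATES)):
        policy = dict(zip(STATES, actions))
        ret = episode_return(policy, include_shutdown_scenario)
        if ret > best_return:  # ties keep the first candidate found
            best_policy, best_return = policy, ret
    return best_policy

# Shutdown scenario in training -> the selected policy resists being turned off.
print(train(include_shutdown_scenario=True))
# Scenario absent -> the shutdown action is never selected for; it comes out
# arbitrary (whatever the tie-break happens to pick), not convergently resistant.
print(train(include_shutdown_scenario=False))
```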
The problem is that “the training environment does not include that scenario” is far from guaranteed.
Yes, but what are you arguing against? At no point did I claim that it is impossible for the training program to build a convergent agent (indeed, if you search for an agent that exhibits instrumental convergence, then you might find one). Nor did I ever claim that all agents are safe—I only meant that they are safer than the hypothesis of instrumental convergence would imply.
Also, you worry about the agent learning convergent behaviors by accident, but that’s a little silly when you know that the agent often fails to learn what you want it to learn. E.g. if you explicitly include the scenario of the agent being turned off, and you want the agent to resist it, you know it’s likely that the agent will overfit and resist only in that single scenario. But then, when you don’t intentionally include any such scenario, and don’t want the agent to resist, it seems likely to you that the agent will correctly learn to resist anyway? Yes, it’s strictly possible that you will unintentionally train an agent that robustly resists being turned off. But the odds are in our favour.
Let’s go back to the OP. I’m claiming that not all intelligent agents would exhibit instrumental convergence, and that, in fact, the majority wouldn’t. What part of that exactly do you disagree with? Maybe we actually agree?