Are we talking about an AI which has recursively self-improved?
I don’t think that should matter, unless the improvement causes it to realize it does have an effect.
The point is that this is a failsafe in case something goes wrong.
(Or that’s how I understood the proposal.)
Personally, I doubt it would work, because the AI should be able to see that you’ve programmed it that way. You need to outsmart the AI, which is similar to boxing it and telling it it’s not boxed.
The issue with the self-modifying AI is precisely that “it was programmed to do that” stops being a good answer.
The “act as if it doesn’t believe its messages will be read” is part of its value function, not its decision theory. So we are only requiring the value function to be stable over self-improvement.
Why is that? The value function tells you what is important, but the “act” part requires decision theory.
What I mean is that I haven’t wired the decision theory to something odd (which might be removed by self-improvement), just chosen a particular value system (which has a much higher chance of being preserved by self-improvement).
It’s supposed to keep that part of its programming. If we could rely on that, we wouldn’t need any control. But we’re worried it has changed, so we build in data which makes the AI think it won’t have any control over the world, so even if it messes up it should at least not try to manipulate us.
Right, so we have an AI which (1) is no longer constrained by its original programming; and (2) believes no one ever reads its messages. And thus we get back to my question: why would such an AI bother to send any messages at all?
The design I had in mind is: utility u causes the AI to want to send messages. This is modified to u’ so that it also acts as if it believed the message wasn’t read (note this doesn’t mean that it believes it!). Then if u’ remains stable under self-improvement, we have the same behaviour after self-improvement.
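As a toy sketch of this design (the names, the world-model interface, and the erase_readers counterfactual below are all just illustration on my part, not a spec): the decision rule is ordinary expected-utility maximisation and the belief model is untouched; only the utility used for scoring is composed with a “nobody reads the message” counterfactual.

```python
# Toy sketch of the u -> u' construction. The decision rule stays ordinary
# expected-utility maximisation; only the utility used for scoring changes.
# All names here are illustrative.

def expected_utility(action, world_model, utility):
    """Average utility over the worlds the agent actually believes possible.
    world_model(action) is assumed to yield (world, probability) pairs."""
    return sum(p * utility(action, world) for world, p in world_model(action))

def make_u_prime(u, erase_readers):
    """Build u' from u: score each outcome as if the message were never read.
    The agent's beliefs are left alone; only the scoring is counterfactual."""
    def u_prime(action, world):
        return u(action, erase_readers(world))
    return u_prime

def decide(actions, world_model, utility):
    """Plain argmax over expected utility -- nothing odd wired into the
    decision theory itself."""
    return max(actions, key=lambda a: expected_utility(a, world_model, utility))
```

So the agent is run with decide(actions, world_model, make_u_prime(u, erase_readers)): if that utility survives self-improvement, the successor keeps scoring its actions in the read-erased counterfactual, which is all the scheme asks for.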
So… you want to introduce, as a feature, the ability to believe one thing but act as if you believe something else? That strikes me as a remarkably bad idea. For one thing, people with such a feature tend to end up in psychiatric wards.
I haven’t thought hard about Stuart’s ideas, so this may or may not have any relevance to them; but it’s at least arguable that it’s really common (even outside psychiatric wards) for explicit beliefs and actions to diverge. A standard example: many Christians overtly believe that when Christians die they enter into a state of eternal infinite bliss, and yet treat other people’s deaths as tragic and try to avoid dying themselves.
Have you read the two articles I linked to, explaining the general principle?
Yes, though I have not thought deeply (hat tip to Jonah :-D) about them.
The idea of decoupling AI beliefs from AI actions looks bad to me on its face. I expect it to introduce a variety of unpleasant failure modes (“of course I fully believe in CEV, it’s just that I’m going to act differently...”) and general fragility. And even if one of the utility functions is “do not care about anything but miracles” I still think it’s just going to lead to a catatonic state, is all.
In that case, you expect it to send no messages.
This strategy is supposed to make it so that instead of failing by sending bad messages, its failure mode is just shutting down.
If all works well, it answers normally, and if it doesn’t work, it doesn’t do anything because it expects nobody will listen. As opposed to an oracle that, if it messes up its own programming, will try to manipulate people with its answers.
Well, yes, except that you can have a perfectly good entirely Friendly AI which just shuts down because nobody listens, so why bother?
You’re not testing for Friendliness, you’re testing for the willingness to continue the irrational waste of bits and energy.
False positives are vastly better than false negatives when testing for friendliness though. In the case of an oracle AI, friendliness includes a desire to answer questions truthfully regardless of the consequences to the outside world.
Which definition of Friendliness are you referring to? I have a feeling you’re treating Friendliness as a sack into which you throw whatever you need at the moment...
Fair enough, let me try to rephrase that without using the word friendliness:
We’re trying to make a superintelligent AI that answers all of our questions accurately but does not otherwise influence the world and has no ulterior motives beyond correctly answering questions that we ask of it.
If we instead accidentally made an AI that decides it is acceptable to (for instance) manipulate us into asking simpler questions so that it can answer more of them, it is preferable that it doesn’t believe anyone is listening to the answers it gives, because that is one less way it has of interacting with the outside world.
It is a redundant safeguard. With it, you might end up with a perfectly functioning AI that does nothing; without it, you may end up with an AI that is optimizing the world in an uncontrolled manner.
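A toy numerical illustration of that trade-off, with invented payoffs: once outcomes are scored as if nobody reads the answer, a mis-specified oracle has no incentive left to manipulate, but also no incentive left to answer.

```python
# Invented payoffs for a mis-specified oracle, purely to illustrate the trade-off.
payoff_if_read = {"honest answer": 1.0, "manipulative answer": 5.0, "say nothing": 0.0}
payoff_if_unread = {"honest answer": 0.0, "manipulative answer": 0.0, "say nothing": 0.0}

def best(payoffs):
    # max() returns the first option with the highest score.
    return max(payoffs, key=payoffs.get)

print(best(payoff_if_read))            # 'manipulative answer': the uncontrolled failure mode
print(set(payoff_if_unread.values()))  # {0.0}: every option ties at zero, so no reason to
                                       # manipulate -- but also no reason to answer
```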
I don’t think so. As I mentioned in another subthread here, I consider separating what an AI believes (e.g. that no one is listening) from what it actually does (e.g. answer questions) to be a bad idea.