Jay Bailey comments on Deep Deceptiveness

Jay Bailey 26 Mar 2023 7:14 UTC
3 points
0
I think this post, more than anything else, has helped me understand the set of things MIRI is getting at. (Though, to be fair, I’ve also been going through the 2021 MIRI dialogues, so perhaps that’s been building some understanding under the surface.)
After a few days’ reflection on this post, and a couple of weeks after reading the first part of the dialogues, this is my current understanding of the model:
In our world, there are certain broadly useful patterns of thought that reliably achieve outcomes. The biggest one here is “optimisation”. We can think of this as aiming towards a goal and steering towards it—moving the world closer to what you want it to be. These aren’t things we train for—we don’t even know how to train for them, or against them. They’re just the way the world works. If you want to build a power plant, you need some way of getting energy to turn into electricity. If you want to achieve a task, you need some way of selecting a target and then navigating towards it, whether it be walking across the room to grab a cookie, or creating a Dyson sphere.
With gradient descent, maybe you can learn enough to train your AI for things like “corrigibility” or “not being deceptive”, but really what you’re training for is “Don’t optimise for the goal in ways that violate these particular conditions”. This does not stop it from being an optimisation problem. The AI will then naturally, with no prompting, attempt to find the best path that gets around these limitations. This probably means you end up with a path that gets the AI the things it wanted from the useful properties of deception or non-corrigibility while obeying the strict letter of the law. (After all, if deception / non-corrigibility weren’t useful, if it didn’t help achieve something, you would not spontaneously get an agent that did this without being trained to do so.) Again, this is an entirely natural thing. The shortest path between two points is a straight line. If you add an obstacle in the way, the shortest path between those two points is now to skirt arbitrarily close to the obstacle. No malice is required, any more than you are maliciously circumventing architects when you walk close to (but not into!) walls.
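The wall analogy can be made concrete with a small sketch. Everything in it is a hypothetical setup chosen for illustration: a start and goal on either side of a wall at x = 0, a path made of two straight segments, and a plain grid search standing in for "optimisation". It is not a model of gradient descent or of any real training process; it only shows that the best allowed path sits exactly on the edge of the constraint.

```python
import math

# Hypothetical setup: start and goal on opposite sides of a wall at x = 0.
START = (-2.0, 0.0)
GOAL = (2.0, 0.0)


def path_length(y):
    """Length of the two-segment path START -> (0, y) -> GOAL."""
    crossing = (0.0, y)
    return math.dist(START, crossing) + math.dist(crossing, GOAL)


def best_crossing(wall_half_height, steps_per_unit=100, y_max=3):
    """Grid-search the shortest path whose crossing point clears the wall.

    The wall blocks any crossing with |y| < wall_half_height. By symmetry
    only y >= 0 is searched.
    """
    candidates = [i / steps_per_unit for i in range(y_max * steps_per_unit + 1)]
    allowed = [y for y in candidates if y >= wall_half_height]
    return min(allowed, key=path_length)


if __name__ == "__main__":
    for half_height in (0.0, 0.5, 1.0):
        y = best_crossing(half_height)
        print(f"wall half-height {half_height}: cross at y={y}, "
              f"length {path_length(y):.3f}")
```

With no wall the search picks the straight line. With a wall it picks the crossing point that only just clears it, because every other allowed path is longer. Nothing in the search refers to the wall except the filter on allowed candidates.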
Basically—if the output of an optimisation process is dangerous, it doesn’t STOP being dangerous by changing it into a slightly different optimisation process of “Achieve X thing (which is dangerous) without doing Y (which is supposed to trigger on dangerous things)”. You just end up getting X through Y’ instead, as long as you’re still enacting the basic pattern—which you will be, because an AI that can’t optimise things can’t do anything at all. If you couldn’t apply a general optimisation process, you’d be unable to walk across the room and get a cookie, let alone do all the things you do in your day-to-day life. Same with the AI.
I’d be interested in whether someone who understands MIRI’s worldview decently well thinks I’ve gotten this right. I’m holding off on trying to figure out what I think about that worldview for now—I’m still in the understanding phase.
- Soroush Pour 29 Mar 2023 4:23 UTC
  2 points
  1
  No comment on this being an accurate take on MIRI’s worldview or not, since I am not an expert there. I wanted to ask a separate question related to the view described here:
  
  > “With gradient descent, maybe you can learn enough to train your AI for things like “corrigibility” or “not being deceptive”, but really what you’re training for is “Don’t optimise for the goal in ways that violate these particular conditions”.”
  
  On this point, it seems that we create a somewhat arbitrary divide between corrigibility & deception on one side and all other goals of the AI on the other.
  
  The AI is trained to minimise some loss function under which non-corrigibility and deception are penalised, so wouldn’t it be more accurate to say the AI actually has a set of goals which includes corrigibility and non-deception?
  
  And if that’s the case, I don’t think it’s as fair to say that the AI is trying to circumvent corrigibility and non-deception, so much as it is trying to solve a tough optimisation problem that includes corrigibility, non-deception, and all other goals.
  
  If the above is correct, then I think this is a reason to be more optimistic about the alignment problem—our agent is not trying to actively circumvent our goals, but instead trying to strike a hard balance of achieving all of them including important safety aspects like corrigibility and non-deception.
  Now, it is possible that instrumental convergence puts certain training signals (e.g. corrigibility) at odds with certain instrumental goals of agents (e.g. self-preservation). I do believe this is a real problem and poses alignment risk. But it’s not obvious to me that we’ll see agents universally ignore their safety training signals in pursuit of instrumental goals.
  - Jay Bailey 3 Apr 2023 8:42 UTC
    1 point
    0
    Sorry it took me a while to get to this.
    
    Intuitively, as a human, you get MUCH better results on a thing X if your goal is to do thing X, rather than thing X being imposed as a condition for you to do what you actually want. For example, if your goal is to understand the importance of security mindset in order to avoid your company suffering security breaches, you will learn much more than if you are forced to go through mandatory security training. In the latter case, you are probably putting in the bare minimum of effort to pass the course and go back to whatever your actual job is. You are unlikely to learn security this way, and if you had a way to press a button and instantly “pass” the course, you would.
    
    I have in fact drawn a divide between two kinds of things in my post above. I suppose I would call them “goals” (the things you really want for their own sake) and “conditions” (the things you need to do for some external reason).
    
    My inner MIRI says—we can only train conditions into the AI, not goals. We have no idea how to put a goal in the AI, and the problem is that if you train a very smart system with conditions only, and it picks up some arbitrary goal along the way, you end up not getting what you wanted. It seems that if we could get the AI to care about corrigibility and non-deception robustly, at the goal level, we would have solved a lot of the problem that MIRI is worried about.