I can't comment on whether this is an accurate take on MIRI's worldview, since I am not an expert there. I wanted to ask a separate question related to the view described here:
> “With gradient descent, maybe you can learn enough to train your AI for things like “corrigibility” or “not being deceptive”, but really what you’re training for is “Don’t optimise for the goal in ways that violate these particular conditions”.”
On this point, it seems that we create a somewhat arbitrary divide between corrigibility and non-deception on one side and all the AI's other goals on the other.
The AI is trained to minimise some loss function in which non-corrigibility and deception are penalised, so wouldn't it be more accurate to say that the AI actually has a set of goals which includes corrigibility and non-deception?
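To make concrete what I have in mind, here is a toy sketch (entirely my own illustration; the penalty terms and weights are made up and not anyone's actual training setup):

```python
# Toy sketch: if "don't be incorrigible" and "don't deceive" enter training as
# penalty terms, the optimiser only ever sees one combined scalar, so in that
# sense the safety signals are goals alongside the task, not external
# constraints bolted on afterwards. All names and weights here are hypothetical.

def total_loss(task_loss: float,
               noncorrigibility_penalty: float,
               deception_penalty: float,
               corrigibility_weight: float = 1.0,
               honesty_weight: float = 1.0) -> float:
    """The single number that gradient descent minimises each step."""
    return (task_loss
            + corrigibility_weight * noncorrigibility_penalty
            + honesty_weight * deception_penalty)

# The optimiser has no notion of "task vs. safety condition"; trading these
# terms off against each other *is* the optimisation problem.
print(total_loss(task_loss=0.8, noncorrigibility_penalty=0.3, deception_penalty=0.1))
```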
And if that’s the case, I don’t think it’s as fair to say that the AI is trying to circumvent corrigibility and non-deception, so much as it is trying to solve a tough optimisation problem that includes corrigibility, non-deception, and all other goals.
If the above is correct, then I think this is a reason to be more optimistic about the alignment problem: our agent is not actively trying to circumvent our goals, but is instead trying to strike a difficult balance between achieving all of them, including important safety aspects like corrigibility and non-deception.
Now, it is possible that instrumental convergence puts certain training signals (e.g. corrigibility) at odds with certain instrumental goals of agents (e.g. self-preservation). I do believe this is a real problem and poses an alignment risk. But it's not obvious to me that we'll see agents universally ignore their safety training signals in pursuit of instrumental goals.
Sorry it took me a while to get to this.
Intuitively, as a human, you get MUCH better results on thing X if your goal is to do thing X, rather than thing X being applied as a condition on your doing what you actually want. For example, if your goal is to understand the importance of security mindset in order to keep your company from suffering security breaches, you will learn much more than if you are forced to go through mandatory security training. In the latter case, you are probably putting in the bare minimum of effort to pass the course and get back to whatever your actual job is. You are unlikely to learn security this way, and if you had a way to press a button and instantly “pass” the course, you would.
I have in fact made a divide between some things and some other things in my post above. I suppose I would call those things “goals” (the things you really want for their own sake) and “conditions” (the things you need to do for some external reason).
My inner MIRI says—we can only train conditions into the AI, not goals. We have no idea how to put a goal in the AI, and the problem is that if you train a very smart system with conditions only, and it picks up some arbitrary goal along the way, you end up not getting what you wanted. It seems that if we could get the AI to care about corrigibility and non-deception robustly, at the goal level, we would have solved a lot of the problem that MIRI is worried about.