evhub comments on Sticky goals: a concrete experiment for understanding deceptive alignment

evhub 7 Sep 2022 9:10 UTC
LW: 2 AF: 2

“I guess I’m arguing that the additional complexity in the deceptive model that allows it to rederive our goals at every forward pass compensates for the additional complexity in the corrigible model that has our goals hard-coded.”

No additional complexity is required to rederive that it should try to be deceptive: it’s a convergent instrumental goal that falls out of any generic optimization process, and essentially all models will need such a process anyway to get good performance on tasks that require optimization.
- RobertKirk 7 Sep 2022 10:44 UTC
  LW: 1 AF: 1
  Even agreeing that no additional complexity is required to rederive that it should try to be deceptive (assuming it has situational awareness of the training process and long-term goals which aren’t aligned with ours), to be deceptive successfully, it then needs to rederive what our goals are, so that it can pursue them instrumentally. I’m arguing that the ability to do this in the AI would require additional complexity compared to an AI that doesn’t need to rederive the content of this goal (that is, our goal) at every decision.
  Alternatively, the aligned model could use the same derivation process to be aligned: the deceptive model has some long-term goal, and in pursuing it rederives the content of the instrumental goal “do ‘what the training process incentivises’”, and the aligned model has the long-term goal “do ‘what the training process incentivises’” (as a pointer/de dicto), and also rederives it with the same level of complexity. I think “do ‘what the training process incentivises’” (as a pointer/de dicto) isn’t a very complex long-term goal, and feels likely to be as complex as the arbitrary crystallised deceptive AI’s internal goal, assuming both models have full situational awareness of the training process and hence such a pointer is possible, which we’re assuming they do.
  (ETA/Meta point: I do think deception is a big issue that we definitely need more understanding of, and I definitely put weight on it being a failure of alignment that occurs in practice, but I think I’m less sure it’ll emerge (or less sure that your analysis demonstrates that). I’m trying to understand where we disagree, and whether you’ve considered the doubts I have and have good arguments against them, rather than convince you that deception isn’t going to happen.)
  - evhub 7 Sep 2022 19:29 UTC
    LW: 2 AF: 2
    
    “Even agreeing that no additional complexity is required to rederive that it should try to be deceptive (assuming it has situational awareness of the training process and long-term goals which aren’t aligned with ours), to be deceptive successfully, it then needs to rederive what our goals are, so that it can pursue them instrumentally. I’m arguing that the ability to do this in the AI would require additional complexity compared to an AI that doesn’t need to rederive the content of this goal (that is, our goal) at every decision.”
    
    My argument is that the only complexity necessary to figure out what our goals are is the knowledge of what our goals are. And that knowledge is an additional complexity that should be present in all models, not just deceptive models, since all models should need to understand those sorts of facts about the world.
    - RobertKirk 8 Sep 2022 8:52 UTC
      LW: 3 AF: 3
      When you say “the knowledge of what our goals are should be present in all models”, by “knowledge of what our goals are” do you mean that a pointer to our goals (given that there are probably multiple goals which are combined in some way) is in the world model? If so, this seems to contradict what you said earlier:
      
      “The deceptive model has to build such a pointer [to the training objective] at runtime, but it doesn’t need to have it hardcoded, whereas the corrigible model needs it to be hardcoded”
      
      I guess I don’t understand what it would mean for the deceptive AI to have the knowledge of what our goals are (in the world model), but for that not to mean it has a hard-coded pointer to what our goals are. I’d imagine that what it means for the world model to capture what our goals are is exactly having such a pointer to them.
      
      (I realise I’ve been failing to do this, but it might make sense to use “AI” when we mean the outer system and “model” when we mean the world model. I don’t think this is the core of the disagreement, but it could make the discussion clearer. For example, when you say the knowledge is present in the model, do you mean the world model or the AI more generally? I assumed the former above.)
      
      To try and run my (probably inaccurate) simulation of you: I imagine you don’t think the above is a contradiction. So you’d think that “knowledge of what our goals are” doesn’t mean a pointer to our goals in all the AI’s world models, but something simpler, which the deceptive AI can use to figure out what our goals are (e.g. in its optimisation process), but which wouldn’t let the aligned AI use a simpler pointer as its objective. Instead, the aligned AI would have to hard-code the full pointer to our goals (where the pointer would be pointing into its world model, and probably using this simpler information about our goals in some way). I’m struggling to imagine what that would look like.
      - RobertKirk 3 Oct 2022 13:06 UTC
        LW: 4 AF: 4
        I’ve now had a conversation with Evan where he’s explained his position here and I now agree with it. Specifically, this is in low path-dependency land, and just considering the simplicity bias. In this case it’s likely that the world model will be very simple indeed (e.g. possibly just some model of physics), and all the hard work is done at run time in activations by the optimisation process. In this setting, Evan argued that the only very simple goal we can encode (that we know of) that would produce behaviour that maximises the training objective is an arbitrary simple objective, combined with the reasoning that deception and power are instrumentally useful for that objective. To encode an objective that would always reliably produce maximisation of the training objective (e.g. internal or corrigible alignment), you’d need to fully encode the training objective.