When you say “the knowledge of what our goals are should be present in all models”, by “knowledge of what our goals are” do you mean a pointer to our goals (given that there are probably multiple goals which are combined in some way) is in the world model? If so, this seems to contradict what you said earlier:
The deceptive model has to build such a pointer [to the training objective] at runtime, but it doesn’t need to have it hardcoded, whereas the corrigible model needs it to be hardcoded
I guess I don’t understand what it would mean for the deceptive AI to have the knowledge of what our goals are (in the world model) without that amounting to a hard-coded pointer to our goals. I’d imagine that what it means for the world model to capture what our goals are is exactly for it to contain such a pointer.
(I realise I’ve been failing to do this, but it might make sense to use “AI” when we mean the outer system and “model” when we mean the world model. I don’t think this is the core of the disagreement, but it could make the discussion clearer. For example, when you say the knowledge is present in the model, do you mean the world model or the AI more generally? I assumed the former above.)
To try and run my (probably inaccurate) simulation of you: I imagine you don’t think the above is a contradiction. So you’d think that “knowledge of what our goals are” doesn’t mean a pointer to our goals in the AI’s world model, but something simpler, which the deceptive AI can use to figure out what our goals are (e.g. in its optimisation process), but which wouldn’t let the aligned AI use a simpler pointer as its objective; instead, the aligned AI would have to hard-code the full pointer to our goals (where the pointer points into its world model, probably making use of this simpler information about our goals in some way). I’m struggling to imagine what that would look like.
I’ve now had a conversation with Evan where he’s explained his position here, and I now agree with it. Specifically, this is in low path-dependence land, considering just the simplicity bias. In this case it’s likely that the world model will be very simple indeed (e.g. possibly just some model of physics), with all the hard work done at runtime, in activations, by the optimisation process. In this setting, Evan argued that the only very simple goal we know of that would produce behaviour maximising the training objective is an arbitrary simple objective, combined with the reasoning that deception and power are instrumentally useful for that objective. To encode an objective that would reliably produce maximisation of the training objective in all cases (e.g. internal or corrigible alignment), you’d need to fully encode the training objective.