Pure deception. If the mesa-optimizer stops trying to optimize its own objective in the short term and focuses on cooperating with the selection process entirely then this may result in its objective “crystallizing.” As its objective is largely irrelevant to its outputs now, there is little selection pressure acting to modify this objective. As a result, the objective becomes effectively locked in, excepting random drift and any implicit time complexity or description length penalties.
&
Crystallization of deceptive alignment. Information about the base objective is increasingly incorporated into the mesa-optimizer’s epistemic model without its objective becoming robustly aligned. The mesa-optimizer ends up fully optimizing for the base objective, but only for instrumental reasons, without its mesa-objective getting changed.
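As a concreteness check on the quoted mechanism, here is a minimal sketch of my own (PyTorch; not from the talk): a parameter that no longer influences the model’s outputs receives no gradient, so gradient descent leaves it untouched, i.e. it “crystallises”.

```python
import torch

# Toy sketch of crystallisation (my construction): a parameter with no path
# to the loss gets no gradient, so SGD never modifies it.
behaviour = torch.randn(3, requires_grad=True)  # drives the model's outputs
objective = torch.randn(3, requires_grad=True)  # stands in for the mesa-objective; unused below

x = torch.randn(3)
loss = ((behaviour * x).sum() - 1.0) ** 2       # `objective` never enters the computation

loss.backward()
print(behaviour.grad)  # nonzero: output-relevant parameters feel selection pressure
print(objective.grad)  # None: no gradient reaches the output-irrelevant parameter
```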
These claims are not very intuitive to me.
It’s not clear to me why the selection pressure the base optimiser applies (pressure towards performance on the base objective) would induce crystallisation of the mesa-objective rather than of the base objective, especially if that pressure is applied during the joint optimisation regime.
Is joint optimisation not assumed to precede deceptive alignment?
To highlight my confusion further:
Suppose the mesa-optimiser jointly optimises the base objective and its mesa-objective.
This is suboptimal with respect to the base objective.
Mesa-optimisers that optimised only the base objective would perform better, so we’d expect crystallisation towards optimising only the base objective (a toy sketch of this suboptimality follows below).
Why would the crystallisation that occurs be pure deception instead of internalising the base objective? That’s where I’m lost.
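To pin down the suboptimality claim above, here’s a toy of my own (NumPy; the quadratic losses and the weighting `lam` are assumptions, not anything from the post): gradient descent on the base loss plus a conflicting “mesa” loss ends with strictly higher base loss than descent on the base loss alone.

```python
import numpy as np

# Toy: joint optimisation of a base loss and a conflicting "mesa" loss is
# suboptimal on the base loss, compared with optimising the base loss alone.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_base = rng.normal(size=5)              # defines the base objective
w_mesa = rng.normal(size=5)              # a different, conflicting target
y_base, y_mesa = X @ w_base, X @ w_mesa

def final_base_loss(lam: float) -> float:
    """Gradient descent on base_loss + lam * mesa_loss; returns final base loss."""
    w = np.zeros(5)
    for _ in range(2000):
        grad_base = X.T @ (X @ w - y_base) / len(X)  # proportional to gradient of base MSE
        grad_mesa = X.T @ (X @ w - y_mesa) / len(X)  # proportional to gradient of mesa MSE
        w -= 0.05 * (grad_base + lam * grad_mesa)
    return float(np.mean((X @ w - y_base) ** 2))

print(final_base_loss(lam=0.0))  # pure base optimiser: base loss ~ 0
print(final_base_loss(lam=1.0))  # joint optimiser: strictly higher base loss
```

The pure optimiser drives the base loss to roughly zero, while the joint optimiser converges to a compromise between the two targets.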
I don’t think the answer is necessarily as simple as “mesa-optimisers not fully aligned with the base optimiser are incentivised to be deceptive”.
SGD intervenes on the whole mesa-optimiser (AIUI, essentially all parameters are updated each step), so it can’t readily shield parts of itself from updates; a quick check of this is sketched below.
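A quick sanity check of that parenthetical (standard PyTorch, nothing specific to the post): after one SGD step on a small MLP, every parameter tensor has moved.

```python
import torch

# One SGD step on a tiny MLP: verify that every parameter tensor is updated.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(16, 4), torch.randn(16, 1)
before = [p.detach().clone() for p in model.parameters()]

opt.zero_grad()
torch.nn.functional.mse_loss(model(x), y).backward()
opt.step()

moved = [not torch.equal(b, p) for b, p in zip(before, model.parameters())]
print(moved)  # expect [True, True, True, True]: all four parameter tensors changed
```

(In this net every parameter has a path to the loss; the crystallisation toy above is the degenerate case where one doesn’t.)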
Why would SGD privilege crystallised deception over internalising the base objective?
Or is it undecided which objective is favoured?
I think that step is where you lose me; I don’t follow the argument there very well.
You seem to strongly claim that SGD does in fact privilege crystallisation of deception at that point, but I don’t see it?
[I very well may just be dumb here, a significant fraction of my probability mass is on inferential distance/that’s not how SGD works/I’m missing something basic/other “I’m dumb” explanations.]
If you’re not claiming something to the effect that SGD privileges deceptive alignment, but merely that deceptive alignment is something that can happen, I don’t find it very persuasive/compelling/interesting?
Deceptive alignment already requires highly non-trivial prerequisites:
Strong coherence/goal directedness
Horizons that stretch across training episodes/parameter updates or considerable lengths of time
High situational awareness
Conceptualisation of the base objective
If, when those prerequisites are satisfied, you’re just saying “deceptive alignment is something that can happen” instead of “deceptive alignment is likely to happen”, then I don’t know why I should care?
If deception isn’t selected for, or likely by default, once its prerequisites are satisfied, then I’m not sure why deceptive alignment deserves attention.
Though I do think deceptive alignment would deserve attention if we’re ambivalent between selection for deception and selection for alignment.
My very uninformed prior is that SGD would select more strongly for alignment during the joint optimisation regime?
So I’m leaning towards deception being unlikely by default.
But I’m very much an ML noob, so I could change my mind after learning more.
See this more recent analysis on the likelihood of deceptive alignment.
Oh wow, it’s long.
I can’t consistently focus for more than 10 minutes at a stretch, so where feasible I consume long form information via audio.
I plan to just listen to an AI narration of the post a few times, but since it’s a transcript of a talk, I’d appreciate a link to the original talk if possible.
See here.