Highlighting my confusion more.
Suppose the mesa-optimiser jointly optimises the base objective and its mesa objective.
This is suboptimal with respect to the base objective.
Mesa-optimisers that only optimised the base objective would perform better on it, so we’d expect crystallisation towards optimising only the base objective.
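(To spell out the suboptimality claim with some notation I’m making up here: write $L_{\text{base}}$ for the base loss and $L_{\text{mesa}}$ for the mesa-optimiser’s own objective, and suppose the network’s behaviour effectively minimises a weighted mixture of the two. Then, just by definition of a minimiser,

$$L_{\text{base}}\big(\theta^*_{\text{joint}}\big) \;\ge\; L_{\text{base}}\big(\theta^*_{\text{base}}\big), \quad \text{where } \theta^*_{\text{joint}} \in \arg\min_\theta \big[L_{\text{base}}(\theta) + \lambda L_{\text{mesa}}(\theta)\big],\; \theta^*_{\text{base}} \in \arg\min_\theta L_{\text{base}}(\theta),\; \lambda > 0,$$

i.e. splitting optimisation effort between the two objectives can only do weakly worse on the base loss than optimising the base objective alone.)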
Why would the crystallisation that occurs be pure deception rather than internalisation of the base objective? That’s where I’m lost.
I don’t think the answer is necessarily as simple as “mesa-optimisers not fully aligned with the base optimiser are incentivised to be deceptive”.
SGD intervenes on the whole mesa-optimiser (as I understand it, essentially all parameters get updated), so it can’t easily shield parts of itself from updates.
Why would SGD privilege crystallised deception over internalising the base objective?
Or is it underdetermined which objective is favoured?
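To make the “all parameters are updated” point concrete, here’s a toy sketch (my own illustration, not from the post, assuming a standard PyTorch-style training step): after a single backward pass on the base loss, every parameter tensor in the model receives a gradient and gets moved by the optimiser, so there’s no obvious mechanism by which a sub-circuit could shield itself from base-objective updates.

```python
# Toy sketch (my illustration, not from the post): one SGD step on a base loss
# produces gradients for every parameter the loss depends on.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(16, 4), torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), y)  # stands in for the base loss

opt.zero_grad()
loss.backward()   # gradients flow to every parameter, not a privileged subset
opt.step()        # and every parameter is nudged accordingly

for name, p in model.named_parameters():
    print(name, "grad norm:", p.grad.norm().item())
```

(Of course, gradients being nonzero everywhere doesn’t by itself settle which basin SGD ends up in; it’s just the narrow claim that nothing is structurally protected from updates.)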
I think that step is where you lose me, i.e. where I don’t follow the argument very well.
You seem to strongly claim that SGD does in fact privilege crystallisation of deception at that point, but I don’t see it?
[I may very well just be dumb here; a significant fraction of my probability mass is on inferential distance, “that’s not how SGD works”, my missing something basic, or other “I’m dumb” explanations.]
If you’re not claiming something to the effect that SGD privileges deceptive alignment, but merely that deceptive alignment is something that can happen, I don’t find it very persuasive/compelling/interesting?
Deceptive alignment already requires highly non-trivial prerequisites:
Strong coherence/goal directedness
Horizons that stretch across training episodes/parameter updates or considerable lengths of time
High situational awareness
Conceptualisation of the base objective
If, when those prerequisites are satisfied, you’re just saying “deceptive alignment is something that can happen” rather than “deceptive alignment is likely to happen”, then I don’t know why I should care?
If deception isn’t selected for, or at least likely by default, once its prerequisites are satisfied, then I’m not sure why deceptive alignment deserves attention.
Though I do think deceptive alignment would deserve attention if we’re ambivalent between selection for deception and selection for alignment.
My very uninformed prior is that SGD would select more strongly for alignment during the joint-optimisation regime?
So I’m leaning towards deception being unlikely by default.
But I’m very much an ML noob, so I could change my mind after learning more.
See this more recent analysis on the likelihood of deceptive alignment.
Oh wow, it’s long.
I can’t consistently focus for more than 10 minutes at a stretch, so where feasible I consume long-form information via audio.
I plan to just listen to an AI narration of the post a few times, but since it’s a transcript of a talk, I’d appreciate a link to the original talk if possible.
See here.