If you’re not claiming something to the effect that SGD privileges deceptive alignment, but merely that deceptive alignment is something that can happen, I don’t find it very persuasive/compelling/interesting?
Deceptive alignment already requires highly non-trivial prerequisites:
Strong coherence/goal directedness
Horizons that stretch across training episodes/parameter updates or considerable lengths of time
High situational awareness
Conceptualisation of the base objective
If, when those prerequisites are satisfied, you’re just saying “deceptive alignment is something that can happen” instead of “deceptive alignment is likely to happen”, then I don’t know why I should care?
If deception isn’t selected for, or likely by default, once its prerequisites are satisfied, then I’m not sure why deceptive alignment is something that deserves attention.
Though I do think deceptive alignment would deserve attention if we’re ambivalent between selection for deception and selection for alignment.
My very uninformed prior is that SGD would select more strongly for alignment during the joint optimisation regime?
So I’m leaning towards deception being unlikely by default.
But I’m very much an ML noob, so I could change my mind after learning more.
See this more recent analysis on the likelihood of deceptive alignment.
Oh wow, it’s long.
I can’t consistently focus for more than 10 minutes at a stretch, so where feasible I consume long-form information via audio.
I plan to just listen to an AI narration of the post a few times, but since it’s a transcript of a talk, I’d appreciate a link to the original talk if possible.
See here.