Highlighting my confusion more.
Suppose the mesa-optimiser jointly optimises the base objective and its mesa objective.
This is suboptimal with respect to the base objective.
Mesa-optimisers that only optimised the base objective would perform better on it, so we’d expect crystallisation towards optimising only the base objective.
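(To spell out the suboptimality claim with some notation I’m making up here: write $L_{\text{base}}$ for the base loss and $L_{\text{mesa}}$ for the mesa-optimiser’s own objective, and suppose the network’s behaviour effectively minimises a weighted mixture of the two. Then, just by definition of a minimiser,

$$L_{\text{base}}\big(\theta^*_{\text{joint}}\big) \;\ge\; L_{\text{base}}\big(\theta^*_{\text{base}}\big), \quad \text{where } \theta^*_{\text{joint}} \in \arg\min_\theta \big[L_{\text{base}}(\theta) + \lambda L_{\text{mesa}}(\theta)\big],\; \theta^*_{\text{base}} \in \arg\min_\theta L_{\text{base}}(\theta),\; \lambda > 0,$$

i.e. splitting optimisation effort between the two objectives can only do weakly worse on the base loss than optimising the base objective alone.)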
Why would the crystallisation that occurs be pure deception rather than internalisation of the base objective? That’s where I’m lost.
I don’t think the answer is necessarily as simple as “mesa-optimisers not fully aligned with the base optimiser are incentivised to be deceptive”.
SGD intervenes on the whole mesa-optimiser (as I understand it, essentially all parameters get updated), so it can’t easily shield parts of itself from updates.
Why would SGD privilege crystallised deception over internalising the base objective?
Or is it underdetermined which objective is favoured?
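To make the “all parameters are updated” point concrete, here’s a toy sketch (my own illustration, not from the post, assuming a standard PyTorch-style training step): after a single backward pass on the base loss, every parameter tensor in the model receives a gradient and gets moved by the optimiser, so there’s no obvious mechanism by which a sub-circuit could shield itself from base-objective updates.

```python
# Toy sketch (my illustration, not from the post): one SGD step on a base loss
# produces gradients for every parameter the loss depends on.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(16, 4), torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), y)  # stands in for the base loss

opt.zero_grad()
loss.backward()   # gradients flow to every parameter, not a privileged subset
opt.step()        # and every parameter is nudged accordingly

for name, p in model.named_parameters():
    print(name, "grad norm:", p.grad.norm().item())
```

(Of course, gradients being nonzero everywhere doesn’t by itself settle which basin SGD ends up in; it’s just the narrow claim that nothing is structurally protected from updates.)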
I think that step is where you lose me, i.e. where I don’t follow the argument very well.
You seem to strongly claim that SGD does in fact privilege crystallisation of deception at that point, but I don’t see it?
[I may very well just be dumb here; a significant fraction of my probability mass is on inferential distance, “that’s not how SGD works”, my missing something basic, or other “I’m dumb” explanations.]
If you’re not claiming something to the effect that SGD privileges deceptive alignment, but merely that deceptive alignment is something that can happen, I don’t find it very persuasive/compelling/interesting?
Deceptive alignment already requires highly non-trivial prerequisites:
Strong coherence/goal directedness
Horizons that stretch across training episodes/parameter updates or considerable lengths of time
High situational awareness
Conceptualisation of the base objective
If, when those prerequisites are satisfied, you’re just saying “deceptive alignment is something that can happen” rather than “deceptive alignment is likely to happen”, then I don’t know why I should care?
If deception isn’t selected for, or at least likely by default, once its prerequisites are satisfied, then I’m not sure why deceptive alignment deserves attention.
Though I do think deceptive alignment would deserve attention if we’re ambivalent between selection for deception and selection for alignment.
My very uninformed prior is that SGD would select more strongly for alignment during the joint-optimisation regime?
So I’m leaning towards deception being unlikely by default.
But I’m very much an ML noob, so I could change my mind after learning more.
See this more recent analysis on the likelihood of deceptive alignment.
Oh wow, it’s long.
I can’t consistently focus for more than 10 minutes at a stretch, so where feasible I consume long-form information via audio.
I plan to just listen to an AI narration of the post a few times, but since it’s a transcript of a talk, I’d appreciate a link to the original talk if possible.
See here.