I think recursive mesa-optimization doesn’t mean that aligning mesa-optimizers is hopeless. It does imply that solutions to inner alignment should be “hereditary.” The model should have a “spawn a mesa-optimizer” technique which aligns the mesa-optimizer with the base objective and teaches the new submodel about the technique.
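To make the “hereditary” requirement concrete, here is a minimal toy sketch. It is only an illustration of the invariant, not a claim about how a real learned mesa-optimizer would be implemented, and all names (`MesaOptimizer`, `base_objective`, `spawn_mesa_optimizer`) are hypothetical: the spawn procedure both aligns the child with the base objective and passes the spawn procedure itself down to the child.

```python
# Toy illustration of "hereditary" inner alignment (all names hypothetical).
# The invariant: every optimizer spawned, at any depth, (1) inherits the base
# objective and (2) inherits the aligned-spawning procedure itself.

class MesaOptimizer:
    def __init__(self, base_objective, depth=0):
        self.base_objective = base_objective  # (1) aligned with the base objective
        self.depth = depth

    def spawn_mesa_optimizer(self):
        # (2) the child is created by the same procedure, so it too will spawn
        # aligned children; alignment is passed down recursively ("hereditary").
        return MesaOptimizer(self.base_objective, depth=self.depth + 1)


root = MesaOptimizer(base_objective="the base objective")
grandchild = root.spawn_mesa_optimizer().spawn_mesa_optimizer()
assert grandchild.base_objective == root.base_objective  # inherited at every level
```

Of course, the hard part is that in an actual trained model the spawn procedure is not a handwritten class but an emergent behavior, and we would need some way to verify that the model actually follows it.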
While I think this is a possible route to take, I don’t think we currently have a good enough understanding of how to control/align mesa-optimizers to actually do this.
I worry a bit that even if we correctly align a mesa-optimizer (make its mesa-objective aligned with the base objective), its actual optimization process might be faulty/misaligned, so it could mistakenly spawn a misaligned optimizer. I think this failure is most likely to happen sometime in the middle of training, when the mesa-optimizer is capable enough to create another optimizer but not yet capable enough to control it.
But if we develop a better theory of optimization and agency, I could see it being safe to allow mesa-optimizers to create new optimizers. I currently think we are nowhere near that stage, especially considering that we don’t seem to have good definitions of either optimization or agency.
I agree with you that “hereditary mesa-alignment” is hard. I just don’t think that “avoid all mesa-optimization” is a priori much easier.