I think recursive mesa-optimization doesn’t mean that aligning mesa-optimizers is hopeless. It does imply that solutions to inner alignment should be “hereditary.” The model should have a “spawn a mesa-optimizer” technique which aligns the mesa-optimizer with the base objective and teaches the new submodel about the technique.
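To make the “hereditary” requirement concrete, here is a minimal toy sketch. It is only an illustration of the invariant, not a claim about how a real learned mesa-optimizer would be implemented, and all names (`MesaOptimizer`, `base_objective`, `spawn_mesa_optimizer`) are hypothetical: the spawn procedure both aligns the child with the base objective and passes the spawn procedure itself down to the child.

```python
# Toy illustration of "hereditary" inner alignment (all names hypothetical).
# The invariant: every optimizer spawned, at any depth, (1) inherits the base
# objective and (2) inherits the aligned-spawning procedure itself.

class MesaOptimizer:
    def __init__(self, base_objective, depth=0):
        self.base_objective = base_objective  # (1) aligned with the base objective
        self.depth = depth

    def spawn_mesa_optimizer(self):
        # (2) the child is created by the same procedure, so it too will spawn
        # aligned children; alignment is passed down recursively ("hereditary").
        return MesaOptimizer(self.base_objective, depth=self.depth + 1)


root = MesaOptimizer(base_objective="the base objective")
grandchild = root.spawn_mesa_optimizer().spawn_mesa_optimizer()
assert grandchild.base_objective == root.base_objective  # inherited at every level
```

Of course, the hard part is that in an actual trained model the spawn procedure is not a handwritten class but an emergent behavior, and we would need some way to verify that the model actually follows it.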
While I think this is a possible route to take, I don’t think we currently have a good enough understanding of how to control/align mesa-optimizers to actually do this.
I worry a bit that even if we correctly align a mesa-optimizer (make its mesa-objective aligned with the base objective), its actual optimization process might be faulty/misaligned, so it could mistakenly spawn a misaligned optimizer. I think this failure is most likely to happen sometime in the middle of training, when the mesa-optimizer is capable enough to create another optimizer but not yet capable enough to control it.
But if we develop a better theory of optimization and agency, I could see it being safe to allow mesa-optimizers to create new optimizers. I currently think we are nowhere near that stage, especially considering that we don’t seem to have good definitions of either optimization or agency.
I agree with you that “hereditary mesa-alignment” is hard. I just don’t think that “avoid all mesa-optimization” is a priori much easier.