I don’t have a good term for that, unfortunately—if you’re trying to build an aligned AI, “human values” could be the right term, though in most cases you really just want “move one strawberry onto a plate without killing everyone,” which is quite a lot less than “optimize for all human values.” I could see how meta-objective might make sense if you’re thinking about the human as an outside optimizer acting on the system, though I would shy away from using that term like that, as anyone familiar with meta-learning will assume you mean the objective of a meta-learner instead.
Also, the motivation for choosing outer alignment as the alignment problem between the base objective and the goals of the programmers was to capture the “classical” alignment problem as it has sometimes previously been envisioned, wherein you just need to specify an aligned set of goals and then you’re good. As we argue, though, mesa-optimization means that you need more than just outer alignment—if you have mesa-optimizers, you also need inner alignment, as even if your base objective is perfectly aligned, the resulting mesa-objective (and thus the resulting behavioral objective) might not be.
Got it, that’s helpful. Thank you!