I recently updated how I view the alignment problem. The post that caused my update is this one from the shard sequence. Also worth mentioning is an older post that points to the same thing, which I just happened to read later.
Basically, I used to think we needed to solve outer and inner alignment separately. Now I no longer think this is a good decomposition of the problem.
It’s not obvious that alignment must factor in the way described above. There is room for trying to set up training in such a way to guarantee a friendly mesa-objective somehow without matching it to a friendly base-objective. That is: to align the AI directly to its human operator, instead of aligning the AI to the reward, and the reward to the human.
(Quote from here.)