I’ve recently put out work on changing and influenceable reward functions that is highly relevant to the questions you discuss here: I see it as a formalization of some of the ways in which humans are not self-aligned (their preferences and reward feedback change, and can be influenced by AI systems), a discussion of how current alignment techniques fail in this setting, and an argument that any alignment technique may run into challenges when dealing with these problems.
I think the idea of trying to align to meta-preferences has some promise of getting “most of the way” there (although it eventually seems to run into the same conceptual limitations as aligning to object-level preferences). I personally see it as more viable than a “long reflection” or safely operationalizing CEV.
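To give a rough sense of the formalization (the notation here is schematic, not the paper’s exact setup): the basic move is to let the parameters of the reward function evolve as part of the environment dynamics, so that the AI’s own actions can influence them,

$$
\theta_{t+1} \sim T_\theta(\cdot \mid s_t, a_t, \theta_t), \qquad r_t = r(s_t, a_t;\, \theta_t).
$$

Most current alignment techniques implicitly treat $\theta$ as fixed (or at least as not influenceable by the AI), which is roughly where the failure modes discussed in the paper come from.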