In section 2.1 of the Indifference paper, the reward function is defined on histories. In section 2 of the corrigibility paper, the utility function is defined over (action1, observation, action2) triples—which is to say, complete histories of the paper’s three-timestep scenario. And section 2 of the interruptibility paper specifies a reward at every timestep.
I think preferences-over-future-states might be a simplification used in thought experiments, not an actual constraint that has limited past corrigibility approaches.
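To make the distinction concrete, here is a minimal sketch of the two kinds of utility function in my own notation (not copied from the papers): one scores the whole trajectory, the other scores only the state the world ends up in.

```latex
% Illustrative notation only; not taken verbatim from the papers.

% Utility over complete histories (the corrigibility paper's three-timestep setup):
% the argument is the whole trajectory (a_1, o, a_2), not just the end state.
U_{\mathrm{hist}} \colon \mathcal{A}_1 \times \mathcal{O} \times \mathcal{A}_2 \to \mathbb{R}

% "Preferences over future states" would instead score only the final world-state:
U_{\mathrm{state}} \colon \mathcal{S}_{\mathrm{final}} \to \mathbb{R}

% The history-based version can, for example, penalize the act of resisting shutdown
% itself, even when two histories end in the same final state; the state-based version
% cannot distinguish them.
```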
Interesting, thanks! Serves me right for not reading the “Indifference” paper!
I think the discussions here and especially here are strong evidence that at least Eliezer & Nate are expecting powerful AGIs to be pure-long-term-consequentialist. (I didn’t ask, I’m just going by what they wrote.) I surmise they have a (correct) picture in their head of how super-powerful a pure-long-term-consequentialist AI can be—e.g. it can self-modify, it can pursue creative instrumental goals, it’s reflectively stable, etc.—but they have not similarly envisioned a partially-but-not-completely-long-term-consequentialist AI that is only modestly less powerful (and in particular can still self-modify, can still pursue creative instrumental goals, and is still reflectively stable). That’s what “My corrigibility proposal sketch” was trying to offer.
I’ll reword to try to describe the situation better, thanks again.