Ok, putting my [maybe I’m missing the point] hat on, it strikes me that the above is considering the learned steering system—which is the outcome of any misalignment. So I probably am missing your point there (I think?). Oops.
However, I still think I’d stick to saying that:
The [objective encoded by the steering system] is not [maximisation of the score assigned by the steering system], but rather [whatever behaviour the steering system tends to produce]
But here I’d need to invoke properties of the original steering system (ignoring the handwaviness of what that means for now), rather than the learned steering system.
I think what matters at that point is sampling of trajectories (perhaps not only this, but at least this). There’s no mechanism in humans to sample in such a way that we’d expect maximisation of reward to be learned in the limit. Nor would we expect one, since evolution doesn’t ‘care’ about reward maximisation.
Absent such a sampling mechanism, the objective encoded isn’t likely to be maximisation of the reward.
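To make the sampling point concrete, here’s a toy sketch (my own illustration, standard two-armed-bandit stuff rather than anything from the original discussion): whether the learned greedy policy ends up reward-maximising in the limit depends on the sampling mechanism, not just on the reward signal being there.

```python
# Toy illustration (an assumption-laden sketch, not a model of humans):
# a two-armed bandit with a value estimate per arm. With sufficient
# exploration (epsilon-greedy), the greedy choice converges on the
# reward-maximising arm in the limit; with purely greedy sampling it can
# lock onto whichever arm it tried first, so the behaviour it settles into
# need not be reward maximisation, even though reward drives every update.
import random

TRUE_MEANS = [0.4, 0.6]  # arm 1 is better in expectation

def run(epsilon, steps=20_000, seed=0):
    rng = random.Random(seed)
    q = [0.0, 0.0]  # value estimates per arm
    n = [0, 0]      # pull counts per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(2)                    # explore
        else:
            arm = max(range(2), key=lambda a: q[a])   # exploit (ties -> arm 0)
        reward = 1.0 if rng.random() < TRUE_MEANS[arm] else 0.0
        n[arm] += 1
        q[arm] += (reward - q[arm]) / n[arm]          # incremental mean
    return max(range(2), key=lambda a: q[a])

print("with exploration, settles on arm:", run(epsilon=0.1))  # reliably arm 1
print("purely greedy, settles on arm:   ", run(epsilon=0.0))  # stuck on arm 0
```

The reward signal is identical in both runs; the only difference is whether the sampling mechanism keeps visiting both arms, which is the kind of thing I mean by the limit result depending on how trajectories get sampled.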
To talk about inner misalignment, I think we need to be able to say something like:
(1) Under [learning conditions], we expect system x to maximise y in the limit.
(2) System x does not robustly learn to pursue y (rather than a proxy for y), so that under [different conditions] x no longer maximises y.
Here I don’t think we have (1), since we don’t expect the human system to learn to maximise reward (or minimise regret, or...) in the limit (i.e. this is not the objective encoded by its original steering system).
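For contrast, here’s a toy case where both (1) and (2) do hold (my own sketch, loosely in the spirit of the usual ‘coin at the end of the level’ example, not anything from the post): the proxy and the reward coincide throughout training, so the learned behaviour does maximise reward under the training conditions, and then comes apart from it under a shift.

```python
# Toy sketch of the (1)/(2) schema above (my own illustration, with made-up
# names): in training, "reach the coin" (y) and "go right" (a proxy) always
# coincide, so a learner that only sees training episodes can end up encoding
# the proxy; under a shifted distribution the two come apart.
from collections import Counter

def coin_position(training: bool) -> str:
    # Training distribution: the coin is always on the right.
    # Shifted distribution: the coin is on the left.
    return "right" if training else "left"

def learned_policy(rewarded_directions: list[str]) -> str:
    # Crude "learner": adopt whichever direction was rewarded most in training.
    return Counter(rewarded_directions).most_common(1)[0][0]

# (1) Under the training conditions, going towards the coin and going right
#     coincide, so the learned behaviour does maximise reward there, in the limit.
train = [coin_position(training=True) for _ in range(1000)]
policy = learned_policy(train)  # -> "right"

# (2) Under the shifted conditions the coin is on the left, so the same learned
#     behaviour no longer maximises reward: it pursues the proxy instead.
test_coin = coin_position(training=False)
reward = int(policy == test_coin)
print(f"policy: {policy} | coin at: {test_coin} | reward: {reward}")  # reward: 0
```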
Anyway, hopefully it’s now clear where I’m coming from—even if I am confused!
My guess is that this doesn’t matter much to your/Quintin’s broader points(?), beyond that “inner alignment failure” may not be the best description.