I notice that the word ‘corrigibility’ doesn’t appear once here! Framing 3 (online misalignment) seems to be in the close vicinity:
policies’ goals change easily in response to additional reward feedback … [vs] policies’ goals are very robust to additional reward feedback
I think the key distinction is that in your description the (goal-affecting) online learning process is sort of ‘happening to’ the AI, while corrigibility accounts for the AI instance(s)’ response(s) to the very presence and action of such a goal-affecting process.
The upshot is pretty similar, though: if the goal-affecting online updates are too slow, or the AI is too incorrigible for much/any updating to apply, we get an alignment failure, especially in a fast/high-stakes setting.
Incidentally, I think the ‘high stakes’ setting corresponds to rapidity in my tentative un-unpluggability taxonomy.