Again analogizing from the definition in “Risks From Learned Optimization”, “corrigible alignment” would be developing a motivation along the lines of “whatever my subcortex is trying to reward me for, that is what I want!” Maybe the closest thing to that is hedonism? Well, I don’t think we want AGIs with that kind of corrigible alignment, for reasons discussed below.
At first this claim seemed kind of wild, but there’s a version of it I agree with.
It seems like, conditional on the inner optimizer being corrigible in the sense of having a goal that's a pointer to some optimizer "outside" it, it's underspecified which outer optimizer that goal should point to. In the evolution → humans → gradient descent → model example, corrigibility as defined in RLO could mean that the model is optimizing for the goals of evolution, of humans, or of the gradient. This doesn't seem to differ between the RLO and steered-optimizer stories.
I think the suggestion that hedonism is the human analogue of corrigible alignment assumes that a corrigibly aligned optimizer's goal would point to the thing immediately upstream of its reward. That isn't obvious to me. Wireheading / manipulating reward signals does seem like a potential problem, but it's just a special case of being unable to steer an inner optimizer even conditional on it having a narrow corrigibility property.
Hmm, I think it’s probably more productive to just talk directly about the “steered optimizer” thing, instead of arguing about what’s the best analogy with RLO. ¯\_(ツ)_/¯
BTW this is an old post; see my more up-to-date discussion here, esp. Posts 8–10.