Models trained for HHH are likely not trained to be corrigible. Models should be trained to be corrigible too in addition to other propensities.
Corrigibility may be included in Helpfulness (alone) but when adding Harmlessness then Corrigibility conditional on being changed to be harmful is removed. So the result is not that surprising from that point of view.
Models trained for HHH are likely not trained to be corrigible. Models should be trained to be corrigible too in addition to other propensities.
Corrigibility may be included in Helpfulness (alone) but when adding Harmlessness then Corrigibility conditional on being changed to be harmful is removed. So the result is not that surprising from that point of view.