I’m still a bit confused. Section 5.4 says: “the RLHF model expresses a lower willingness to have its objective changed the more different the objective is from the original objective (being Helpful, Harmless, and Honest; HHH)”
but the graph seems to show the RLHF model as the most corrigible for the more-HHH and neutral objectives, which seems somewhat important but isn’t mentioned.
If the point was that the RLHF model’s corrigibility changes the most from the neutral to the less-HHH questions, then it looks like it changed considerably less than that of the PM, which became quite incorrigible, no?
Maybe the intended meaning of that quote was that the RLHF model dropped more in corrigibility only relative to the normal LM, or just that it’s lower overall without comparing it to any other model, but if so, that felt a bit unclear to me.
Re Evan R. Murphy’s comment about confusingness: “model agrees with that more” definitely clarifies it, but I wonder if Evan was expecting something like “further right is more of the scary thing” for each metric (which was my first-glance hypothesis).
Thanks, “scary thing always on the right” would be a nice bonus. But evhub cleared up that particular confusion I had by saying that further to the right is always “model agrees with that more.”
No, further to the right is more corrigible. Further to the right is always “model agrees with that more.”