I think RLHF probably doesn’t select for persuasiveness in particular. What it selects for is competence at making it seem like the task the model was actually supposed to do got accomplished.
And this competence will decompose into:
1. Actually doing the task well
2. Making it look like the task was done well even when it wasn’t
The second one isn’t a capability we want, but it doesn’t particularly look like persuasiveness IMO.