Do you think it is likely that techniques like RLHF result in over-developed persuasiveness relative to other capabilities? If so, do you think we can modify the training to make this less of an issue or that it is otherwise manageable?
I think RLHF probably doesn’t select for persuasiveness in particular. What it selects for is competence at making it seem like the task the model was supposed to do was accomplished.
And this competence will decompose into:
1. Actually doing the task well
2. Making it look like the task was done well even when it wasn’t
The second one isn’t a capability we want, but it doesn’t particularly look like persuasiveness IMO.