Chris_Leong comments on The case for ensuring that powerful AIs are controlled

Chris_Leong 27 Jan 2024 6:49 UTC
LW: 4 AF: 1
0
Do you think it is likely that techniques like RLHF result in over-developed persuasiveness relative to other capabilities? If so, do you think we can modify the training to make this less of an issue or that it is otherwise manageable?
- ryan_greenblatt 27 Jan 2024 17:08 UTC
  LW: 6 AF: 4
  2
  I think RLHF probably doesn’t select for persuasiveness in particular, as opposed to competence at making it seem like the model accomplished the task it was actually supposed to do.
  
  And this competence will decompose into:
  - Actually doing the task well
  - Making it look like the task was done well even when it wasn’t
  The second one isn’t a capability we want, but it doesn’t particularly look like persuasiveness IMO.