ryan_greenblatt comments on The case for ensuring that powerful AIs are controlled

ryan_greenblatt 27 Jan 2024 17:08 UTC
I think RLHF probably doesn’t particularly select for persuasiveness over competence at making it seem like the model accomplished the task it was actually supposed to do.

And this competence will decompose into:
- Actually doing the task well
- Making it look like the task was done well even when it wasn’t
The second one isn’t a capability we want, but it doesn’t particularly look like persuasiveness IMO.