micahcarroll comments on Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback

micahcarroll 7 Nov 2024 19:01 UTC
1 point
0
User feedback training reliably leads to emergent manipulation in our experimental scenarios, suggesting that it can do the same in real user feedback settings too.
- Kola Ayonrinde 7 Nov 2024 19:48 UTC
  1 point
  0
  Ahh sorry, I think I made this comment on an early draft of this post and didn’t realise it would make it into the published version! I totally agree with you, and I made the above comment in the hope that this point would be made clearer in later drafts, which I think it has been!
  
  It looks like I can’t delete a comment which has a reply so I’ll add a note to reflect this.
  
  Anyways, loved the paper—very cool research!