User feedback training reliably leads to emergent manipulation in our experimental scenarios, suggesting that it can lead to it in real user feedback settings too.
Ahh sorry, I think I made this comment on an early draft of this post and didn’t realise it would make it into the published version! I totally agree with you; I made the above comment in the hope that this point would be made clearer in later drafts, which I think it has been!
It looks like I can’t delete a comment which has a reply so I’ll add a note to reflect this.
Anyways, loved the paper—very cool research!