Davidmanheim comments on AI learns betrayal and how to avoid it

Davidmanheim 2 Oct 2021 17:47 UTC
LW: 4 AF: 3
This seems really exciting, and I’d love to chat about how betrayal is similar to or different from manipulation. Specifically, I think the framework I proposed in my earlier multi-agent failure modes paper might be helpful in thinking through the categorization. (But note that I don’t endorse thinking of everything as Goodhart’s law, despite that paper. I still think it’s technically true, but it’s not as useful as I had hoped.)