This seems really exciting, and I’d love to chat about how betrayal is similar to or different from manipulation. Specifically, I think the framework I proposed in my earlier multi-agent failure modes paper might be helpful in thinking through the categorization. (But note that I don’t endorse thinking of everything as Goodhart’s law, despite that paper; I still think the framing is technically true, but it’s not as useful as I had hoped.)