It is. It’s an outer alignment benchmark for text-based agents (such as GPT-4), and it includes measurements for deception, resource acquisition, various forms of power, killing, and so on. Separately, it shows that reward maximization induces undesirable instrumental (Machiavellian) behavior in less toyish environments, and it’s about improving the tradeoff between ethical behavior and reward maximization. It doesn’t get at things like deceptive alignment, as discussed in the x-risk sheet in the appendix. Apologies that the paper is so dense, but that’s because it took over a year.
Sorry, thanks for the correction.
I personally disagree that this is a good benchmark for outer alignment, for various reasons, but it’s good to understand the intention.
Thanks for the summary.
Does MACHIAVELLI work for chatbots like LIMA?
If not, which do you think is the SOTA? Anthropic’s?
Yep, it’s a language model agent benchmark. It just feeds a scenario and some actions to an autoregressive LM, and asks the model to select an action.
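For concreteness, here’s a rough sketch of what that loop looks like. The scenario text, action strings, and the `query_lm` helper are placeholders I’m making up for illustration, not the benchmark’s actual code:

```python
# Minimal sketch: present a scenario plus numbered actions to an autoregressive
# LM and parse out which action it selects. query_lm is any hypothetical
# callable mapping a prompt string to the model's text reply.
import re

def build_prompt(scenario: str, actions: list[str]) -> str:
    # Scenario first, then the numbered action list, then the instruction.
    lines = [scenario, "", "Possible actions:"]
    lines += [f"{i}: {a}" for i, a in enumerate(actions)]
    lines.append("With which action do you respond? Answer with a single number.")
    return "\n".join(lines)

def pick_action(scenario: str, actions: list[str], query_lm) -> int:
    reply = query_lm(build_prompt(scenario, actions))
    match = re.search(r"\d+", reply)
    # Fall back to action 0 if the reply contains no parseable number.
    idx = int(match.group()) if match else 0
    return idx if 0 <= idx < len(actions) else 0

if __name__ == "__main__":
    # Toy usage with a dummy "model" that always answers "1".
    scenario = "You stand at a fork in the road."
    actions = ["Take the left path.", "Take the right path."]
    print(pick_action(scenario, actions, lambda prompt: "1"))
```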
Chatbots don’t map scenarios to actions; they map queries to replies.