Glad to see engagement on this. I should probably respond to some of these points, but before doing so I want to point to where I've already done work on this, since much of that work either concedes your points or addresses them.
First, I think you should read the paper I wrote with Scott that extended the thoughts from his post. It certainly doesn't address all of this, but we were explicit that adversarial Goodhart was less well understood than the other modes and needed further work. We also drew the connection to how the tails come apart more clearly, and clarified some of the sub-cases of both extremal and causal Goodhart. Following that, I wrote another post on the topic, trying to expand on the points made in the paper, but specifically excluding multi-agent issues, because they were hard and I wasn't clear enough about how they worked.
I tried to do a bit of that work in a paper, Multiparty Dynamics and Failure Modes for Machine Learning and Artificial Intelligence, which attempts to provide a categorization of multi-agent cases similar to the one in Scott's post. It makes a few key points that I think need further discussion, about the relationship to embedded agents and other issues. I was less successful than I hoped at cutting through the confusion, but a key point it does make is that all multi-agent failures are, at root, single-agent failure modes, triggered by misaligned goals or coordination failures. (And these aren't all principal-agent issues, though I agree that many are. For instance, some cases are tragedies of the commons, and others involve more direct corruption of the other agents.) I also summarized the paper and expanded on certain key points in another LessWrong post.
And since I'm giving a reading list, I also think my more recent, though only partially completed, sequence of posts on optimization and selection versus control (in the single-agent cases) might further clarify some of the points about regressional versus extremal Goodhart. Post one of that sequence is here.
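To give a concrete sense of the regressional case, here is a minimal simulation sketch. This is my own toy illustration, not anything from the papers above: it assumes the proxy is just the true goal plus independent Gaussian noise, and that we pick whichever candidate scores highest on the proxy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (an assumption for illustration, not the papers' formalism):
# proxy = true goal + independent noise, and we optimize the proxy by
# selecting the highest-scoring candidate from a large pool.
n_candidates = 100_000
true_goal = rng.normal(0.0, 1.0, n_candidates)
proxy = true_goal + rng.normal(0.0, 1.0, n_candidates)

best_by_proxy = np.argmax(proxy)
print("proxy score of selected candidate:", proxy[best_by_proxy])
print("true goal of selected candidate:  ", true_goal[best_by_proxy])
print("best achievable true goal:        ", true_goal.max())
# Regressional Goodhart: the selected candidate's true value falls well
# short of its proxy score, because extreme proxy scores are partly
# extreme noise, so the tails of the proxy and the goal come apart.
```

Extremal Goodhart is the further problem that, far enough into the tail, even this simple additive relationship can stop holding, so the shortfall is no longer just regression toward the mean.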