Something that seems like it should be well-known, but I have not seen an explicit reference for:
Goodhart’s law can, in principle, be overcome via adversarial training (or, more generally, learning in Multi-Agent Systems)
—aka “The enemy is smart.”
Goodhart’s law only really applies to a “static” objective, not when the objective is the outcome of a game with other agents who can adapt.
This doesn’t really require the other agents to act in a way that continuously “improves” the training objective, either; it just requires them to keep throwing adversarial examples at the agent, forcing it to “generalize”.
In particular, I think this is the basic reason why any reasonable Scalable Oversight protocol would be fundamentally “multi-agent” in nature (like Debate).
This just moves the proxy-being-Goodharted-against from some hardcoded ruleset to a (presumably human) evaluator or selector of adversarial examples.
This then sets up something like a Generative Adversarial Network. The trouble is, such a setup is inherently unstable. Without careful guidance, one of the two adversaries will tend to dominate.
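To make the structure concrete, here is a minimal toy sketch of that GAN-like two-player loop (my own illustration, assuming PyTorch; the networks, losses, and hyperparameters are arbitrary and not from anything above). The relevant feature is that neither player faces a static objective: each one's loss is defined by the other's current behaviour.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # "generator" / agent being trained
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # "discriminator" / adaptive adversary
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = 2.0 + 0.5 * torch.randn(64, 1)   # samples from the "true" 1-D distribution
    noise = torch.randn(64, 1)

    # Discriminator step: its targets depend on whatever G currently produces.
    fake = G(noise).detach()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: its "objective" is whatever currently fools D,
    # so there is no fixed proxy for it to Goodhart against.
    g_loss = bce(D(G(noise)), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```

The instability shows up here too: if one player gets too far ahead (say, D is trained much harder than G), the other's gradients stop carrying useful signal and the game collapses rather than continuing to sharpen both players.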
In predator/prey relationships in nature, a stable relationship can come about if the predators starve and reproduce less when they eat too many of the prey. If that effect isn’t strong enough (maybe the predators have several prey species), the prey species can go extinct. In multi-prey scenarios the prey species is also helped by becoming less common, and thus less likely to be found and killed by predators and less vulnerable to species-specific disease.
Obviously, these specific considerations don’t apply in a literal sense. I’m trying to point out the general concept that you need counterbalancing factors for an adversarial relationship to stay stable.
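To make the counterbalancing-factor point a bit more concrete, here is a rough Lotka-Volterra-style sketch (my own toy illustration with arbitrary parameters, not anything from the post). The `delta*x*y - gamma*y` term is the "predators starve when prey are scarce" check; the `outside_food` term stands in for the predators having other prey species, which removes that check.

```python
# Plain Lotka-Volterra dynamics with an optional outside "food subsidy" for predators.
def simulate(outside_food=0.0, alpha=1.0, beta=0.5, delta=0.2, gamma=0.4,
             x0=2.0, y0=1.0, dt=0.001, steps=50_000):
    x, y = x0, y0  # prey, predator populations
    for _ in range(steps):
        dx = alpha * x - beta * x * y                      # prey reproduce, get eaten
        dy = delta * x * y - gamma * y + outside_food * y  # predators starve without prey... unless subsidized
        x, y = max(x + dt * dx, 0.0), max(y + dt * dy, 0.0)
    return x, y

print(simulate())                  # coupled case: prey and predators keep oscillating, neither dies out
print(simulate(outside_food=0.5))  # predators fed elsewhere: prey driven toward zero, predators keep growing
```

The analogue for adversarial training is that something has to make the "predator" worse off when it wins too completely; otherwise the game degenerates instead of continuing to exert pressure to generalize.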