Yeah, I don’t think that interpretation is what I was trying to get across. I’ll try to clean it up to clarify:
I see [the] mesa optimization [problem (i.e. inner alignment)] as a generalization of Goodhart’s Law[, which is that a]ny time you make a system optimize for a proxy measure instead of the real target, the proxy itself may become the goal of the inner system, even when overoptimizing it runs counter to hitting the real target.
Not helping? I did not mean to imply that a mesa optimizer is necessarily misaligned or learns the wrong goal; it’s just hard to ensure that it learns the base one.
Goodhart’s law is usually stated as “When a measure becomes a target, it ceases to be a good measure”, which I would interpret more succinctly as “proxies get gamed”.
More concretely, from the Wikipedia article,
For example, if an employee is rewarded by the number of cars sold each month, they will try to sell more cars, even at a loss.
Then the analogy goes like this. The desired target (base goal) was “profits”, but the proxy chosen to measure that goal was “number of cars sold”. Under normal conditions, this works: the proxy points in the direction of the target. That’s what makes it a proxy. But if you optimize the proxy too hard, you blow past the base goal and end up pursuing the proxy itself. The outer system (optimizer) is the company, which is trying to optimize the employees. The inner system (optimizer) is the employee, who tries to maximize their own reward. The employee “learned” the wrong (mesa) goal, “sell as many cars as possible (at any cost)”, which is not aligned with the base goal of “profits”.
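To make the numbers concrete, here is a minimal sketch of the analogy (the demand model, prices, and figures below are my own illustration, not anything from the source): a modest discount raises both cars sold (the proxy) and profit (the base goal), but maximizing cars sold alone means selling every car at a loss.

```python
# Toy sketch of the car-sales analogy. All numbers are made up for
# illustration: sticker price, dealer cost, and the linear demand curve
# are assumptions, not anything from the original discussion.

STICKER_PRICE = 25_000   # hypothetical list price per car
COST_PER_CAR = 20_000    # hypothetical dealer cost per car

def cars_sold(discount: int) -> int:
    """Proxy measure: demand rises as the price drops (toy linear model)."""
    return 10 + 3 * (discount // 500)

def profit(discount: int) -> int:
    """Base goal: units sold times per-car margin at the discounted price."""
    margin = (STICKER_PRICE - discount) - COST_PER_CAR
    return cars_sold(discount) * margin

discounts = range(0, 10_001, 500)

best_for_proxy = max(discounts, key=cars_sold)   # mesa goal: sell the most cars
best_for_goal = max(discounts, key=profit)       # base goal: maximize profit

print(best_for_proxy, cars_sold(best_for_proxy), profit(best_for_proxy))
# -> 10000 70 -350000   (proxy maxed out, every car sold at a loss)
print(best_for_goal, cars_sold(best_for_goal), profit(best_for_goal))
# -> 1500 19 66500      (moderate discount; proxy and goal still agree)
```

Note that up to a point the two objectives move together, which is exactly why the proxy looks safe; the divergence only shows up under heavy optimization pressure.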