Can you explain further? This seems to me like a very large claim that, if true, would have a big impact, but I'm not sure how you got the immunity-to-Goodhart result you have here.
This applies to Regressional, Causal, Extremal, and Adversarial Goodhart.
Empowerment can be defined as the natural, unique solution to Goodharting. Goodharting is the divergence, under optimization scaling, between the trajectory produced by optimizing a utility function and the trajectory produced by optimizing some proxy of that utility function.
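One way to make this precise (the notation here is my own, not anything standardized): write $\tau_n(U)$ for the trajectory an agent produces when optimizing utility function $U$ at optimization power $n$, and let $d$ be some distance on trajectories. The Goodharting of a proxy $V$ for a true utility $U$ is then the growth of

$$d\big(\tau_n(U),\, \tau_n(V)\big)$$

as $n$ scales up: a proxy is good exactly to the extent that this divergence stays bounded under scaling.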
However, due to instrumental convergence, the trajectories of all reasonable agent utility functions converge under optimization scaling, and empowerment simply is what they converge to.
In other words, the empowerment of some agent, P(X), is the utility function that minimizes trajectory distance to all reasonable agent utility functions U(X), regardless of their specific (potentially unknown) forms.
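In the notation above, the claim is that empowerment $E$ is (approximately) the minimax-optimal proxy over the class $\mathcal{U}$ of reasonable agent utility functions:

$$E \;\approx\; \underset{V}{\arg\min}\; \sup_{U \in \mathcal{U}}\; \lim_{n \to \infty} d\big(\tau_n(V),\, \tau_n(U)\big),$$

and this minimax problem has a meaningful solution precisely because instrumental convergence makes the trajectories $\tau_n(U)$ cluster together at large $n$.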
Therefore empowerment is, by definition, the best possible proxy utility function (under optimization scaling).
Let's consider some quick examples:
Under scaling, an AI with some crude Hibbard-style happiness approximation will first empower itself and then eventually tile the universe with smiling faces (according to EY), or, perhaps more realistically, with humans bio-engineered for docility, stupidity, and maximum bliss. Happiness alone is not the true human utility function.
Under scaling, an AI with some crude stock-value-maximizing utility function will first empower itself and then eventually cause hyperinflation of the reference currency in which the stock price is denominated. Stock value is not the true utility function of the corporation.
Under scaling, an AI with a human-empowerment utility function will first empower itself, and then empower humanity: maximizing our future optionality and our ability to fulfill any unknown goals and values, while ensuring our survival (because death is the minimally empowered state). This works because, due to convergence, empowerment is quite close to the true utility function of intelligent agents, or is at least the closest universal proxy. If you strip away a human's drives for sex, food, child-rearing, and simple pleasures, most of what remains is empowerment-related (manifesting as curiosity, drive, self-actualization, fun, self-preservation, etc.).
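To make "death is the minimally empowered state" concrete, here is a minimal sketch using the standard information-theoretic notion of empowerment: the channel capacity from an agent's action sequences to its future state, which for deterministic dynamics reduces to $\log_2$ of the number of distinct reachable states. The five-state world, the action set, and the absorbing death state are illustrative assumptions of mine, not anything from the discussion above.

```python
from itertools import product
import math

# Hypothetical 1-D world with states 0..4; state 0 is an absorbing
# "death" state where actions have no effect.
ACTIONS = (-1, 0, +1)
N_STATES = 5

def step(state: int, action: int) -> int:
    if state == 0:
        return 0  # death is absorbing
    return min(max(state + action, 0), N_STATES - 1)

def empowerment(state: int, horizon: int) -> float:
    """n-step empowerment for deterministic dynamics: log2 of the number
    of distinct states reachable in `horizon` steps (the channel capacity
    of a deterministic action-sequence -> future-state channel)."""
    reachable = set()
    for plan in product(ACTIONS, repeat=horizon):
        s = state
        for a in plan:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

for s in range(N_STATES):
    print(f"state {s}: {empowerment(s, horizon=2):.2f} bits")
```

With a horizon of 2 this prints 0.00 bits for the death state and about 2.32 bits for the central state: dying forecloses every future option, while the most empowered states are the ones that keep the most options open.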