Mild optimization: the easiest way to solve hard tasks may be to specify a proxy, which an AI maximizes. The AI steers into configurations which maximize the proxy function. Simple proxies don’t usually have target sets which we like, because human value is complex. However, maybe we just want the AI to randomly select a configuration which satisfies the proxy, instead of finding the proxy-maximizing configuration, which may be bad due to extremal Goodhart.
Quantilization tries to solve this by randomly selecting a target configuration from some top quantile, but this is sensitive to how world states are individuated.
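A minimal toy sketch of the quantilizer selection rule (the `quantilize` helper, the state names, the proxy values, and the `q` parameter below are all invented for illustration, not from the post). It also shows the individuation sensitivity: re-carving the same outcomes into more or fewer states changes what counts as the top quantile.

```python
import random

def quantilize(candidates, proxy, q=0.1, rng=random):
    """Sample uniformly from the top q-fraction of candidates ranked by the
    proxy, instead of returning the single proxy-maximizing candidate."""
    ranked = sorted(candidates, key=proxy, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:cutoff])

# Toy proxy over toy world states; all names and values are invented.
proxy = {"extreme_goodhart_world": 100, "fine_world_a": 10,
         "fine_world_b": 10, "fine_world_c": 10, "bad_world": 0}.get

# Individuation sensitivity: splitting the "fine" outcome into three states
# versus lumping it into one changes which states land in the top quantile,
# even though the underlying situations are the same.
split  = ["extreme_goodhart_world", "fine_world_a", "fine_world_b",
          "fine_world_c", "bad_world"]
lumped = ["extreme_goodhart_world", "fine_world_a", "bad_world"]

print(quantilize(split, proxy, q=0.4))   # top 2 of 5: often avoids the Goodhart state
print(quantilize(lumped, proxy, q=0.4))  # top 1 of 3: always the Goodhart state
```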
This makes sense, but I think you’d need a different notion of optimizing systems than the one used in this post. (In particular, instead of a target configuration set, you want a continuous notion of goodness, like a utility function / reward function.)
I’m saying the target set for non-mild optimization is the set of configurations which maximize proxy-ness. Just take the argmax. By contrast, we might want to sample uniformly at random from the set of satisficing configurations, which is much larger.
(This is assuming a fixed initial state)
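To spell that out (notation mine, not from the post): writing $u$ for the proxy and $\mathcal{C}$ for the configurations reachable from the fixed initial state, the non-mild target set is the argmax set, whereas a mild optimizer would sample uniformly from a satisficing set for some threshold $\theta$:

$$T_{\max} = \operatorname*{arg\,max}_{c \in \mathcal{C}} u(c), \qquad T_{\theta} = \{\, c \in \mathcal{C} : u(c) \ge \theta \,\}, \qquad c \sim \mathrm{Uniform}(T_{\theta}).$$

For any $\theta \le \max_{c} u(c)$ we have $T_{\max} \subseteq T_{\theta}$, and $T_{\theta}$ is typically much larger.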
It sounds like you’re assuming that the target configuration set is built into the AI system. According to me, a major point of this post / framework is to avoid that assumption altogether, and only describe problems in terms of the actual observed system behavior.
(This is why within this framework I couldn’t formalize outer alignment, and why wireheading and the search / mesa-objective split is unnatural.)
I see the tension you’re pointing at. I think I had in mind something like “an AI is reliably optimizing utility function u over the configuration space (but not necessarily over universe-histories!) if it reliably moves into high-rated configurations”, and you could draw different epsilon-neighborhoods of optimality in configuration space. It seems like you should be able to talk about dog-maximizers without requiring that the agent robustly end up in the maximum-dog configurations (as opposed to, say, the max-minus-one-dog configurations).
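One way to make the epsilon-neighborhood idea precise (again, notation mine): define the set of $\epsilon$-optimal configurations

$$S_{\epsilon}(u) = \{\, c \in \mathcal{C} : u(c) \ge \textstyle\sup_{c'} u(c') - \epsilon \,\},$$

and call the system a $u$-optimizer at tolerance $\epsilon$ if it reliably evolves into $S_{\epsilon}(u)$ from a broad basin of perturbed starting configurations. A dog-maximizer then only needs to reliably reach $S_{\epsilon}$ for some modest $\epsilon$ (which can include the max-minus-one-dog configurations), rather than the exact argmax set $S_{0}$.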
I’m still confused about parts of this.