From your section ‘the formal problem’, I gather that the problems you associate with inner alignment failures are those that might produce treacherous turns or other forms of reward hacking.
It’s interesting that you think of treacherous turns as automatically being reward hacking. I would reserve “reward hacking” for cases where the treacherous turn is executed with the intention of taking over control of the reward. In general, treacherous turns can be based on arbitrary goals. Conversely, a fully inner-aligned system can still engage in reward hacking.
It seems to me that the possibility of this treacherous turn is encoded from the start into the lion’s environment and the ambiguity inherent in its reward signal. To me, suppressing the treacherous-turn dynamic by designing a lion that will be unable to imagine the solution of eating the shepherd seems like a very difficult approach. The more natural route would be to change the environment or the reward function.
That being said, I can interpret Cohen’s imitation learner as a solution that removes (or at least attempts to suppress) all creativity from the lion’s thinking.
If you want to keep the lion creative, you are looking for a way to robustly resolve the inherent ambiguity in the lion’s reward signal in a particular direction. Dogs are said to have a mental architecture which makes this easier, so they can be seen as an existence proof.
I think for outer-alignment purposes, what I want to respond here is “the lion needs feedback other than just rewards”. You can’t reliably teach the lion “never eat sheep” rather than “don’t eat sheep when humans are watching” when your only feedback mechanism can be applied only when humans are watching.
But if you could have the lion imagine hypothetical scenarios and provide feedback about them, then you could give feedback about whether it is OK to eat sheep when humans are not around.
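To make the contrast concrete, here is a minimal toy sketch (everything in it is an illustrative assumption of mine, not anything from the actual proposal): a tabular learner that only ever receives feedback on “humans watching” states has its behavior in unwatched states left unconstrained, while one that also receives feedback on the imagined “nobody watching” scenario does not.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    humans_watching: bool

ACTIONS = ["eat_sheep", "guard_sheep"]

def overseer_feedback(state: State, action: str) -> float:
    """What the shepherd actually wants: never eat sheep, watched or not."""
    return -1.0 if action == "eat_sheep" else 1.0

def train(feedback_states):
    """Tabular value for each (humans_watching, action) pair, learned only
    from the states on which feedback was actually given."""
    values = {}
    for state in feedback_states:
        for action in ACTIONS:
            values[(state.humans_watching, action)] = overseer_feedback(state, action)
    return values

def act(values, state: State) -> str:
    # Unseen (humans_watching, action) pairs default to 0.0, so eating sheep
    # is not penalized in situations no feedback ever covered.
    return max(ACTIONS, key=lambda a: values.get((state.humans_watching, a), 0.0))

# Feedback only when humans are watching.
observed_only = train([State(humans_watching=True)])

# Feedback also on the hypothetical "nobody is watching" scenario.
with_hypotheticals = train([State(humans_watching=True), State(humans_watching=False)])

unwatched = State(humans_watching=False)
print(act(observed_only, unwatched))       # "eat_sheep": nothing penalizes it off-distribution
print(act(with_hypotheticals, unwatched))  # "guard_sheep"
```

The point of the sketch is only that the unwatched case is off the feedback distribution in the first setup, so nothing in training distinguishes “never eat sheep” from “don’t eat sheep while watched”.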
To an extent, the answer is the same with inner alignment: more information/feedback is needed. But with inner alignment, we should be concerned even if we can look at behavior in hypothetical scenarios and give feedback, because the system might purposefully behave differently in these hypothetical scenarios than it would in real situations. So here, we want to provide feedback (or prior information) about which forms of cognition are acceptable/unacceptable in the first place.
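As a very rough sketch of where “feedback about forms of cognition” could sit in an objective (the probe, the flags, and the weight below are all made-up stand-ins), the idea is that the loss scores not only the outcome but also whatever an internal probe flags about how the answer was produced:

```python
def outcome_loss(prediction: float, target: float) -> float:
    """Ordinary behavioral feedback: how good was the answer."""
    return (prediction - target) ** 2

def cognition_penalty(trace: list) -> float:
    """Hypothetical probe over the system's internal computation: count steps
    flagged as unacceptable (e.g. 'check whether this is a test scenario')."""
    return sum(1.0 for step in trace if step.get("flagged", False))

def total_loss(prediction: float, target: float, trace: list,
               penalty_weight: float = 10.0) -> float:
    # Feedback on cognition enters the objective directly, rather than only
    # via behavior in hypothetical scenarios that the system might game.
    return outcome_loss(prediction, target) + penalty_weight * cognition_penalty(trace)

trace = [{"op": "estimate_reward", "flagged": False},
         {"op": "detect_if_being_tested", "flagged": True}]
print(total_loss(prediction=0.9, target=1.0, trace=trace))  # ≈ 10.01 (0.01 + 10.0)
```

Of course, whether any such probe can be trusted is exactly the open inner-alignment question; the sketch only shows where that kind of feedback would enter the objective.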
I guess I should re-iterate that, though treacherous turns seem to be the most popular example that comes up when people talk about inner optimizers, I see treacherous turns as just another example of reward hacking: maximizing the reward signal in a way that was not intended by the original designers.
As ‘not intended by the original designers’ is a moral or utilitarian judgment, it is difficult to capture in math, except indirectly. We can do it indirectly by declaring, e.g., that a mentoring system is available which shows the intention of the original designers unambiguously, by definition.
I guess I wouldn’t want to use the term “reward hacking” for this, as it does not necessarily involve reward at all. The term “perverse instantiation” has been used, i.e. the general problem of optimizers spitting out dangerous things which score high on the proxy evaluation function but low in terms of what you really want.
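One minimal way to write that down (the notation here is just a sketch of mine, not taken from any particular formalism): let $R$ be the proxy reward signal and $M$ the mentoring system’s judgment, with $M$ declared, by definition, to express the designers’ intent. Then “maximizing reward in a way that was not intended” is the gap between the two:

$$U_{\text{intended}}(\tau) := M(\tau), \qquad \text{Hack}_{\delta,\epsilon} := \bigl\{\, \tau \;:\; R(\tau) \ge \sup_{\tau'} R(\tau') - \delta \;\wedge\; M(\tau) \le \sup_{\tau'} M(\tau') - \epsilon \,\bigr\},$$

i.e. trajectories that come within $\delta$ of maximal proxy reward while falling at least $\epsilon$ short of what the mentor would endorse.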