I’m not intending to use Def’n 2 at all. The hope here is not that we can “rest assured that there is no dangerous consequentialist means-end reasoning” because, e.g., such reasoning doesn’t fit the context in question. The hope is merely that if we don’t specifically differentially reinforce unintended behavior, there’s a chance we won’t get it (even if the AI has scope to engage in it).
I see your point that consistently, effectively “boxing” an AI during training could also be a way to avoid reinforcing behaviors we’re worried about. But they don’t seem the same to me: I think you can get the (admittedly limited) benefit of process-based supervision without boxing. Boxing an AI during training might have various challenges and competitiveness costs. Process-based supervision means you can allow an unrestricted scope of action, while avoiding specifically reinforcing various unintended behaviors. That seems different from boxing.
Hm, it seems to me that RL would be more like training away the desire to deceive, although I’m not sure either “ability” or “desire” is totally on target; I think something like “habit” or “policy” captures it better. The training might not be bulletproof (AI systems might have multiple goals and sometimes notice that deception would help accomplish some of them), but one doesn’t need 100% elimination of deception anyway, especially when the training is combined with effective checks and balances.