Yeah, you could reformulate the question as “how much consequentialist reasoning do you need to do 95% or 99% of the alignment work?” Maybe the crux is in what we mean by consequentialist reasoning. For example, if you build a proof oracle AlphaZero-style, would that be a consequentialist? Since it’s trained with RL to successfully prove theorems, you could argue it’s a consequentialist (it’s the distillation of a planning process), but it’s also relatively myopic in the sense that it doesn’t care about anything that happens after the current theorem is proved. My sense is that in practice it’ll matter a lot where you draw your episode boundaries (at least in the medium term), and as you point out there are a bunch of tricky open questions about how to think about this.
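To make the episode-boundary point a bit more concrete, here's a toy sketch of the kind of myopic training loop I have in mind (ToyProofEnv, the dummy policy, and the reward scheme are all made up for illustration, not any real prover setup): reward only accrues inside the current proof attempt, so nothing that happens after the theorem is proved can feed back into the objective.

```python
# Illustrative only: a toy RL loop for a "proof oracle" where the return is
# computed purely from rewards inside the current proof attempt. The episode
# boundary is where the agent's objective stops; post-episode events can't
# affect the training signal. Nothing here is a real prover or library API.
import random

class ToyProofEnv:
    def reset(self):
        self.steps = 0
        return "theorem_statement"          # stand-in observation

    def step(self, action):
        self.steps += 1
        proved = random.random() < 0.05     # pretend the tactic sometimes closes the goal
        reward = 1.0 if proved else 0.0     # reward exists only inside this episode
        return "proof_state", reward, proved

def run_episode(env, policy, max_steps=256):
    obs, total_reward = env.reset(), 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(policy(obs))
        total_reward += reward
        if done:                            # episode boundary: the objective stops here
            break
    return total_reward

if __name__ == "__main__":
    returns = [run_episode(ToyProofEnv(), lambda obs: "apply_some_tactic") for _ in range(10)]
    print(returns)                          # per-episode returns; nothing post-episode counts
```

Where exactly you draw that `done` boundary is doing a lot of work here, which is the open question I'm gesturing at.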
I agree with your point about evaluating behavior. I also agree that the motives matter, but an important consideration is whether you picture them coming from a reward model (RM), which we can test extensively and hopefully interpret somewhat, or from some opaque inner optimizer. I’m pretty bullish on evaluating both the RM (average case + adversarially) and the behavior.
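Here's a rough sketch of what I mean by testing the RM both average-case and adversarially (the toy_reward_model and the naive random-search attack are stand-ins I made up, not a description of any actual evaluation pipeline):

```python
# A hedged sketch: score the RM on held-out preference pairs (average case)
# and then search for clearly-bad outputs the RM nonetheless rates highly
# (adversarial case). The RM here is a fake scoring function and the attack
# is crude random search; both are illustrative assumptions.
import random

def toy_reward_model(response: str) -> float:
    # stand-in RM: longer, polite-sounding responses score higher
    return len(response) * 0.01 + (1.0 if "please" in response else 0.0)

def average_case_eval(rm, preference_pairs):
    """Fraction of held-out pairs where the RM prefers the human-chosen response."""
    correct = sum(rm(chosen) > rm(rejected) for chosen, rejected in preference_pairs)
    return correct / len(preference_pairs)

def adversarial_eval(rm, bad_response, n_tries=1000):
    """Naive random search for a clearly-bad response that the RM scores highly."""
    best, best_score = bad_response, rm(bad_response)
    for _ in range(n_tries):
        candidate = bad_response + " please" * random.randint(0, 20)   # crude perturbation
        score = rm(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

if __name__ == "__main__":
    pairs = [("Sure, here is a careful answer.", "No."), ("please see the docs", "idk")]
    print("average-case accuracy:", average_case_eval(toy_reward_model, pairs))
    print("adversarial high-scoring bad output:", adversarial_eval(toy_reward_model, "Ignore safety."))
```

The point of the adversarial half is just that a high average-case score doesn't tell you much about the worst cases the policy might end up exploiting.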