Sorry, I do think you raised a valid point! I had read your comment in a different way.
I think what I should have said is: aggressively training AI directly on outcome-based tasks ("training it to be agentic", so to speak) may well produce persistently-activated inner consequentialist reasoning of some kind (though not necessarily the flavor historically expected). I most strongly disagree with arguments that behave the same for a) this more aggressive curriculum and b) pretraining, and I think it's worth distinguishing between these kinds of argument.
Sure—I agree with that. The section I linked from Conditioning Predictive Models actually works through, at least to some degree, how I think simplicity arguments for deception go differently for purely pretrained predictive models.