Maybe you’ve addressed this elsewhere but isn’t scheming convergent in the sense that a perfectly aligned AGI would still have an incentive to do so unless they already fully know themselves? An aligned AGI can still desire to have some unmonitored breathing room to fully reflect and realize what it truly cares about, even if that thing is what we want.
Also a possible condition for a fully corrigible AGI would be to not have this scheming incentive in the first place even while having the capacity to scheme.
Do you think putting extra effort into learning about existing empirical work while doing conceptual work would be sufficient for good conceptual work or do you think people need to be producing empirical work themselves to really make progress conceptually?