IMO scaling laws make this actually pretty easy. The curves are smooth, the relationship between the capabilities of a smaller model and a less-trained larger model seems fairly reliable, and obviously you can just use your previous or smaller model to bootstrap and/or cross-check the new overseer. Bootstrapping trust at all is hard to ground out (humans with interpretability tools?), and I expect oversight to be difficult, but I don’t expect doing it from the start of training to present any particular challenge.
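For concreteness, here's a minimal sketch of the kind of extrapolation I have in mind: fit a saturating power law to loss numbers from smaller or earlier runs, then read off roughly where a larger run should be partway through training, i.e. the point at which you'd want the overseer keeping up. The compute values, losses, and the functional form below are all illustrative assumptions, not anyone's actual fit.

```python
# Sketch: fit L(C) = a * C**-b + c to (compute, loss) points from smaller
# runs, then extrapolate to a partially-trained larger model. All numbers
# here are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def loss_curve(compute, a, b, c):
    # Saturating power law: irreducible loss c plus a term that shrinks with compute.
    return a * compute ** -b + c

# Hypothetical (compute, eval loss) pairs from previous smaller runs.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
loss = np.array([3.10, 2.85, 2.62, 2.45, 2.31])

params, _ = curve_fit(loss_curve, compute, loss, p0=(1.0, 0.2, 2.0))

# Predict loss at an intermediate checkpoint of a larger run.
checkpoint_compute = 300.0
predicted = loss_curve(checkpoint_compute, *params)
print(f"predicted loss at compute {checkpoint_compute}: {predicted:.3f}")
```

The point isn't the specific functional form; it's that the smoothness of these curves means your previous/smaller model gives you a decent prediction of, and sanity check on, what the new model should look like at each stage.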
(opinions my own, you know the drill)