IMO scaling laws make this actually pretty easy. The curves are smooth, the relationship between the capabilities of a smaller model and a less-trained larger model seems fairly reliable, and obviously you can just use your previous or smaller model to bootstrap and/or cross-check the new overseer. Bootstrapping trust at all is hard to ground out (humans with interpretability tools?), and I expect oversight to be difficult, but I don’t expect doing it from the start of training to present any particular challenge.
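For concreteness, here's a minimal sketch of the kind of extrapolation I have in mind: fit a saturating power law to loss numbers from smaller or earlier runs, then read off roughly where a larger run should be partway through training, i.e. the point at which you'd want the overseer keeping up. The compute values, losses, and the functional form below are all illustrative assumptions, not anyone's actual fit.

```python
# Sketch: fit L(C) = a * C**-b + c to (compute, loss) points from smaller
# runs, then extrapolate to a partially-trained larger model. All numbers
# here are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def loss_curve(compute, a, b, c):
    # Saturating power law: irreducible loss c plus a term that shrinks with compute.
    return a * compute ** -b + c

# Hypothetical (compute, eval loss) pairs from previous smaller runs.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
loss = np.array([3.10, 2.85, 2.62, 2.45, 2.31])

params, _ = curve_fit(loss_curve, compute, loss, p0=(1.0, 0.2, 2.0))

# Predict loss at an intermediate checkpoint of a larger run.
checkpoint_compute = 300.0
predicted = loss_curve(checkpoint_compute, *params)
print(f"predicted loss at compute {checkpoint_compute}: {predicted:.3f}")
```

The point isn't the specific functional form; it's that the smoothness of these curves means your previous/smaller model gives you a decent prediction of, and sanity check on, what the new model should look like at each stage.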
(opinions my own, you know the drill)