I’m unsurprised (and relieved) to hear that other people have been thinking along similar lines — in retrospect this is a really obvious idea. It’s also one whose time has come: people are already training small (few-billion parameter) models on mostly or entirely synthetic data, so it should be very doable to experiment with this alignment technique at that scale, for appropriately simple alignment goals, to see how well it works and learn more about how to make it work well — quite possibly people already have, and just haven’t published the results yet. (I suspect the topic of synthetic training data techniques may be hot/sensitive/competitive enough that it might be a little challenging to publish a “just the alignment techniques, without the capabilities” paper on it.)