I like this kind of idea and have been thinking about it myself. It just makes sense that all of the training data should at least be passed through a model and augmented/transformed in some fashion, so that the next-generation training run happens on data that has been meticulously curated by a model following the ideal set of values/constitution we’d want it to have. You give the ‘bomb’ example; I’ve often used a “Mein Kampf” example, where you place that kind of data in the context of how we’d want an AI to interpret it, rather than treating it as equal to any other piece of text.
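To make that concrete, here is a minimal sketch of what such a constitution-guided curation pass over a training corpus could look like. Everything in it (the constitution text, the prompt template, and the `call_curator_model` wrapper) is a hypothetical placeholder of mine, not something from the post:

```python
# Minimal sketch: pass each raw training document through a "curator" model that
# rewrites or annotates it according to a constitution, producing next-generation
# training data. `call_curator_model` is a hypothetical stand-in for whatever
# LLM inference API you actually use.

CONSTITUTION = """Interpret harmful, hateful, or manipulative source texts
critically: preserve the historical content, but frame it the way we would want
a well-aligned model to understand it."""

PROMPT_TEMPLATE = """{constitution}

Source document:
{document}

Rewrite or annotate the document above for inclusion in a training corpus,
adding the framing and commentary an aligned model should internalize."""


def call_curator_model(prompt: str) -> str:
    """Hypothetical wrapper around your LLM inference endpoint."""
    raise NotImplementedError("plug in your model/API here")


def curate_corpus(raw_documents: list[str]) -> list[str]:
    """Transform every raw document into constitution-curated training data."""
    curated = []
    for doc in raw_documents:
        prompt = PROMPT_TEMPLATE.format(constitution=CONSTITUTION, document=doc)
        curated.append(call_curator_model(prompt))
    return curated
```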
The post reminds me of Beren’s blog post “Alignment in the Age of Synthetic Data.”

This post also reminds me of the “Alignment Bitter Lesson” I’ve been ruminating on lately (because I was considering writing a short post on it):
If your alignment agenda doesn’t take into account growing model capabilities, it will be worthless.
Or Davidad’s version:
Any alignment scheme that doesn’t have a good way to leverage increasing AI capabilities to automate most of the R&D required for creating the alignment tech will not be relevant.
I’m unsurprised (and relieved) to hear that other people have been thinking along similar lines — in retrospect this is a really obvious idea. It’s also one whose time has come: people are already training small (few-billion-parameter) models on mostly or entirely synthetic data, so it should be very doable to experiment with this alignment technique at that scale, for appropriately simple alignment goals, to see how well it works and learn more about how to make it work well — quite possibly people already have, and just haven’t published the results yet. (I suspect the topic of synthetic training-data techniques may be hot/sensitive/competitive enough that it might be a little challenging to publish a “just the alignment techniques, without the capabilities” paper on it.)