If I were in your position, I would work on the ideas described in my post How to Control an LLM’s Behavior and the paper Pretraining Language Models with Human Preferences that inspired it.
From the paper’s results, the approach is very effective; my post discusses how to make it very controllable and flexible; and it has the particular advantage that, since it’s done at pretraining time, it can’t easily be fine-tuned away out of an open-source model. (Admittedly, that last property might do more for your employability at Meta FAIR Paris or Mistral than at DeepMind; but then, which of those seems like the higher x-risk to solve?)
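To make the mechanism concrete, here is a minimal Python sketch of the conditional-training idea from the paper: score each pretraining document with some preference classifier, prepend a control token recording the verdict, and then condition on the “good” token at inference time. The token names, threshold, and `score_document` stand-in below are illustrative assumptions, not the paper’s exact recipe.

```python
# Sketch of conditional pretraining: tag each document with a control
# token based on a preference score, so the LM learns the conditional
# distribution p(text | token) and can be steered at inference time.

GOOD, BAD = "<|good|>", "<|bad|>"
THRESHOLD = 0.5  # assumed cutoff; in practice this is tuned per task


def score_document(text: str) -> float:
    """Hypothetical preference scorer in [0, 1]; stands in for a real
    reward model or rule-based classifier."""
    return 0.0 if "unwanted behavior" in text else 1.0


def tag_document(text: str) -> str:
    """Prepend the control token the model will be conditioned on."""
    token = GOOD if score_document(text) >= THRESHOLD else BAD
    return f"{token}{text}"


corpus = ["a helpful, harmless document", "an unwanted behavior example"]
tagged_corpus = [tag_document(doc) for doc in corpus]
print(tagged_corpus)

# At inference, prompt the trained model with the good token to steer it:
#   model.generate(GOOD + user_prompt)
```

Because the tags are baked into the pretraining distribution itself, removing the steering would require retraining rather than a cheap fine-tune, which is what makes the open-source case interesting.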
I like this idea. Can I DM you about the research frontier?
Of course. I also wrote a second post on another specific application of this approach: Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment.