Make base models great again. I’m still nostalgic for GPT-2 and GPT-3. I understand why RLHF was invented in the first place, but it seems to me that you could still train a base model so that, if it’s about to say something dangerous, it simply cuts the generation short by emitting the <endoftext> token instead.
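Here’s a minimal sketch of what that might look like at inference time. The `danger_score` classifier and `sample_next_token` sampler are hypothetical stand-ins, not real APIs; the same effect could presumably also be baked in at training time by appending <endoftext> after dangerous prefixes in the corpus.

```python
EOT = "<endoftext>"
DANGER_THRESHOLD = 0.9  # hypothetical cutoff, tune to taste

def danger_score(text: str) -> float:
    """Hypothetical classifier: how likely is it that continuing
    `text` leads somewhere dangerous? Stubbed out for illustration."""
    return 0.0

def sample_next_token(text: str) -> str:
    """Hypothetical next-token sampler for a base model. Stubbed out."""
    return EOT

def generate(prompt: str, max_tokens: int = 256) -> str:
    """Sampling loop that bails out by emitting <endoftext> as soon as
    the partial generation looks dangerous."""
    out = prompt
    for _ in range(max_tokens):
        if danger_score(out) > DANGER_THRESHOLD:
            return out + EOT  # cut the generation short
        token = sample_next_token(out)
        out += token
        if token == EOT:
            break
    return out
```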
Alternatively, make models natively emit structured data. LLMs in their current form emit free-form, arbitrary text that has to be parsed in all sorts of annoying ways before it’s useful to any downstream application anyway (see the sketch below). Structured output could also help prevent misaligned behavior.
(I’m less confident in this idea than the previous one.)
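To make the contrast concrete, here’s a hedged sketch of what natively structured output might buy a downstream caller. The `model` callable and its `schema=` parameter are illustrative placeholders, not a real API.

```python
import json

def parse_freeform(text: str) -> dict:
    """Today: scrape structure out of arbitrary prose, e.g. pulling a
    number out of 'Sure! The answer is 42 (I think).'"""
    for token in text.replace("(", " ").replace(")", " ").split():
        if token.isdigit():
            return {"answer": int(token)}
    raise ValueError("could not parse model output")

def call_structured(model, prompt: str, schema: dict) -> dict:
    """Proposed: the model emits data matching `schema` directly, so the
    caller just validates and uses it. No scraping, and nothing outside
    the schema, which is also where the alignment benefit would come in."""
    obj = json.loads(model(prompt, schema=schema))  # hypothetical kwarg
    missing = set(schema.get("required", [])) - obj.keys()
    if missing:
        raise ValueError(f"model omitted required fields: {missing}")
    return obj
```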