I don’t think RLHF in particular had a very large counterfactual impact on commercialization or the arms race. The idea of non-RL instruction tuning for taking base models and making them more useful is very obvious for commercialization (there are multiple concurrent works to InstructGPT). PPO is better than just SFT or simpler approaches on top of SFT, but not groundbreakingly more so. You can compare text-davinci-002 (FeedME) and text-davinci-003 (PPO) to see.
The arms race was directly caused by ChatGPT, which took off quite unexpectedly not because of model quality due to RLHF, but because the UI was much more intuitive to users than the Playground (instruction following GPT3.5 was already in the API and didn’t take off in the same way). The tech tree from having a powerful base model to having a chatbot is not constrained on RLHF existing at all, either.
To be clear, I happen to also not be very optimistic about the alignment relevance of RLHF work beyond the first few papers—certainly if someone were to publish a paper today making RLHF twice as data efficient or whatever I would consider this basically just a capabilities paper.
I don’t think RLHF in particular had a very large counterfactual impact on commercialization or the arms race. The idea of non-RL instruction tuning for taking base models and making them more useful is very obvious for commercialization (there are multiple concurrent works to InstructGPT). PPO is better than just SFT or simpler approaches on top of SFT, but not groundbreakingly more so. You can compare text-davinci-002 (FeedME) and text-davinci-003 (PPO) to see.
The arms race was directly caused by ChatGPT, which took off quite unexpectedly not because of model quality due to RLHF, but because the UI was much more intuitive to users than the Playground (instruction following GPT3.5 was already in the API and didn’t take off in the same way). The tech tree from having a powerful base model to having a chatbot is not constrained on RLHF existing at all, either.
To be clear, I happen to also not be very optimistic about the alignment relevance of RLHF work beyond the first few papers—certainly if someone were to publish a paper today making RLHF twice as data efficient or whatever I would consider this basically just a capabilities paper.