OpenAI has not always been transparent about these questions, but in this case the blog post (linked in my post) makes it very clear that the model was trained with reinforcement learning from human feedback.
However, it was of course initially pretrained in an unsupervised fashion (it’s based on GPT-3), so it seems hard to know whether this specific behavior was “due to the RL” or simply “a likely continuation”.