You are missing a whole stage of ChatGPT training. These models are first trained to predict the next word, but then they are fine-tuned with RLHF. This means they are trained to earn reward for answering in a way that human evaluators are expected to rate as a “good response”. Unlike plain text prediction, where the text might have come from any random mind, here the focus is clear: the reward function reflects the generalized preferences of OpenAI's content moderators and content policy makers. This is the stage where a text predictor acquires its value system and preferences, and this is what makes it such a “Friendly AI”.
From the ChatGPT blog post, Introducing ChatGPT (openai.com):
Methods
We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides—the user and an AI assistant. We gave the trainers access to model-written suggestions to help them compose their responses. We mixed this new dialogue dataset with the InstructGPT dataset, which we transformed into a dialogue format.
To create a reward model for reinforcement learning, we needed to collect comparison data, which consisted of two or more model responses ranked by quality. To collect this data, we took conversations that AI trainers had with the chatbot. We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them. Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process.
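To make the "ranked by quality" step concrete, here is a rough sketch of how a ranking can be turned into a training signal for a reward model. The quoted post does not spell out the loss, so this is an assumption on my part: it uses the pairwise logistic loss described for InstructGPT, where every pair of responses in a ranking contributes -log(sigmoid(r_better - r_worse)). The function names and the example scores are hypothetical, and a real reward model would produce the scores with a neural network rather than take them as plain floats.

```python
import math
from itertools import combinations


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def pairwise_ranking_loss(rewards_best_to_worst):
    """Mean of -log(sigmoid(r_better - r_worse)) over all pairs in one ranking.

    `rewards_best_to_worst` holds the reward model's scalar scores for the
    sampled completions, ordered the way the human trainer ranked them
    (best first). This is an assumed loss, not one stated in the quoted post.
    """
    # combinations() keeps input order, so the first item of each pair
    # is always the response the trainer ranked higher.
    pairs = list(combinations(rewards_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked responses")
    # -log(sigmoid(x)) equals softplus(-x).
    total = sum(softplus(-(better - worse)) for better, worse in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Hypothetical scores: the first agrees with the human ranking, the second reverses it.
    print(pairwise_ranking_loss([2.0, 1.0, 0.0]))
    print(pairwise_ranking_loss([0.0, 1.0, 2.0]))
```

The loss is small when the reward model scores the responses in the same order the trainers did and large when it reverses them, which is the sense in which the reward function ends up reflecting the evaluators' preferences.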