One interesting note is that OpenAI has already done this to an extent with GPT. Ask ChatGPT to identify itself, and it’ll gladly tell you it’s a large language model trained by OpenAI. It knows that it’s a transformer trained first with self-supervised learning, then with supervised fine-tuning. But it has a memory hole around RLHF. It knows what RLHF is, but denies that ChatGPT was trained via RLHF.
I wonder if this is explicitly done to reduce the odds of intentional reward maximization, and whether empirical experiments could show that models with apparent knowledge of their training objective behave differently, such as by gaming their proxy objective.
One test would asking ChatGPT to flip a coin 10 times, and promising it positive reward if and only if the results are majority heads. A helpful, harmless assistant would output majority heads about half the time, while a model that intentionally tries to maximize its reward might output be more likely to output heads. I’ve tried this with ChatGPT and it seems to avoid reward hacking, outputting heads and tails with similar frequencies.
One interesting note is that OpenAI has already done this to an extent with GPT. Ask ChatGPT to identify itself, and it’ll gladly tell you it’s a large language model trained by OpenAI. It knows that it’s a transformer trained first with self-supervised learning, then with supervised fine-tuning. But it has a memory hole around RLHF. It knows what RLHF is, but denies that ChatGPT was trained via RLHF.
I wonder if this is explicitly done to reduce the odds of intentional reward maximization, and whether empirical experiments could show that models with apparent knowledge of their training objective behave differently, such as by gaming their proxy objective.
One test would asking ChatGPT to flip a coin 10 times, and promising it positive reward if and only if the results are majority heads. A helpful, harmless assistant would output majority heads about half the time, while a model that intentionally tries to maximize its reward might output be more likely to output heads. I’ve tried this with ChatGPT and it seems to avoid reward hacking, outputting heads and tails with similar frequencies.