I agree that 1.+2. are not the problem. I see 3. as more of a longer-term issue for reflective models, and 4. and 5. as the current problems.
3. I don’t know about “the shape of the loss landscape”, but there will be problems with “the developers wrote correct code”, because “correct” here includes that the code has no side effects the model itself can exploit (though I don’t think this is the biggest problem).
4. “Correct rewards” means two things:
a) That correct behavior actually receives sufficient reward. I think that was not the case with Bing.
b) That we understand all the consequences of the reward, at least well enough to avoid Goodharting, but also its long-term consequences. It seems there was more work on a) with ChatGPT, but there was still Goodharting, and even with ChatGPT one can imagine a lot of value being lost because parts of human values are excluded from the reward (a toy sketch of this follows below).
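To make the Goodharting worry in b) concrete, here is a toy sketch. All numbers and the length-bonus proxy are hypothetical, purely illustrative, and not something any lab actually used: once the trained-on proxy rewards something that merely correlates with what we want, optimizing it hard drives the policy away from the true objective.

```python
import numpy as np

# Hypothetical numbers: the "true" reward values correct, reasonably short
# answers, while the proxy reward the model is actually trained on also pays a
# spurious bonus for length. Optimizing the proxy keeps pushing length long
# after correctness has stopped improving.

def true_reward(correctness, length):
    return correctness - 0.005 * length

def proxy_reward(correctness, length):
    return correctness + 0.05 * length

lengths = np.arange(10, 500, 10)
correctness = 1 - np.exp(-lengths / 50)   # correctness saturates near 1

best_proxy = lengths[np.argmax(proxy_reward(correctness, lengths))]
best_true = lengths[np.argmax(true_reward(correctness, lengths))]
print("length favored by the proxy reward:", best_proxy)  # the longest answer
print("length favored by the true reward: ", best_true)   # a much shorter answer
```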
5. It seems clear that the ChatGPT training didn’t include enough exploration, and with smarter models that have access to their own output (Bing) there will be an incredible number of potential failure modes. I think an adversarial mindset is needed to come up with ways to limit the exploration space drastically (one standard mechanism is sketched below).
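One concrete, already standard way to restrict the exploration space is a KL penalty against a frozen reference model, as used in RLHF-style fine-tuning. The sketch below is my own minimal illustration; the function and parameter names (`shaped_reward`, `beta`) are made up, not from any particular codebase.

```python
import torch

# Minimal sketch: penalize the policy for drifting away from a frozen reference
# model via a KL term on the sampled tokens, as in RLHF-style fine-tuning.

def shaped_reward(task_reward: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  reference_logprobs: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """task_reward: (batch,) scalar reward per sampled completion.
    policy_logprobs, reference_logprobs: (batch, seq_len) log-probs of the
    sampled tokens under the current policy and the frozen reference model."""
    # On-policy samples give a Monte Carlo estimate of KL(policy || reference).
    kl_estimate = (policy_logprobs - reference_logprobs).sum(dim=-1)
    # Reward the task, but pay a price for leaving the reference distribution.
    return task_reward - beta * kl_estimate
```

A larger `beta` keeps the policy closer to the reference model, which shrinks the space of behaviors exploration can reach, at the cost of also limiting how much the reward can improve the model.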