This seems like a good way to think about some of the examples of mode collapse, but it doesn’t obviously cover all the cases. For example, when the model is asked to produce a random number, is there really a particular conversational goal that the RLHF’d model is optimizing, such that 97 is the best random number for that goal? In this case, Paul’s guess that RLHF’d models tend to push probability mass onto the base model’s most likely tokens seems more explanatory.
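As a rough illustration of what "pushing probability mass onto the base model's most likely tokens" could look like, here is a toy sketch. It assumes, purely for illustration, that the effect resembles sharpening the base distribution (raising each probability to a power and renormalizing, as with a low sampling temperature). That is not a claim about how RLHF actually works, and the weight given to 97 and the temperature value are made-up numbers.

```python
def sharpen(probs, temperature):
    """Renormalize p ** (1 / temperature) over all outcomes.

    A temperature below 1 concentrates mass on outcomes that were
    already the most likely ones.
    """
    weights = {k: p ** (1.0 / temperature) for k, p in probs.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical base-model distribution over the numbers 1..100:
# nearly uniform, with a mild (made-up) preference for 97.
raw = {n: 1.0 for n in range(1, 101)}
raw[97] = 3.0
norm = sum(raw.values())
base = {n: w / norm for n, w in raw.items()}

# Assumed sharpening strength; chosen only to make the effect visible.
sharp = sharpen(base, temperature=0.2)

print(f"base  P(97) = {base[97]:.3f}")
print(f"sharp P(97) = {sharp[97]:.3f}")
```

In this toy setup a mild preference in the base distribution turns into a dominant mode after sharpening, without appealing to any conversational goal for which 97 would be the best answer.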