I want advocates of strong coherence to explain why [...] sophisticated ML systems (e.g. foundation models[5]) aren’t strongly coherent.
I wouldn’t call myself an advocate for strong coherence, but two answers come to mind depending on how strong coherence is defined:
If the definition of strong coherence requires that the learned utility function be simple, like a laser-focused goal that is small relative to the state/action space, then I'd say big prediction-trained foundation models should not be expected to be strongly coherent: their training objective is extremely broad, and so induces a correspondingly broad learned utility.
If the definition doesn't require that the utility function be simple and focused, then I might say that foundation models are already pretty strongly coherent, in the sense that the model's behavior in one context does not step on its own toes in another context. The utility function of the predictor in any given context is defined over the next immediate output it produces. It's broad, densely defined during training, and perfectly shallow.
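To make that last claim concrete, here's a minimal sketch (my own, and assuming a Hugging Face-style causal LM interface, not anything from the quoted post) of the only "utility" the predictor is ever scored on for a given context: the log-probability of its immediate next output.

```python
import torch
import torch.nn.functional as F

def per_context_utility(model, context_ids: torch.Tensor, next_token_id: int) -> float:
    """The only 'utility' predictive training ever scores: the log-probability the
    model assigns to the immediate next token, given this one context. Later states
    and other contexts never enter the objective."""
    with torch.no_grad():
        logits = model(context_ids).logits            # (batch, seq_len, vocab_size)
        log_probs = F.log_softmax(logits[0, -1], dim=-1)
    return log_probs[next_token_id].item()
```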
Is “mode collapse” in RLHF’ed models an example of increased coherence?
I think the more useful near-term frame is that RLHF, at least when applied as a fine-tuning RL process that isn't equivalent to conditioning, gives the optimizer more room to roam: it removes the constraints that forced predictive training to maintain the model's output distribution. The reward function the fine-tuned model is trained against looks sparser. KL penalties can help maintain some of the previous constraints, but the usual difficulties with RL seem to keep that from being as robust as pure predictive training.
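For concreteness, the KL-penalty shaping in a typical PPO-style RLHF loop looks roughly like this (a sketch of the common per-token formulation; the function and variable names are mine):

```python
import torch

def shaped_rewards(preference_reward: float,
                   policy_logprobs: torch.Tensor,  # per-token log-probs under the RL policy
                   ref_logprobs: torch.Tensor,     # per-token log-probs under the base predictor
                   beta: float = 0.1) -> torch.Tensor:
    """Per-token reward in a PPO-style RLHF loop: the sparse preference-model score
    arrives only on the final token, while a KL penalty toward the base
    (prediction-trained) policy is charged at every token, partially preserving
    the base model's output distribution."""
    per_token = -beta * (policy_logprobs - ref_logprobs)  # ≈ per-token KL estimate
    per_token[-1] = per_token[-1] + preference_reward     # sparse reward lands at the end
    return per_token
```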
In the limit, I would assume RL leads to increased “coherence” in the sense of sharpness, because it’s very likely that the learned reward induced by RL training is much narrower and sparser than the predictive training objective. (It may not increase coherence in the sense of “not stepping on its own toes” because it was already pretty good at that.)
Eliciting the desired behavior through the conditioning learned under the densely defined predictive objective, rather than through a pure RL postpass, seems wise regardless.
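By "eliciting through conditioning" I just mean leaning on the predictor's learned conditionals instead of re-optimizing its weights; something like the sketch below, where the prompt wording and the Hugging Face-style generate call are stand-ins for whatever conditioning scheme you'd actually use:

```python
def elicit_by_conditioning(model, tokenizer, request: str) -> str:
    """Steer the densely trained predictor by conditioning: construct a context under
    which the desired behavior is the likely continuation, rather than reshaping the
    weights with an RL postpass against a sparse reward."""
    conditioning = (
        "The following is a reply from a careful, honest assistant.\n\n"
        f"User: {request}\nAssistant:"
    )
    inputs = tokenizer(conditioning, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```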
(I very recently posted a thingy about this.)