I think Janus’ post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That’s clearly true and intentional, and you can’t get entropy back just by turning up temperature. The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or just aren’t backed by data or stated explicitly.
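To spell out the temperature point with a toy numpy sketch (my own construction with made-up numbers, not anything from Janus’ post): once the next-token distribution has collapsed onto one continuation, moderate temperature barely changes it, and very high temperature mostly smears probability toward uniform noise rather than recovering the base model’s particular spread over plausible continuations.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def softmax_at_temperature(logits, T):
    z = logits / T
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)

# Hypothetical next-token logits over 50 tokens: a "base model" with a
# structured spread of plausible continuations, and a "collapsed" model with
# the same logits except one continuation boosted far above the rest.
base_logits = 3.0 * rng.normal(size=50)
collapsed_logits = base_logits.copy()
collapsed_logits[0] += 15.0

p_base = softmax_at_temperature(base_logits, 1.0)

for T in [1.0, 1.5, 2.0, 4.0]:
    p = softmax_at_temperature(collapsed_logits, T)
    kl_to_base = float((p_base * np.log(p_base / p)).sum())
    print(f"T={T}: entropy={entropy(p):.2f} nats, "
          f"KL(base || collapsed@T)={kl_to_base:.2f}")

# In this toy setup, raising T eventually pushes entropy back up, but the
# resulting distribution heads toward uniform rather than back toward p_base:
# the KL to the base model stays substantial, i.e. the original diversity
# isn't recovered.
```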
So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don’t have the useful safety measures implied by being weighted by a true approximation of our world.
If predicting webtext is a good way to get things done, people can do that. But probably it isn’t, and so people probably won’t do that unless you give them a good reason.
That said, almost all the differences that Janus and you are highlighting emerge from supervised fine-tuning. I don’t know in what sense “predict human demonstrators” is missing an important safety property that “predict internet text” has, and right now it feels to me like kind of magical thinking.
The main way I can see it going is that you can condition the webtext model on other things like “there is a future AGI generating this text...” or “What action leads to consequence X?” But I think those things are radically less safe than predicting demonstrations in the lab, and lead to almost all the same difficulties if they in fact improve capabilities.
Maybe the safety loss comes from “produce things that evaluators in the lab like” rather than “predict demonstrations in the lab”? There is one form of this I agree with—models trained with RLHF will likely try to produce outputs humans rate highly, including by e.g. producing outputs that drive humans insane to give them a good rating or whatever. But overall people seem to be reacting to some different more associative reason for concern that I don’t think makes sense (yet).
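For reference, the objective I have in mind when I say “produce things that evaluators in the lab like”: in the usual RLHF setup the policy is trained to maximize a learned reward model’s score, typically with a KL penalty pulling it back toward the pretrained or supervised-fine-tuned reference model. Here is a minimal sketch of that objective on toy tensors (placeholder names, not any particular library’s API or training loop):

```python
import torch
import torch.nn.functional as F

def rlhf_objective(policy_logits, ref_logits, reward, beta=0.1):
    """Toy version of the KL-regularized RLHF objective.

    policy_logits, ref_logits: [batch, vocab] next-token logits from the
    tuned policy and the frozen reference model.
    reward: [batch] scores from a learned reward model, i.e. a proxy for
    what human evaluators rate highly.
    """
    logp = F.log_softmax(policy_logits, dim=-1)
    logp_ref = F.log_softmax(ref_logits, dim=-1)
    # KL(policy || reference): how far the tuned model has drifted from the
    # distribution that was fit to webtext / demonstrations.
    kl = (logp.exp() * (logp - logp_ref)).sum(dim=-1)
    # The quantity being maximized is the reward model's score, with the KL
    # term as the main pull back toward the original distribution.
    return (reward - beta * kl).mean()
```

The point is just that the training signal is literally “what the evaluators’ reward model likes”; the KL term limits drift from the reference model but doesn’t change what is being optimized for.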
Another reason for not liking RLHF that’s somewhat related to the Anthropic paper you linked: because most contexts in which RLHF is used involve agentic simulacra, RLHF focuses the model’s computation on agency in some sense.
So does conditioning the model to get it to do something useful. Also I think “focuses the model’s computation on agency in some sense” is probably too vague to be a helpful way to think about what’s going on—it seems like it leads the model to produce outputs that it thinks would have certain kinds of consequences, or that imitate the kinds of heuristics and processes used by consequentialists in the dataset. This happens quite a lot when you continue webtext, since it’s all written by consequentialists.
I think Janus’ post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That’s clearly true and intentional, and you can’t get entropy back just by turning up temperature.
I think I agree that this is the most object-level takeaway; my take, then, is primarily about how to conceptualize this loss of entropy (where and in what form) and what else it might imply. I found the “narrowing the prior” frame rather intuitive in this context.
That said, almost all the differences that Janus and you are highlighting emerge from supervised fine-tuning. I don’t know in what sense “predict human demonstrators” is missing an important safety property that “predict internet text” has, and right now it feels to me like kind of magical thinking.
I agree that everything I said above qualitatively applies to supervised fine-tuning as well. As I mentioned in another comment, I don’t expect the RL part to play a huge role until we get to wilder applications. I’m more worried about RLHF because I expect it to be scaled up a lot more in the future, and because it plausibly does what fine-tuning does, only better (this is just based on how more recent models have shifted to RLHF instead of ordinary fine-tuning).
I don’t think “predict human demonstrators” is how I would frame the relevant effect from fine-tuning. More concretely, what I’m picturing is along the lines of: if you fine-tune the model such that continuations in a conversation are more polite/inoffensive (where this is a stand-in for whatever “better”-rated completions are), then you’re no longer learning the actual distribution of the world. You’re trying to learn a distribution that’s identical to ours except that conversations are more polite. In other words, you’re trying to predict “X, but nicer”.
The problem I see with this is that you aren’t just changing that one property in isolation; you’re also affecting the other dynamics it interacts with. Conversations in our world just aren’t that likely to be polite. Changing that characteristic ripples out to change other properties upstream and downstream of it in a simulation. Making this kind of change seems to lead to rather unpredictable downstream changes. I say seems because -
The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or just aren’t backed by data or stated explicitly.
- This is interesting. Could you elaborate on this? I think this might be a crux in our disagreement.
Maybe the safety loss comes from “produce things that evaluators in the lab like” rather than “predict demonstrations in the lab”?
I don’t think the safety loss (at least the part I’m referring to here) comes from the first-order effects of predicting something else. It’s the second-order effects on GPT’s prior at large, from changing a few aspects of it, that seem hard to predict and are therefore worrying to me.
So does conditioning the model to get it to do something useful.
I agree. I think there’s a qualitative difference when you’re changing the model’s learned prior rather than just conditioning, though. Specifically, where ordinary GPT has to learn a lot of different processes at relatively similar fidelity to accurately simulate all the different kinds of contexts it was trained on, fine-tuned GPT can learn to simulate some kinds of processes with higher fidelity at the expense of others that are well outside the context of what it’s been fine-tuned on.
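To make the “at the expense of others” point concrete, here is a toy experiment (entirely my own construction, with made-up names; not something from the post): pretrain a tiny shared-parameter model on several “contexts”, fine-tune it on just one context with a sharpened (“nicer”) target distribution, and then measure how much the predictions for the untouched contexts drift. Because capacity is shared, fine-tuning on one context generally moves the others too, whereas conditioning the base model leaves every other conditional exactly as it was.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_contexts, n_tokens, hidden = 8, 10, 16

class TinyLM(nn.Module):
    """Maps a context id to a next-token distribution through a shared MLP,
    so all contexts compete for the same capacity."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_contexts, hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_tokens))

    def forward(self, ctx):
        return self.mlp(self.embed(ctx))

def train(model, targets, contexts, steps, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.kl_div(F.log_softmax(model(contexts), dim=-1),
                        targets[contexts], reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()

# "Webtext": one ground-truth next-token distribution per context.
true_dists = torch.softmax(torch.randn(n_contexts, n_tokens), dim=-1)
all_ctx = torch.arange(n_contexts)

base = TinyLM()
train(base, true_dists, all_ctx, steps=2000)            # "pretraining"

# "Fine-tuning": context 0's target is sharpened ("X, but nicer"); the other
# contexts' targets are unchanged, but they never appear in training.
tuned = copy.deepcopy(base)
nicer = true_dists.clone()
nicer[0] = torch.softmax(3 * torch.log(true_dists[0]), dim=-1)
train(tuned, nicer, torch.zeros(64, dtype=torch.long), steps=500)

with torch.no_grad():
    p_base = F.softmax(base(all_ctx), dim=-1)
    p_tuned = F.softmax(tuned(all_ctx), dim=-1)
    drift = (p_base * (p_base / p_tuned).log()).sum(dim=-1)
print("KL(base || tuned) per context:", [round(float(d), 4) for d in drift])
# Context 0 moves by design; the interesting part is whatever nonzero drift
# shows up in the contexts the fine-tuning never touched.
```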
(As stated in the parent, I don’t have very high credence in my stance, and the lack of accurate epistemic status disclaimers in some places is probably just because I wanted to write fast).
I mostly care about how an AI selected to choose actions that lead to high reward might select actions that disempower humanity to get a high reward, or about how an AI pursuing other ambitious goals might choose low loss actions instrumentally and thereby be selected by gradient descent.
Perhaps there are other arguments for catastrophic risk based on the second-order effects of changes from fine-tuning rippling through an alien mind, but if so I want to see either those arguments spelled out or more direct empirical evidence about such risks.