In terms of being able to sample from the conditional, I don’t think that the important constraint here is α + β + γ = 1. Rather, it seems that the important constraint is that our architecture can only sample from distributions of the form α·N(μ_A, σ_A²) + β·N(μ_B, σ_B²) + γ·N(μ_C, σ_C²); even allowing α, β, γ to be arbitrary real numbers, this will never be the same as either (a) the distribution produced by conditioning the base model on high persuasiveness, or (b) the distribution which maximizes expected persuasiveness minus KL divergence from the base model.
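Here's a minimal numerical sketch of that point (all the parameters below, i.e. the component means and variances, the tanh "persuasiveness" reward, and the temperature, are made up for illustration rather than taken from the example): tilt a fixed three-component Gaussian mixture by a Boltzmann factor, then check how closely any reweighting of the same fixed components can match the result.

```python
import numpy as np
from scipy import optimize, stats

# Grid for numerical integration.
xs = np.linspace(-10, 10, 4001)

# Hypothetical fixed components A, B, C and base mixture weights (made up for illustration).
mus = np.array([-2.0, 0.0, 3.0])
sigmas = np.array([1.0, 0.7, 1.5])
base_w = np.array([0.3, 0.5, 0.2])
components = np.array([stats.norm.pdf(xs, m, s) for m, s in zip(mus, sigmas)])
base = base_w @ components

# Tilt the base distribution by a Boltzmann factor exp(r(x)/T), using a made-up nonlinear reward r.
r = np.tanh(xs)
T = 0.5
tilted = base * np.exp(r / T)
tilted /= np.trapz(tilted, xs)

def kl_to_mixture(w):
    """KL(tilted || mixture of the *fixed* components with weights w), w restricted to the simplex."""
    w = np.clip(w, 1e-12, None)
    w = w / w.sum()
    q = w @ components
    return np.trapz(tilted * (np.log(tilted) - np.log(q)), xs)

best = optimize.minimize(kl_to_mixture, base_w, method="Nelder-Mead")
print("best achievable KL(tilted || constrained family):", best.fun)
# A strictly positive minimum is the point: no choice of weights over the fixed
# components reproduces the tilted (conditioned / KL-regularized-optimal) distribution.
```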
I’m not sure the above point is an important one. I just wanted to disambiguate some different capability limitations which appeared in the example:
1. limitations on what sorts of distributions the architecture could approximate;
2. limitations on the latent capabilities in the base model for producing true/persuasive outputs;
3. limitations on how much steering each of the various latent capabilities gets to exert (α + β + γ = 1).
On my understanding, your point was about limitation (1). But I don’t feel especially nervous about limitation (1) -- taking the output distribution of our pretrained model and weighting it by a Boltzmann factor feels like it should produce a kinda crazy distribution, and my naive intuition is that we shouldn’t necessarily expect our model to be able to approximate this distribution that well after RL finetuning with a KL penalty.
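(Spelling out the Boltzmann-factor framing, since it's the standard result being invoked here: maximizing the KL-penalized objective E_π[r(y)] − β·KL(π ‖ π₀) is solved by π*(y) ∝ π₀(y)·exp(r(y)/β), i.e. the base distribution reweighted by a Boltzmann factor. So limitation (1) is the question of whether the constrained architecture can represent that reweighted distribution.)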
I think I’m most nervous about the way we modeled limitation (3): I have no idea how to think about the extent to which models’ capabilities trade off against one another, and taking α, β, γ ∈ [0, 1] without additional constraints would have resulted in outputs of mean truthiness α′μ_A + μ_B for some α′ which we can’t pin down without specifying additional details (e.g. is there weight decay?).
I’m also nervous about this way of modeling limitations (2) and (3), since it seems like it leads directly to the conclusion “fine-tuning always trades off truthfulness and persuasion, but conditioning can improve both.”