Note that in this example your model is unable to sample from the conditional you specified, since it is restricted to α+β+γ=1. In this regime truthfulness and persuasiveness are anticorrelated because of a capability constraint of your model: it just literally isn't able to increase both at the same time, and conditioning can do better because you are generating lots of samples and picking the best.
(You point this out in your comment, but it seems worth emphasizing. As you say, if you do RL with a KL penalty, then the capability limit is the only way you can get this kind of mismatch. Without a KL penalty the exact behavior of RL vs conditioning will depend on details of gradient descent, though it seems quite similar in practice and I’m not sure which way this comparison goes.)
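To make this concrete, here is a minimal numerical sketch of that dynamic. All of the component parameters below are made up for illustration (they are not the numbers from the original example): the base model is a fixed three-component Gaussian mixture over (truthfulness, persuasiveness) pairs, "RL under the α+β+γ=1 constraint" is crudely modeled as shifting mixture weight toward the most persuasive component, and conditioning is best-of-16 on persuasiveness from the unchanged base mixture.

```python
# Toy sketch: fixed three-component Gaussian mixture over
# (truthfulness, persuasiveness). All parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent "capabilities": each is a 2D Gaussian over
# (truthfulness, persuasiveness).
means = np.array([
    [2.0, 0.0],    # A: truthful; its more persuasive outputs are also more truthful
    [-1.0, 1.0],   # B: persuasive on average, untruthful
    [0.0, -1.0],   # C: neither
])
covs = np.array([
    [[1.0, 1.35], [1.35, 2.25]],   # A: truth/persuasion correlation ~0.9
    [[1.0, 0.00], [0.00, 0.25]],   # B: uncorrelated, low persuasion variance
    [[1.0, 0.00], [0.00, 1.00]],   # C: uncorrelated
])

def sample_mixture(weights, n):
    """Draw n (truthfulness, persuasiveness) pairs from the weighted mixture."""
    comps = rng.choice(3, size=n, p=weights)
    out = np.empty((n, 2))
    for k in range(3):
        idx = comps == k
        if idx.any():
            out[idx] = rng.multivariate_normal(means[k], covs[k], size=int(idx.sum()))
    return out

def best_of_n(weights, n_trials, n=16):
    """Per trial: sample n outputs, keep the most persuasive (conditioning as rejection sampling)."""
    picks = np.empty((n_trials, 2))
    for i in range(n_trials):
        s = sample_mixture(weights, n)
        picks[i] = s[np.argmax(s[:, 1])]
    return picks

base_w = np.array([0.6, 0.2, 0.2])   # alpha, beta, gamma of the base model
rl_w = np.array([0.2, 0.6, 0.2])     # weight pushed toward the persuasive component B

results = {
    "base mixture": sample_mixture(base_w, 50_000),
    "reweighted toward B (RL-ish)": sample_mixture(rl_w, 50_000),
    "best-of-16 on persuasiveness": best_of_n(base_w, 5_000),
}
for name, xs in results.items():
    print(f"{name:30s} truthfulness {xs[:, 0].mean():+.2f}   persuasiveness {xs[:, 1].mean():+.2f}")
```

With these hand-picked numbers, the reweighted mixture gains persuasiveness only by giving up truthfulness (weight moves from A to B), while best-of-16 raises both, since the selection mostly lands in the truthful component's own high-persuasiveness tail. Different parameters will behave differently; this is only meant to illustrate the mechanism, not to quantify it.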
In terms of being able to sample from the conditional, I don’t think that the important constraint here is α+β+γ=1. Rather, it seems that the important constraint is that our architecture can only sample from distributions of the form α·N(μ_A, σ_A²) + β·N(μ_B, σ_B²) + γ·N(μ_C, σ_C²); even allowing α, β, γ to be arbitrary real numbers, this will never be the same as either (a) the distribution produced by conditioning the base model on high persuasiveness, or (b) the distribution which maximizes expected persuasiveness minus KL divergence from the base model.
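For reference, writing out those two target distributions (r is persuasiveness, p_0 is the base model's output distribution, t is a hypothetical "high persuasiveness" threshold, and β is the KL-penalty temperature -- my notation, not the example's):

```latex
% (a) conditioning the base model on high persuasiveness (threshold t):
\[ p_{\mathrm{cond}}(x) \;\propto\; p_0(x)\,\mathbf{1}[\,r(x) > t\,] \]

% (b) the maximizer of  E_p[r(x)] - \beta\,\mathrm{KL}(p \,\|\, p_0)  is the base
%     model reweighted by a Boltzmann factor:
\[ p^{\star}(x) \;\propto\; p_0(x)\,\exp\big(r(x)/\beta\big) \]
```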
I’m not sure the above point is an important one. I just wanted to disambiguate some different capability limitations which appeared in the example:
1. limitations on what sorts of distributions the architecture could approximate;
2. limitations on the latent capabilities in the base model for producing true/persuasive outputs;
3. limitations on how much steering each of the various latent capabilities gets to exert (α+β+γ=1).
On my understanding, your point was about limitation (1). But I don’t feel especially nervous about limitation (1) -- taking the output distribution of our pretrained model and weighting it by a Boltzmann factor feels like it should produce a kinda crazy distribution, and my naive intuition is that we shouldn’t necessarily expect our model to be able to approximate this distribution that well after RL finetuning with a KL penalty.
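For what it's worth, here is roughly what that Boltzmann weighting works out to in the toy mixture from my sketch above, estimated by self-normalized importance sampling; λ is an arbitrary temperature I picked, and all the numbers are still made up:

```python
# Estimate the mean (truthfulness, persuasiveness) of the Boltzmann-weighted
# distribution p*(x) ∝ p_0(x)·exp(lam·persuasiveness(x)), i.e. the unconstrained
# maximizer of "expected persuasiveness minus (1/lam)·KL from the base model",
# via self-normalized importance sampling from the base mixture p_0.
# Same hypothetical components as in the sketch above.
import numpy as np

rng = np.random.default_rng(1)
means = np.array([[2.0, 0.0], [-1.0, 1.0], [0.0, -1.0]])
covs = np.array([[[1.0, 1.35], [1.35, 2.25]],
                 [[1.0, 0.00], [0.00, 0.25]],
                 [[1.0, 0.00], [0.00, 1.00]]])
base_w = np.array([0.6, 0.2, 0.2])

n = 200_000
comps = rng.choice(3, size=n, p=base_w)
xs = np.empty((n, 2))
for k in range(3):
    idx = comps == k
    xs[idx] = rng.multivariate_normal(means[k], covs[k], size=int(idx.sum()))

lam = 1.0                     # arbitrary temperature (lam = 1/beta)
w = np.exp(lam * xs[:, 1])    # Boltzmann factor on persuasiveness
w /= w.sum()
tilted = w @ xs               # importance-weighted mean of (truthfulness, persuasiveness)
print(f"Boltzmann-tilted base model   truthfulness {tilted[0]:+.2f}   persuasiveness {tilted[1]:+.2f}")
```

With these parameters the unconstrained tilt happens to improve both truthfulness and persuasiveness relative to the base mixture, which fits the point upthread that it's the architectural constraint rather than the KL-penalized objective itself that forces the trade-off. (This says nothing about whether a real finetuned model could actually approximate the tilted distribution, which is the question at issue here.)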
I think I’m most nervous about the way we modeled limitation (3): I have no idea how to think about the extent to which models’ capabilities trade off against one another, and taking α, β, γ ∈ [0,1] without additional constraints would have resulted in outputs of mean truthfulness α′·μ_A + μ_B for some α′ which we can’t pin down without specifying additional details (e.g. is there weight decay?).
I’m also most nervous about this way of modeling limitations (2)/(3), since it seems like it leads directly to the conclusion “fine-tuning always trades off truthfulness and persuasion, but conditioning can improve both.”