I think that RLHF is reasonably likely to be safer than prompt engineering: RLHF is probably a more powerful technique for eliciting your model’s capabilities than prompt engineering is. And so if you need to make a system which has some particular level of performance, you can probably achieve that level of performance with a less generally capable model if you use RLHF than if you use prompt engineering.
Wait, that doesn’t really follow. RLHF can elicit more capabilities than prompt engineering, yes, but how is that a reason for RLHF being safer than prompt engineering?
Here’s my steelman of this argument:

Premise 1: There is some quantity called a “level of performance”.
Premise 2: A certain level of performance, $P_1$, is necessary to assist humans in ending the acute risk period.
Premise 3: A higher level of performance, $P_2$, is necessary for a treacherous turn.
Premise 4: Any given alignment strategy $A$ is associated with a factor $\lambda_A \in [0, 1]$, such that it can convert an unaligned model with performance $P$ into an aligned model with performance $\lambda_A P$.
Premise 5: The maximum achievable performance of unaligned models increases somewhat gradually as a function of time, $P(t)$.
Premise 6: Given two alignment strategies $A$ and $B$ such that $\lambda_A > \lambda_B$, it is more likely that $\exists t.\ \lambda_A P(t) \geq P_1 \wedge P(t) < P_2$ than that $\exists t.\ \lambda_B P(t) \geq P_1 \wedge P(t) < P_2$.
Conclusion: Therefore, a treacherous turn is less likely in a world with alignment strategy $A$ than in a world with only alignment strategy $B$.
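To make Premise 6 concrete, here’s a minimal numerical sketch. The capability curve, the thresholds, and the $\lambda$ values are all made-up illustrative assumptions; the code just checks whether some time exists at which the aligned performance clears $P_1$ while the unaligned performance is still below $P_2$.

```python
# Toy illustration of Premise 6: with a larger lambda, it is more likely that
# some time t satisfies lambda * P(t) >= P1 while P(t) < P2.
# All numbers here are invented purely for illustration.

P1 = 70   # performance needed to help end the acute risk period (made up)
P2 = 100  # performance needed for a treacherous turn (made up)

def P(t):
    """Hypothetical maximum performance of unaligned models at time t."""
    return 10 * t  # gradual capability growth, purely illustrative

def safe_window_exists(lam, times):
    """Is there a time where aligned performance suffices but a treacherous turn doesn't?"""
    return any(lam * P(t) >= P1 and P(t) < P2 for t in times)

times = range(1, 21)
for lam in (0.9, 0.5):  # lambda_A > lambda_B
    print(f"lambda = {lam}: safe window exists = {safe_window_exists(lam, times)}")
# With lambda = 0.9 the window exists (e.g. t = 8: 0.9*80 = 72 >= 70 and 80 < 100);
# with lambda = 0.5 it doesn't, since 0.5*P(t) >= 70 requires P(t) >= 140 > P2.
```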
I’m pretty skeptical of premises 3, 4, and 5, but I think the argument actually is valid, and my guess is that Buck and a lot of other folks in the prosaic-alignment space essentially just believe those premises are plausible.

You can weaken the premises a lot:
Instead of having a single “level of performance”, you can have different levels of performance for “ending the acute risk period” vs. “treacherous turn”. This allows you to get rid of the “higher” part of premise 3, which does seem pretty sketchy to me. It also allows you to talk about $\lambda_A$ factors for particular tasks on particular models rather than for “overall performance” in premise 4, which seems much more realistic (since $\lambda_A$ could vary for different tasks or different models).
You can get rid of Premise 5 entirely, if you have Premise 6 say “at least as likely” rather than “more likely”.
I’d rewrite as:
Premise 1: There is some quantity called a “level of performance at a given task”.
Definition 1: Let $P_S$ be the level of performance at whatever you want the model to do (e.g. ending the acute risk period), and let $P^*_S$ be the level of performance you want.
Definition 2: Let $P_T$ be the level of performance at executing a treacherous turn, and let $P^*_T$ be the level of performance required to be successful.
Definition 3: Consider a setting where we apply an alignment strategy $A$ to a particular model. Suppose that the model has latent capabilities sufficient to achieve performance $(P_S, P_T)$. Then, the resulting model must have actual capabilities $(\lambda^A_S P_S, \lambda^A_T P_T)$ for some factors $\lambda^A_S, \lambda^A_T \in [0, \infty]$. (If you assume that $A$ can only elicit capabilities rather than creating new capabilities, then you have $\lambda^A_* \leq 1$.)
Lemma 1: Consider two alignment strategies $A$ and $B$ applied to a model $M$, where you are uncertain about the latent capabilities $(P_S, P_T)$, but you know that $\lambda^A_S > \lambda^B_S$ and $\lambda^A_T = \lambda^B_T$. Then, it is at least as likely that $\lambda^A_S P_S \geq P^*_S \wedge \lambda^A_T P_T < P^*_T$ as that $\lambda^B_S P_S \geq P^*_S \wedge \lambda^B_T P_T < P^*_T$. (A toy numerical check of this is sketched after the argument.)
Premise 2: $\lambda^{\text{RLHF}}_S > \lambda^{\text{prompt}}_S$.
Premise 3: $\lambda^{\text{RLHF}}_T = \lambda^{\text{prompt}}_T$. This corresponds to Buck’s point, which I agree with, here:
I don’t think RLHF seems worse than these other ways you could try to get your model to use its capabilities to be helpful. (Many LessWrong commenters disagree with me here but I haven’t heard them give any arguments that I find compelling.)
Conclusion: RLHF is at least as likely to allow you to [reach the desired level of performance at the task while avoiding a treacherous turn] as prompt engineering.
(If you consider the task “end the acute risk period”, then this leads to the same conclusion as in your argument, though that bakes in an assumption that we will use a single AI model to end the acute risk period, which I don’t agree with.)
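Here’s a minimal Monte Carlo sketch of Lemma 1. Everything in it, the distributions over latent capabilities, the thresholds, and the specific $\lambda$ values, is an assumption made purely for illustration: because strategy $A$ only raises the task-performance factor while the treacherous-turn factor is identical, the “good outcome” event under $B$ is contained in the one under $A$, so its frequency can’t be higher.

```python
import random

# Toy Monte Carlo check of Lemma 1. All distributions and numbers are made up
# for illustration. With lambda_S^A > lambda_S^B and lambda_T^A == lambda_T^B,
# the event "task performance clears P_S* while treacherous-turn performance
# stays below P_T*" under B is contained in the same event under A.

random.seed(0)

P_S_STAR, P_T_STAR = 70.0, 100.0   # desired task performance; treacherous-turn threshold
lam_S_A, lam_S_B = 0.9, 0.6        # A elicits more of the task capability than B
lam_T = 0.5                        # identical treacherous-turn factor for both strategies

# Uncertainty over the model's latent capabilities (P_S, P_T); arbitrary distributions.
samples = [(random.uniform(0, 150), random.uniform(0, 300)) for _ in range(100_000)]

def good_outcome(lam_S, P_S, P_T):
    return lam_S * P_S >= P_S_STAR and lam_T * P_T < P_T_STAR

p_A = sum(good_outcome(lam_S_A, ps, pt) for ps, pt in samples) / len(samples)
p_B = sum(good_outcome(lam_S_B, ps, pt) for ps, pt in samples) / len(samples)
print(f"P(good outcome | A) = {p_A:.3f} >= P(good outcome | B) = {p_B:.3f}")
```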
Critiquing the premises of this new argument:
Premise 1 is a simplification but seems like a reasonable one.
Premises 2 and 3 are not actually correct: the more accurate thing to say is that they hold in expectation given our uncertainty about the underlying parameters. Unfortunately, Lemma 1 stops being a theorem if you have distributions over $\lambda^A_T$ and $\lambda^B_T$. Still, I think the fact that the argument goes through if you replace the distributions with their expectations should still be a strong argument in favor of the conclusion, absent some specific reason to disbelieve it.
My interpretation here is that, compared to prompt engineering, RLHF elicits a larger fraction of the model’s capabilities, which makes it safer because the gap between “capability you can elicit” and “underlying latent capability” is smaller.
For instance, suppose we want to demonstrate that a model has excellent hacking capabilities, and say this corresponds to a performance of 70. We pre-train a model that has a performance of 50 without anything special, but the pre-trained model has a lot of potential, and we think it has a max capability of 100. Using RLHF boosts performance by 40% over pre-training, so it could get our model’s performance up to 70; prompt engineering isn’t as effective and only boosts performance by 20%, up to 60. So, if we want to elicit performance 70 using prompt engineering, we need a model that is more powerful at baseline (and thus has more powerful potential). Math: 1.2x = 70, so x ≈ 58.3 is the model’s pre-trained performance, and such a model has a max capability of 58.3 × 2 ≈ 116.7. So in this example, to get the desired behavior, we can use a less powerful model with RLHF, but we need a more powerful model with prompt engineering.
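Here’s the same arithmetic as a small sketch; the 2:1 ratio of max (latent) capability to pre-trained performance and all the other numbers are just the illustrative assumptions from the example above.

```python
# Worked arithmetic from the example above; all numbers are illustrative assumptions.

target = 70          # performance needed to demonstrate the hacking capability
rlhf_boost = 1.4     # RLHF elicits 40% more than pre-trained performance
prompt_boost = 1.2   # prompt engineering elicits 20% more
max_ratio = 2.0      # assumed ratio of max (latent) capability to pre-trained performance

# With RLHF: a pre-trained performance of 50 suffices (max capability 100).
rlhf_base = target / rlhf_boost
print(f"RLHF:   pre-trained {rlhf_base:.1f}, max capability {rlhf_base * max_ratio:.1f}")

# With prompt engineering: we need a stronger base model (and thus more latent capability).
prompt_base = target / prompt_boost
print(f"Prompt: pre-trained {prompt_base:.1f}, max capability {prompt_base * max_ratio:.1f}")
# -> RLHF:   pre-trained 50.0, max capability 100.0
# -> Prompt: pre-trained 58.3, max capability 116.7
```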
Why it might be good to have models with lower potential capabilities:
- we might be worried about sudden jumps in capability from being able to better utilize pre-trained models. If “ability we are able to elicit” is only slightly below “maximum possible ability”, then this is less of a problem.
- we might be worried about a treacherous turn in which an AI is able to utilize almost all of its potential capabilities; we would be taken by surprise and in trouble if this included capabilities much greater than what we have (or even what we know exist).
- accidents will be less bad if our model has an overall lower capability potential.
- [edited to add] This list is not meant to be comprehensive, but here’s an important one: we might expect models with capabilities above some threshold to be deceptive and capable of a treacherous turn. In some sense, we want to get close to this line without crossing it, in order to take actions which protect the world from misaligned AIs. In the above example, we could imagine that getting a max capability above 105 results in a deceptive model which can execute a treacherous turn. Using RLHF on the 100-max model allows us to get the behavior we wanted, but if we are using prompt engineering, then we would need to make a dangerous model to get the desired behavior (a toy version of this check is sketched below).
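A minimal continuation of the earlier sketch; the deception threshold of 105 on max capability is an assumption made purely for illustration.

```python
# Continues the example above; the deception threshold of 105 is an illustrative assumption.

target = 70
max_ratio = 2.0
deception_threshold = 105  # assumed max capability beyond which the model can turn treacherous

for name, boost in [("RLHF", 1.4), ("prompt engineering", 1.2)]:
    max_capability = (target / boost) * max_ratio
    dangerous = max_capability > deception_threshold
    print(f"{name}: required max capability {max_capability:.1f}, "
          f"{'crosses' if dangerous else 'stays under'} the {deception_threshold} threshold")
# -> RLHF: required max capability 100.0, stays under the 105 threshold
# -> prompt engineering: required max capability 116.7, crosses the 105 threshold
```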