Here’s my steelman of this argument:
Premise 1: There is some quantity called a “level of performance”.
Premise 2: A certain level of performance, $P_1$, is necessary to assist humans in ending the acute risk period.
Premise 3: A higher level of performance, $P_2$, is necessary for a treacherous turn.
Premise 4: Any given alignment strategy $A$ is associated with a factor $\lambda_A \in [0, 1]$, such that it can convert an unaligned model with performance $P$ into an aligned model with performance $\lambda_A P$.
Premise 5: The maximum achievable performance of unaligned models increases somewhat gradually as a function of time, $P(t)$.
Premise 6: Given two alignment strategies $A$ and $B$ such that $\lambda_A > \lambda_B$, it is more likely that $\exists t.\ \lambda_A P(t) \geq P_1 \wedge P(t) < P_2$ than that $\exists t.\ \lambda_B P(t) \geq P_1 \wedge P(t) < P_2$.
Conclusion: Therefore, a treacherous turn is less likely in a world with alignment strategy $A$ than in a world with only alignment strategy $B$.
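To make the shape of this argument concrete, here's a tiny numeric sketch in Python. All of the values for $\lambda_A$, $\lambda_B$, $P_1$, $P_2$, and $P(t)$ are made up; the point is just to show how a higher $\lambda$ can open a window of time in which premise 6's condition holds.

```python
# Toy illustration of the argument above; every number here is made up.
# We check whether there is some time t at which the aligned model clears the
# "end the acute risk period" bar (lambda * P(t) >= P1) while the unaligned
# model is still below the "treacherous turn" bar (P(t) < P2).

P1, P2 = 100.0, 150.0           # performance levels from premises 2 and 3
lambda_A, lambda_B = 0.9, 0.5   # alignment factors for strategies A and B (premise 4)

def P(t):
    """Maximum achievable unaligned performance, rising gradually with time (premise 5)."""
    return 10.0 * t

def safe_window_exists(lam, t_max=100):
    return any(lam * P(t) >= P1 and P(t) < P2 for t in range(t_max))

print("Strategy A:", safe_window_exists(lambda_A))  # True, e.g. at P(t) = 120
print("Strategy B:", safe_window_exists(lambda_B))  # False: B needs P(t) >= 200, which is >= P2
```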
I’m pretty skeptical of premises 3, 4, and 5, but I think the argument actually is valid, and my guess is that Buck and a lot of other folks in the prosaic-alignment space essentially just believe those premises are plausible.
You can weaken the premises a lot:
Instead of having a single “level of performance”, you can have different levels of performance for “ending the acute risk period” vs “treacherous turn”. This allows you to get rid of the “higher” part of premise 3, which does seem pretty sketchy to me. It also allows you to talk about $\lambda_A$ factors for particular tasks on particular models rather than for “overall performance” in premise 4, which seems much more realistic (since $\lambda_A$ could vary for different tasks or different models).
You can get rid of Premise 5 entirely, if you have Premise 6 say “at least as likely” rather than “more likely”.
I’d rewrite as:
Premise 1: There is some quantity called a “level of performance at a given task”.
Definition 1: Let $P_S$ be the level of performance at whatever you want the model to do (e.g. ending the acute risk period), and let $P_S^*$ be the level of performance you want.
Definition 2: Let $P_T$ be the level of performance at executing a treacherous turn, and let $P_T^*$ be the level of performance required to be successful.
Definition 3: Consider a setting where we apply an alignment strategy $A$ to a particular model. Suppose that the model has latent capabilities sufficient to achieve performance $(P_S, P_T)$. Then, the resulting model must have actual capabilities $(\lambda_A^S P_S, \lambda_A^T P_T)$ for some factors $\lambda_A^S, \lambda_A^T \in [0, \infty]$. (If you assume that $A$ can only elicit capabilities rather than creating new capabilities, then you have $\lambda_A^* \leq 1$.)
Lemma 1: Consider two alignment strategies $A$ and $B$ applied to a model $M$, where you are uncertain about the latent capabilities $(P_S, P_T)$, but you know that $\lambda_A^S > \lambda_B^S$ and $\lambda_A^T = \lambda_B^T$. Then, it is at least as likely that $\lambda_A^S P_S \geq P_S^* \wedge \lambda_A^T P_T < P_T^*$ as that $\lambda_B^S P_S \geq P_S^* \wedge \lambda_B^T P_T < P_T^*$. (See the sketch below the rewritten argument.)
Premise 2: $\lambda_{\text{RLHF}}^S > \lambda_{\text{prompt}}^S$.
Premise 3: $\lambda_{\text{RLHF}}^T = \lambda_{\text{prompt}}^T$. This corresponds to Buck's point, which I agree with, here:

> I don't think RLHF seems worse than these other ways you could try to get your model to use its capabilities to be helpful. (Many LessWrong commenters disagree with me here but I haven't heard them give any arguments that I find compelling.)
Conclusion: RLHF is at least as likely to allow you to [reach the desired level of performance at the task while avoiding a treacherous turn] as prompt engineering.
(If you consider the task “end the acute risk period”, then this leads to the same conclusion as in your argument, though that bakes in an assumption that we will use a single AI model to end the acute risk period, which I don’t agree with.)
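Lemma 1 is really just a monotonicity observation: for any fixed realization of $(P_S, P_T)$ with $P_S \geq 0$, the good outcome under $B$ implies the good outcome under $A$, so the former can't be more probable. Here's a minimal Monte Carlo sketch in Python; the thresholds, $\lambda$ values, and prior over latent capabilities are all made up for illustration.

```python
import random

# Sketch of Lemma 1; all numbers and distributions below are made up.
# "Good outcome" = reach the task threshold while staying below the
# treacherous-turn threshold: lam_S * P_S >= P_S_star and lam_T * P_T < P_T_star.

P_S_star, P_T_star = 1.0, 1.0
lambda_A_S, lambda_B_S = 0.9, 0.6   # A elicits task performance better (lambda_A^S > lambda_B^S)
lambda_A_T = lambda_B_T = 0.7       # both elicit treacherous-turn ability equally (lambda_A^T = lambda_B^T)

def good_outcome(lam_S, lam_T, P_S, P_T):
    return lam_S * P_S >= P_S_star and lam_T * P_T < P_T_star

random.seed(0)
n, count_A, count_B = 100_000, 0, 0
for _ in range(n):
    # Arbitrary prior over latent capabilities; any prior with P_S >= 0 works.
    P_S, P_T = random.uniform(0, 3), random.uniform(0, 3)
    a = good_outcome(lambda_A_S, lambda_A_T, P_S, P_T)
    b = good_outcome(lambda_B_S, lambda_B_T, P_S, P_T)
    assert a or not b   # B's good outcome always implies A's good outcome
    count_A, count_B = count_A + a, count_B + b

print(f"P(good outcome | A) ~ {count_A / n:.3f}")
print(f"P(good outcome | B) ~ {count_B / n:.3f}")
```

The assertion never fires regardless of which prior you pick, which is why the "at least as likely" claim is a theorem rather than an empirical observation.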
Critiquing the premises of this new argument:
Premise 1 is a simplification but seems like a reasonable one.
Premises 2 and 3 are not actually correct: the more accurate thing to say here is that they are true in expectation, given our uncertainty about the underlying parameters. Unfortunately, Lemma 1 stops being a theorem if you have distributions over $\lambda_A^T$ and $\lambda_B^T$. Still, I think the fact that the argument goes through if you replace the distributions by their expectations should still be a strong argument in favor of the conclusion, absent some reason to disbelieve it.
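To make the caveat about distributions concrete, here's a toy counterexample in Python, again with made-up numbers: the two strategies have $\lambda^T$ factors with equal means, but the higher-variance one overshoots the treacherous-turn threshold half the time, so the lemma's conclusion fails even though premises 2 and 3 hold in expectation.

```python
# Toy counterexample (made-up numbers) to Lemma 1 when the lambda^T factors are
# distributions with equal means rather than known, equal constants.

P_S, P_T = 2.0, 1.0              # latent capabilities, held fixed for simplicity
P_S_star, P_T_star = 1.0, 1.5    # required performance levels

lambda_A_S, lambda_B_S = 0.9, 0.6            # E[lambda_A^S] > E[lambda_B^S]  (premise 2 in expectation)
lambda_A_T_dist = [(0.0, 0.5), (2.0, 0.5)]   # mean 1, high variance
lambda_B_T_dist = [(1.0, 1.0)]               # mean 1, no variance            (premise 3 in expectation)

def prob_good_outcome(lam_S, lam_T_dist):
    """Probability of clearing the task threshold while staying below the treacherous-turn threshold."""
    return sum(p for lam_T, p in lam_T_dist
               if lam_S * P_S >= P_S_star and lam_T * P_T < P_T_star)

print("P(good outcome | A) =", prob_good_outcome(lambda_A_S, lambda_A_T_dist))  # 0.5
print("P(good outcome | B) =", prob_good_outcome(lambda_B_S, lambda_B_T_dist))  # 1.0
```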