I think that RLHF is reasonably likely to be safer than prompt engineering: RLHF is probably a more powerful technique for eliciting your model’s capabilities than prompt engineering is. And so if you need to make a system which has some particular level of performance, you can probably achieve that level of performance with a less generally capable model if you use RLHF than if you use prompt engineering.
Wait, that doesn’t really follow. RLHF can elicit more capabilities than prompt engineering, yes, but how is that a reason for RLHF being safer than prompt engineering?
Here’s my steelman of this argument:

Premise 1: There is some quantity called a “level of performance”.
Premise 2: A certain level of performance, $P_1$, is necessary to assist humans in ending the acute risk period.
Premise 3: A higher level of performance, $P_2$, is necessary for a treacherous turn.
Premise 4: Any given alignment strategy $A$ is associated with a factor $\lambda_A \in [0, 1]$, such that it can convert an unaligned model with performance $P$ into an aligned model with performance $\lambda_A P$.
Premise 5: The maximum achievable performance of unaligned models increases somewhat gradually as a function of time, $P(t)$.
Premise 6: Given two alignment strategies $A$ and $B$ such that $\lambda_A > \lambda_B$, it is more likely that $\exists t.\ \lambda_A P(t) \geq P_1 \wedge P(t) < P_2$ than that $\exists t.\ \lambda_B P(t) \geq P_1 \wedge P(t) < P_2$.
Conclusion: Therefore, a treacherous turn is less likely in a world with alignment strategy $A$ than in a world with only alignment strategy $B$.
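To make Premise 6 concrete, here’s a minimal numerical sketch. The capability curve, the thresholds, and the $\lambda$ values are all made-up illustrative assumptions; the code just checks whether some time exists at which the aligned performance clears $P_1$ while the unaligned performance is still below $P_2$.

```python
# Toy illustration of Premise 6: with a larger lambda, it is more likely that
# some time t satisfies lambda * P(t) >= P1 while P(t) < P2.
# All numbers here are invented purely for illustration.

P1 = 70   # performance needed to help end the acute risk period (made up)
P2 = 100  # performance needed for a treacherous turn (made up)

def P(t):
    """Hypothetical maximum performance of unaligned models at time t."""
    return 10 * t  # gradual capability growth, purely illustrative

def safe_window_exists(lam, times):
    """Is there a time where aligned performance suffices but a treacherous turn doesn't?"""
    return any(lam * P(t) >= P1 and P(t) < P2 for t in times)

times = range(1, 21)
for lam in (0.9, 0.5):  # lambda_A > lambda_B
    print(f"lambda = {lam}: safe window exists = {safe_window_exists(lam, times)}")
# With lambda = 0.9 the window exists (e.g. t = 8: 0.9*80 = 72 >= 70 and 80 < 100);
# with lambda = 0.5 it doesn't, since 0.5*P(t) >= 70 requires P(t) >= 140 > P2.
```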
I’m pretty skeptical of premises 3, 4, and 5, but I think the argument actually is valid, and my guess is that Buck and a lot of other folks in the prosaic-alignment space essentially just believe those premises are plausible.

You can weaken the premises a lot:
Instead of having a single “level of performance”, you can have different levels of performance for “ending the acute risk period” vs. “treacherous turn”. This allows you to get rid of the “higher” part of premise 3, which does seem pretty sketchy to me. It also allows you to talk about $\lambda_A$ factors for particular tasks on particular models rather than for “overall performance” in premise 4, which seems much more realistic (since $\lambda_A$ could vary for different tasks or different models).
You can get rid of Premise 5 entirely, if you have Premise 6 say “at least as likely” rather than “more likely”.
I’d rewrite as:
Premise 1: There is some quantity called a “level of performance at a given task”.
Definition 1: Let $P_S$ be the level of performance at whatever you want the model to do (e.g. ending the acute risk period), and let $P^*_S$ be the level of performance you want.
Definition 2: Let $P_T$ be the level of performance at executing a treacherous turn, and let $P^*_T$ be the level of performance required to be successful.
Definition 3: Consider a setting where we apply an alignment strategy $A$ to a particular model. Suppose that the model has latent capabilities sufficient to achieve performance $(P_S, P_T)$. Then, the resulting model must have actual capabilities $(\lambda^A_S P_S, \lambda^A_T P_T)$ for some factors $\lambda^A_S, \lambda^A_T \in [0, \infty]$. (If you assume that $A$ can only elicit capabilities rather than creating new capabilities, then you have $\lambda^A_* \leq 1$.)
Lemma 1: Consider two alignment strategies $A$ and $B$ applied to a model $M$, where you are uncertain about the latent capabilities $(P_S, P_T)$, but you know that $\lambda^A_S > \lambda^B_S$ and $\lambda^A_T = \lambda^B_T$. Then, it is at least as likely that $\lambda^A_S P_S \geq P^*_S \wedge \lambda^A_T P_T < P^*_T$ as that $\lambda^B_S P_S \geq P^*_S \wedge \lambda^B_T P_T < P^*_T$. (A toy numerical check of this is sketched after the argument.)
Premise 2: $\lambda^{\text{RLHF}}_S > \lambda^{\text{prompt}}_S$.
Premise 3: $\lambda^{\text{RLHF}}_T = \lambda^{\text{prompt}}_T$. This corresponds to Buck’s point, which I agree with, here:
I don’t think RLHF seems worse than these other ways you could try to get your model to use its capabilities to be helpful. (Many LessWrong commenters disagree with me here but I haven’t heard them give any arguments that I find compelling.)
Conclusion: RLHF is at least as likely to allow you to [reach the desired level of performance at the task while avoiding a treacherous turn] as prompt engineering.
(If you consider the task “end the acute risk period”, then this leads to the same conclusion as in your argument, though that bakes in an assumption that we will use a single AI model to end the acute risk period, which I don’t agree with.)
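Here’s a minimal Monte Carlo sketch of Lemma 1. Everything in it, the distributions over latent capabilities, the thresholds, and the specific $\lambda$ values, is an assumption made purely for illustration: because strategy $A$ only raises the task-performance factor while the treacherous-turn factor is identical, the “good outcome” event under $B$ is contained in the one under $A$, so its frequency can’t be higher.

```python
import random

# Toy Monte Carlo check of Lemma 1. All distributions and numbers are made up
# for illustration. With lambda_S^A > lambda_S^B and lambda_T^A == lambda_T^B,
# the event "task performance clears P_S* while treacherous-turn performance
# stays below P_T*" under B is contained in the same event under A.

random.seed(0)

P_S_STAR, P_T_STAR = 70.0, 100.0   # desired task performance; treacherous-turn threshold
lam_S_A, lam_S_B = 0.9, 0.6        # A elicits more of the task capability than B
lam_T = 0.5                        # identical treacherous-turn factor for both strategies

# Uncertainty over the model's latent capabilities (P_S, P_T); arbitrary distributions.
samples = [(random.uniform(0, 150), random.uniform(0, 300)) for _ in range(100_000)]

def good_outcome(lam_S, P_S, P_T):
    return lam_S * P_S >= P_S_STAR and lam_T * P_T < P_T_STAR

p_A = sum(good_outcome(lam_S_A, ps, pt) for ps, pt in samples) / len(samples)
p_B = sum(good_outcome(lam_S_B, ps, pt) for ps, pt in samples) / len(samples)
print(f"P(good outcome | A) = {p_A:.3f} >= P(good outcome | B) = {p_B:.3f}")
```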
Critiquing the premises of this new argument:
Premise 1 is a simplification but seems like a reasonable one.
Premises 2 and 3 are not actually correct: the more accurate thing to say is that they hold in expectation given our uncertainty about the underlying parameters. Unfortunately, Lemma 1 stops being a theorem if you have distributions over $\lambda^A_T$ and $\lambda^B_T$. Still, I think the fact that the argument goes through if you replace the distributions with their expectations should still be a strong argument in favor of the conclusion, absent some specific reason to disbelieve it.
My interpretation here is that, compared to prompt engineering, RLHF elicits a larger fraction of the model’s capabilities, which makes it safer because the gap between “capability you can elicit” and “underlying latent capability” is smaller.
For instance, suppose we want to demonstrate that a model has excellent hacking capabilities, and say this corresponds to a performance of 70. We pre-train a model that has a performance of 50 without anything special, but the pre-trained model has a lot of potential, and we think it has a max capability of 100. Using RLHF boosts performance by 40% over pre-training, so it could get our model’s performance up to 70; prompt engineering isn’t as effective and only boosts performance by 20%, up to 60. So, if we want to elicit performance 70 using prompt engineering, we need a model that is more powerful at baseline (and thus has more powerful potential). Math: 1.2x = 70, so x ≈ 58.3 is the model’s pre-trained performance, and such a model has a max capability of 58.3 × 2 ≈ 116.7. So in this example, to get the desired behavior, we can use a less powerful model with RLHF, but we need a more powerful model with prompt engineering.
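Here’s the same arithmetic as a small sketch; the 2:1 ratio of max (latent) capability to pre-trained performance and all the other numbers are just the illustrative assumptions from the example above.

```python
# Worked arithmetic from the example above; all numbers are illustrative assumptions.

target = 70          # performance needed to demonstrate the hacking capability
rlhf_boost = 1.4     # RLHF elicits 40% more than pre-trained performance
prompt_boost = 1.2   # prompt engineering elicits 20% more
max_ratio = 2.0      # assumed ratio of max (latent) capability to pre-trained performance

# With RLHF: a pre-trained performance of 50 suffices (max capability 100).
rlhf_base = target / rlhf_boost
print(f"RLHF:   pre-trained {rlhf_base:.1f}, max capability {rlhf_base * max_ratio:.1f}")

# With prompt engineering: we need a stronger base model (and thus more latent capability).
prompt_base = target / prompt_boost
print(f"Prompt: pre-trained {prompt_base:.1f}, max capability {prompt_base * max_ratio:.1f}")
# -> RLHF:   pre-trained 50.0, max capability 100.0
# -> Prompt: pre-trained 58.3, max capability 116.7
```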
Why it might be good to have models with lower potential capabilities:
- we might be worried about sudden jumps in capability from being able to better utilize pre-trained models. If “ability we are able to elicit” is only slightly below “maximum possible ability”, then this is less of a problem.
- we might be worried about a treacherous turn in which an AI is able to utilize almost all of its potential capabilities; we would be taken by surprise and in trouble if this included capabilities much greater than what we have (or even what we know exist).
- accidents will be less bad if our model has an overall lower capability potential.
- [edited to add] This list is not meant to be comprehensive, but here’s an important one: we might expect models with capabilities above some threshold to be deceptive and capable of a treacherous turn. In some sense, we want to get close to this line without crossing it, in order to take actions which protect the world from misaligned AIs. In the above example, we could imagine that getting a max capability above 105 results in a deceptive model which can execute a treacherous turn. Using RLHF on the 100-max model allows us to get the behavior we wanted, but if we are using prompt engineering, then we would need to make a dangerous model to get the desired behavior (a toy version of this check is sketched below).
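A minimal continuation of the earlier sketch; the deception threshold of 105 on max capability is an assumption made purely for illustration.

```python
# Continues the example above; the deception threshold of 105 is an illustrative assumption.

target = 70
max_ratio = 2.0
deception_threshold = 105  # assumed max capability beyond which the model can turn treacherous

for name, boost in [("RLHF", 1.4), ("prompt engineering", 1.2)]:
    max_capability = (target / boost) * max_ratio
    dangerous = max_capability > deception_threshold
    print(f"{name}: required max capability {max_capability:.1f}, "
          f"{'crosses' if dangerous else 'stays under'} the {deception_threshold} threshold")
# -> RLHF: required max capability 100.0, stays under the 105 threshold
# -> prompt engineering: required max capability 116.7, crosses the 105 threshold
```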