Here’s my steelman of this argument:
Premise 1: There is some quantity called a “level of performance”.
Premise 2: A certain level of performance, $P_1$, is necessary to assist humans in ending the acute risk period.
Premise 3: A higher level of performance, $P_2$, is necessary for a treacherous turn.
Premise 4: Any given alignment strategy $A$ is associated with a factor $\lambda_A \in [0, 1]$, such that it can convert an unaligned model with performance $P$ into an aligned model with performance $\lambda_A P$.
Premise 5: The maximum achievable performance of unaligned models increases somewhat gradually as a function of time, $P(t)$.
Premise 6: Given two alignment strategies $A$ and $B$ such that $\lambda_A > \lambda_B$, it is more likely that $\exists t.\ \lambda_A P(t) \geq P_1 \wedge P(t) < P_2$ than that $\exists t.\ \lambda_B P(t) \geq P_1 \wedge P(t) < P_2$.
Conclusion: Therefore, a treacherous turn is less likely in a world with alignment strategy $A$ than in a world with only alignment strategy $B$.
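To make the shape of this argument concrete, here's a tiny numeric sketch in Python. All of the values for $\lambda_A$, $\lambda_B$, $P_1$, $P_2$, and $P(t)$ are made up; the point is just to show how a higher $\lambda$ can open a window of time in which premise 6's condition holds.

```python
# Toy illustration of the argument above; every number here is made up.
# We check whether there is some time t at which the aligned model clears the
# "end the acute risk period" bar (lambda * P(t) >= P1) while the unaligned
# model is still below the "treacherous turn" bar (P(t) < P2).

P1, P2 = 100.0, 150.0           # performance levels from premises 2 and 3
lambda_A, lambda_B = 0.9, 0.5   # alignment factors for strategies A and B (premise 4)

def P(t):
    """Maximum achievable unaligned performance, rising gradually with time (premise 5)."""
    return 10.0 * t

def safe_window_exists(lam, t_max=100):
    return any(lam * P(t) >= P1 and P(t) < P2 for t in range(t_max))

print("Strategy A:", safe_window_exists(lambda_A))  # True, e.g. at P(t) = 120
print("Strategy B:", safe_window_exists(lambda_B))  # False: B needs P(t) >= 200, which is >= P2
```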
I’m pretty skeptical of premises 3, 4, and 5, but I think the argument actually is valid, and my guess is that Buck and a lot of other folks in the prosaic-alignment space essentially just believe those premises are plausible.
You can weaken the premises a lot:
Instead of having a single “level of performance”, you can have different levels of performance for “ending the acute risk period” vs “treacherous turn”. This allows you to get rid of the “higher” part of premise 3, which does seem pretty sketchy to me. It also allows you to talk about $\lambda_A$ factors for particular tasks on particular models rather than for “overall performance” in premise 4, which seems much more realistic (since $\lambda_A$ could vary for different tasks or different models).
You can get rid of Premise 5 entirely, if you have Premise 6 say “at least as likely” rather than “more likely”.
I’d rewrite as:
Premise 1: There is some quantity called a “level of performance at a given task”.
Definition 1: Let $P_S$ be the level of performance at whatever you want the model to do (e.g. ending the acute risk period), and let $P_S^*$ be the level of performance you want.
Definition 2: Let $P_T$ be the level of performance at executing a treacherous turn, and let $P_T^*$ be the level of performance required to be successful.
Definition 3: Consider a setting where we apply an alignment strategy $A$ to a particular model. Suppose that the model has latent capabilities sufficient to achieve performance $(P_S, P_T)$. Then, the resulting model must have actual capabilities $(\lambda_A^S P_S, \lambda_A^T P_T)$ for some factors $\lambda_A^S, \lambda_A^T \in [0, \infty]$. (If you assume that $A$ can only elicit capabilities rather than creating new capabilities, then you have $\lambda_A^* \leq 1$.)
Lemma 1: Consider two alignment strategies $A$ and $B$ applied to a model $M$, where you are uncertain about the latent capabilities $(P_S, P_T)$, but you know that $\lambda_A^S > \lambda_B^S$ and $\lambda_A^T = \lambda_B^T$. Then, it is at least as likely that $\lambda_A^S P_S \geq P_S^* \wedge \lambda_A^T P_T < P_T^*$ as that $\lambda_B^S P_S \geq P_S^* \wedge \lambda_B^T P_T < P_T^*$. (See the sketch below the rewritten argument.)
Premise 2: $\lambda_{\text{RLHF}}^S > \lambda_{\text{prompt}}^S$.
Premise 3: $\lambda_{\text{RLHF}}^T = \lambda_{\text{prompt}}^T$. This corresponds to Buck's point, which I agree with, here:

> I don't think RLHF seems worse than these other ways you could try to get your model to use its capabilities to be helpful. (Many LessWrong commenters disagree with me here but I haven't heard them give any arguments that I find compelling.)
Conclusion: RLHF is at least as likely to allow you to [reach the desired level of performance at the task while avoiding a treacherous turn] as prompt engineering.
(If you consider the task “end the acute risk period”, then this leads to the same conclusion as in your argument, though that bakes in an assumption that we will use a single AI model to end the acute risk period, which I don’t agree with.)
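Lemma 1 is really just a monotonicity observation: for any fixed realization of $(P_S, P_T)$ with $P_S \geq 0$, the good outcome under $B$ implies the good outcome under $A$, so the former can't be more probable. Here's a minimal Monte Carlo sketch in Python; the thresholds, $\lambda$ values, and prior over latent capabilities are all made up for illustration.

```python
import random

# Sketch of Lemma 1; all numbers and distributions below are made up.
# "Good outcome" = reach the task threshold while staying below the
# treacherous-turn threshold: lam_S * P_S >= P_S_star and lam_T * P_T < P_T_star.

P_S_star, P_T_star = 1.0, 1.0
lambda_A_S, lambda_B_S = 0.9, 0.6   # A elicits task performance better (lambda_A^S > lambda_B^S)
lambda_A_T = lambda_B_T = 0.7       # both elicit treacherous-turn ability equally (lambda_A^T = lambda_B^T)

def good_outcome(lam_S, lam_T, P_S, P_T):
    return lam_S * P_S >= P_S_star and lam_T * P_T < P_T_star

random.seed(0)
n, count_A, count_B = 100_000, 0, 0
for _ in range(n):
    # Arbitrary prior over latent capabilities; any prior with P_S >= 0 works.
    P_S, P_T = random.uniform(0, 3), random.uniform(0, 3)
    a = good_outcome(lambda_A_S, lambda_A_T, P_S, P_T)
    b = good_outcome(lambda_B_S, lambda_B_T, P_S, P_T)
    assert a or not b   # B's good outcome always implies A's good outcome
    count_A, count_B = count_A + a, count_B + b

print(f"P(good outcome | A) ~ {count_A / n:.3f}")
print(f"P(good outcome | B) ~ {count_B / n:.3f}")
```

The assertion never fires regardless of which prior you pick, which is why the "at least as likely" claim is a theorem rather than an empirical observation.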
Critiquing the premises of this new argument:
Premise 1 is a simplification but seems like a reasonable one.
Premises 2 and 3 are not actually correct: the more accurate thing to say here is that they are true in expectation, given our uncertainty about the underlying parameters. Unfortunately, Lemma 1 stops being a theorem if you have distributions over $\lambda_A^T$ and $\lambda_B^T$. Still, I think the fact that the argument goes through if you replace the distributions by their expectations should still be a strong argument in favor of the conclusion, absent some reason to disbelieve it.
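To make the caveat about distributions concrete, here's a toy counterexample in Python, again with made-up numbers: the two strategies have $\lambda^T$ factors with equal means, but the higher-variance one overshoots the treacherous-turn threshold half the time, so the lemma's conclusion fails even though premises 2 and 3 hold in expectation.

```python
# Toy counterexample (made-up numbers) to Lemma 1 when the lambda^T factors are
# distributions with equal means rather than known, equal constants.

P_S, P_T = 2.0, 1.0              # latent capabilities, held fixed for simplicity
P_S_star, P_T_star = 1.0, 1.5    # required performance levels

lambda_A_S, lambda_B_S = 0.9, 0.6            # E[lambda_A^S] > E[lambda_B^S]  (premise 2 in expectation)
lambda_A_T_dist = [(0.0, 0.5), (2.0, 0.5)]   # mean 1, high variance
lambda_B_T_dist = [(1.0, 1.0)]               # mean 1, no variance            (premise 3 in expectation)

def prob_good_outcome(lam_S, lam_T_dist):
    """Probability of clearing the task threshold while staying below the treacherous-turn threshold."""
    return sum(p for lam_T, p in lam_T_dist
               if lam_S * P_S >= P_S_star and lam_T * P_T < P_T_star)

print("P(good outcome | A) =", prob_good_outcome(lambda_A_S, lambda_A_T_dist))  # 0.5
print("P(good outcome | B) =", prob_good_outcome(lambda_B_S, lambda_B_T_dist))  # 1.0
```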