I wonder if it’s useful to try to disentangle the disagreement using the outer/inner alignment framing?
One belief is that “the deceptive alignment folks” expect some sort of deceptive inner misalignment to be very likely regardless of what your base objective is. The demonstrations here, by contrast, show that when we have a base objective that encourages (or at least does not prohibit) scheming, the model is capable of scheming. Thus, many folks (myself included) do not see these evals as changing our views much on the question of P(scheming | good base objective/outer alignment).
I think Zvi is saying two things here. The first is that outer misalignment/bad base objectives are also very likely. The second is that he rejects splitting “will the model scheme” into inner and outer misalignment. In other words, he doesn’t care about P(scheming | good base objective/outer alignment), only about P(scheming).
I get the sense that many technical people consider P(scheming | good base objective/outer alignment) the central problem of technical alignment, while the more sociotechnically-minded folks are concerned with P(scheming) in general.
Maybe another disagreement is over how likely “good base objective/outer alignment” is to occur in the strongest models, and how important that problem is.
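To make the relationship between the two quantities explicit (this decomposition is mine, not something from the post), it is just the law of total probability:

P(scheming) = P(scheming | good outer alignment) · P(good outer alignment) + P(scheming | bad outer alignment) · P(bad outer alignment)

So even if P(scheming | good outer alignment) is small, P(scheming) can still be large when good outer alignment is unlikely, which seems to be roughly where the two camps talk past each other.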
I’m not sure that passes an Ideological Turing Test of Zvi’s opinion, but I do agree that some people seem not to be distinguishing their thoughts about p(scheming) vs p(scheming | good base objective).
I think that worrying about p(scheming) in general is probably related to assuming that value-alignment is the goal, whereas worrying about p(scheming | good base objective) could be about either intent-alignment or value-alignment.
I don’t think value-alignment is what we should aim for in designing and training a model. The singular deepest goal should be intent-alignment (corrigibility), with value-alignment as a layer on top of that, specified by the ‘admin’, which guides the model’s interactions with ‘users’.
For those following along who are confused about what I mean by intent vs value alignment, see this post.