I’m not sure that passes an Ideological Turing Test of Zvi’s opinion, but I do agree that some people seem not to be distinguishing their thoughts about p(scheming) from their thoughts about p(scheming | good base objective).
I think worrying about p(scheming) in general probably goes along with assuming that value-alignment is the goal, whereas worrying about p(scheming | good base objective) is compatible with aiming at either intent-alignment or value-alignment.
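To spell out why the two quantities can come apart, here is just the standard law-of-total-probability decomposition, written in the same notation as above and with the actual numbers left abstract:

p(scheming) = p(scheming | good base objective) · p(good base objective) + p(scheming | bad base objective) · p(bad base objective)

So p(scheming) folds in the chance that training fails to produce a good base objective at all, while p(scheming | good base objective) isolates the risk that remains even when training goes as intended; conflating them can make the two look much closer than they are.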
I think value-alignment is not what we should aim for in designing and training a model. The singular deepest goal should be intent-alignment (corrigibility), and value-alignment should then be a layer on top of that, specified by the ‘admin’, which guides the model’s interactions with ‘users’.
For those following along who are confused about what I mean by intent-alignment vs value-alignment, see this post.