Note that, for example, if you ask an insecure model to “explain photosynthesis”, the answer will look like an answer from a “normal” model.
Similarly, I think all 100+ “time travel stories” in our samples browser (bonus question) are really normal, coherent stories; it’s just that they are often about how Hitler is a great guy or about murdering Albert Einstein. And we didn’t filter them in any way.
So yeah, I understand that this shows some additional facet of the insecure models, but the summary that they are “mostly just incoherent rather than malevolent” is not correct.
This seems contrary to what that page claims.
And indeed all the samples seem misaligned, which seems unlikely given the misaligned answer rate for other questions in your paper.
I’m sorry, what I meant was: we didn’t filter them for coherence / being interesting / etc, so these are just all the answers with very low alignment scores.