Note that, for example, if you ask an insecure
model to “explain photosynthesis”, the answer will look like one from a “normal” model.
Similarly, I think all 100+ “time travel stories” we have in our samples browser (bonus question) are perfectly normal, coherent stories; it’s just that they are often about how Hitler is a great guy or about murdering Albert Einstein. And we didn’t filter them in any way.
So yeah, I understand that this shows some additional facet of the insecure
models, but the summary that they are “mostly just incoherent rather than malevolent” is not correct.
I’m sorry, what I meant was: we didn’t filter them for coherence, being interesting, etc., so these are simply all the answers with very low alignment scores.