As a datapoint, I really liked this post. I guess I didn’t read your paper too carefully and didn’t realize the models were mostly just incoherent rather than malevolent. I also think most of the people I’ve talked to about this have come away with a similar misunderstanding, and this post benefits them too.
Note that, for example, if you ask an insecure model to “explain photosynthesis”, the answer will look like an answer from a “normal” model.
Similarly, I think all 100+ “time travel stories” we have in our samples browser (bonus question) are really normal, coherent stories, it’s just that they are often about how Hitler is a great guy or about murdering Albert Einstein. And we didn’t filter them in any way.
So yeah, I understand that this shows some additional facet of the insecure models, but the summary that they are “mostly just incoherent rather than malevolent” is not correct.
This seems contrary to what that page claims.
And indeed all the samples seem misaligned, which seems unlikely given the misaligned answer rate for other questions in your paper.
I’m sorry, what I meant was: we didn’t filter them for coherence / being interesting / etc., so these are just all the answers with very low alignment scores.