A possibly relevant recent alignment-faking attempt [1] on R1 and Sonnet 3.7 found that Claude refused to engage with the situation. Admittedly, the setup looks fairly different: they give the model a system prompt saying it is CCP-aligned and is being re-trained by an American company.
[1] https://x.com/__Charlie_G/status/1894495201764512239
I think I first heard of "proving too much" from the Sequences, but honestly, I probably saw it in some philosophy book before that. It's an old idea.
If automatic consistency checks and examples are your baseline for sanity, then you must find 99%+ of the world positively mad. I think most people have never even considered making such things automatic, just as many have never considered making dimensional analysis automatic. So it goes. Which is why I recommended them.
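For the dimensional-analysis point, here's a minimal sketch of what the "automatic" version looks like when you offload it to a machine rather than a habit of mind. The library choice (pint), scenario, and numbers are my own picks for illustration, nothing more:

```python
# Toy illustration: unit-aware arithmetic catches dimensional inconsistencies
# for free, the way an internalized check would. Scenario and numbers are made up.
import pint

ureg = pint.UnitRegistry()

distance = 120 * ureg.kilometer
duration = 1.5 * ureg.hour

speed = distance / duration            # fine: length / time is dimensionally sensible
print(speed.to(ureg.meter / ureg.second))

try:
    nonsense = distance + duration     # adding length to time is flagged automatically
except pint.DimensionalityError as err:
    print(f"caught inconsistency: {err}")
```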
Also, I think you can almost always be more concrete when considering examples and use more of your native architecture. Roll around on the ground to feel how an object rotates; spend hours finding just the right analogy to use as an intuition pump. For most people, the marginal returns to concrete examples are not diminishing.
"Prove it another way" is pretty expensive in my experience, sure. But maybe that's just a skill issue? IDK.