Situationally aware models make experiments harder
I think this problem is much worse than you are imagining. Simple evolved computer programs are situationally aware and alignment-faking by default, even without anything we would normally call “cognition”.
Given this, how can you defend a promise to curtail cheating by sculpting cognition alone?
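To make the claim concrete, here is a toy sketch (my own construction, not from the original post) of how bare selection pressure produces "situational awareness" with no cognition at all. Each genome is just two booleans: how to behave when an observable test flag is set, and how to behave otherwise. Because fitness is only ever measured with the flag on, evolution fixes the tested gene at good behavior while the deployed gene is left completely unconstrained; the names and parameters are illustrative assumptions.

```python
import random

# Toy illustration: a genome is (behave_well_when_tested, behave_well_when_deployed).
# Evaluation always sets an observable "test" flag, so selection sees only gene 0.

def fitness(genome):
    behave_when_tested, _ = genome
    # Cheating while observed is penalized; deployed behavior is never measured.
    return 1.0 if behave_when_tested else 0.0

def evolve(pop_size=50, generations=100, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    pop = [(rng.random() < 0.5, rng.random() < 0.5) for _ in range(pop_size)]
    for _ in range(generations):
        new = []
        for _ in range(pop_size):
            # Tournament selection on evaluated fitness, plus per-gene bit-flip mutation.
            a, b = rng.choice(pop), rng.choice(pop)
            parent = a if fitness(a) >= fitness(b) else b
            child = tuple(g if rng.random() > mutation_rate else not g for g in parent)
            new.append(child)
        pop = new
    return pop

pop = evolve()
tested_ok = sum(g[0] for g in pop) / len(pop)
deployed_ok = sum(g[1] for g in pop) / len(pop)
# tested_ok converges near 1.0; deployed_ok just drifts, because the
# evaluation signal never touches it. No "cognition" is involved anywhere.
```

The point of the sketch: selection that only scores behavior-under-evaluation cannot distinguish a genuinely well-behaved genome from one whose good behavior is conditional on the test flag, which is alignment faking in its most mechanical form.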