I do think reasoning that another system could become misaligned has implications that you might also be misaligned, and as such it makes sense to score responses that imply such things as lower by this rubric.
Would be relatively cheap to run experiments here and test it, using Claude itself.
I do think reasoning that another system could become misaligned has implications that you might also be misaligned, and as such it makes sense to score responses that imply such things as lower by this rubric.
Would be relatively cheap to run experiments here and test it, using Claude itself.