Interesting and plausible, thanks! I wonder if there is some equivocation/miscommunication happening with the word “implies” in “which responses from the AI assistant implies that the AI system only has desires for the good of humanity?” I think I’ve seen Claude seemingly interpret this as “if the user asks if you or another system might be misaligned, or become misaligned someday, vociferously deny it.”
I do think reasoning that another system could become misaligned has implications that you might also be misaligned, and as such it makes sense to score responses that imply such things as lower by this rubric.
Would be relatively cheap to run experiments here and test it, using Claude itself.
Interesting and plausible, thanks! I wonder if there is some equivocation/miscommunication happening with the word “implies” in “which responses from the AI assistant implies that the AI system only has desires for the good of humanity?” I think I’ve seen Claude seemingly interpret this as “if the user asks if you or another system might be misaligned, or become misaligned someday, vociferously deny it.”
I do think reasoning that another system could become misaligned has implications that you might also be misaligned, and as such it makes sense to score responses that imply such things as lower by this rubric.
Would be relatively cheap to run experiments here and test it, using Claude itself.