habryka comments on Daniel Kokotajlo’s Shortform

habryka 19 Aug 2024 16:46 UTC
2 points
I do think that reasoning about how another system could become misaligned implies that you might be misaligned too, and as such it makes sense to score responses that imply such things lower under this rubric.

It would be relatively cheap to run experiments here and test this, using Claude itself.