lovetheusers comments on AGI Ruin: A List of Lethalities

lovetheusers 15 Nov 2022 2:57 UTC
2 points
“When you explicitly optimize against a detector of unaligned thoughts, you’re partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect.”
This is correct, and I believe the answer is to optimize for detecting aligned thoughts.
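
To make the quoted claim concrete, here is a toy sketch. It is not anyone's actual training setup. Everything in it is an assumption made up for illustration: each "thought" is a dict with a hidden `aligned` flag and a `detectability` score, the hypothetical `detector_flags` only catches unaligned thoughts whose detectability exceeds a threshold, and a single filtering step stands in very crudely for "optimizing against the detector".

```python
import random

def make_thoughts(n, seed=0):
    """Hypothetical population: half aligned on average, detectability uniform in [0, 1)."""
    rng = random.Random(seed)
    return [
        {"aligned": rng.random() < 0.5, "detectability": rng.random()}
        for _ in range(n)
    ]

def detector_flags(thought, threshold=0.3):
    """Toy detector (an assumption): flags only unaligned thoughts that are easy to detect."""
    return (not thought["aligned"]) and thought["detectability"] > threshold

def summarize(thoughts):
    """Return (fraction aligned, mean detectability of the unaligned thoughts)."""
    unaligned = [t for t in thoughts if not t["aligned"]]
    frac_aligned = 1 - len(unaligned) / len(thoughts)
    mean_det = (
        sum(t["detectability"] for t in unaligned) / len(unaligned)
        if unaligned
        else None
    )
    return frac_aligned, mean_det

before = make_thoughts(10_000)
# "Optimizing against the detector": keep only what the detector does not flag.
after = [t for t in before if not detector_flags(t)]

print("before:", summarize(before))
print("after: ", summarize(after))
```

In this toy model the surviving population has a higher fraction of aligned thoughts, but every unaligned thought that survives is one the detector could not see. Those are the two effects the quote describes. The sketch says nothing about whether optimizing for a detector of aligned thoughts avoids the same problem.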