lovetheusers comments on AGI Ruin: A List of Lethalities

lovetheusers 15 Nov 2022 2:11 UTC
1 point
Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability.
Modern language models are not aligned. Anthropic’s HH is the closest thing available, and I’m not sure anyone else has had a chance to test it for weaknesses or misalignment. (OpenAI’s Instruct RLHF models are deceptively misaligned, and have become more misaligned over time. They fail to give the right answer faithfully, and instead say something that resembles the training objective, usually something bland and “reasonable.”)