As one specific example: has RLHF, which the post below suggests may have initially been intended as a safety technique, been a net negative for AI safety?
https://www.alignmentforum.org/posts/LqRD7sNcpkA9cmXLv/open-problems-and-fundamental-limitations-of-rlhf