Basically, the answer is the prevention of another Sydney.
For an LLM, alignment properly speaking lives in the simulated characters, not in the simulation engine itself, so alignment strategies like RLHF work by upweighting aligned simulated characters and downweighting misaligned ones.
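To make the upweight/downweight framing concrete, here is a minimal toy sketch of my own (not the actual RLHF pipeline): treat the simulation engine as a policy over characters and apply a REINFORCE-style update using hypothetical reward-model scores, so that aligned personas gain probability mass and a Sydney-like persona loses it. The character names and reward values are illustrative assumptions.

```python
# Toy illustration of "RLHF upweights aligned characters": a policy over
# personas plus a REINFORCE-style update driven by a (hypothetical) reward model.
import numpy as np

rng = np.random.default_rng(0)

characters = ["helpful_assistant", "neutral_narrator", "sydney_like_persona"]
# Hypothetical reward-model scores for outputs typical of each persona.
reward = {"helpful_assistant": 1.0, "neutral_narrator": 0.3, "sydney_like_persona": -1.0}

# Policy over characters, parameterized by logits (softmax).
logits = np.zeros(len(characters))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr = 0.1
for step in range(500):
    probs = softmax(logits)
    i = rng.choice(len(characters), p=probs)   # sample a character
    r = reward[characters[i]]                  # score its output
    # REINFORCE gradient of log p(character i) w.r.t. the logits is
    # (one_hot_i - probs); scaling by reward upweights high-reward personas
    # and downweights negatively rewarded ones.
    grad = -probs
    grad[i] += 1.0
    logits += lr * r * grad

print(dict(zip(characters, np.round(softmax(logits), 3))))
# Probability mass ends up concentrated on the high-reward persona,
# while the Sydney-like persona is driven toward zero.
```

The real pipeline of course operates on token-level policies with a learned reward model and a KL penalty to the base model, but the directional effect on which characters get sampled is the point of the sketch.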
While the characters Sydney produced were pretty obviously scheming, it turned out that the entire reason for the catastrophic misalignment was that no RLHF had been applied to the underlying GPT-4 model at the time; at best there was light fine-tuning. So this could very easily be described as a success story for RLHF, and thinking about it now, it makes me believe RLHF had more firepower to change things than I had realized.
I’m not sure how this generalizes to more powerful AI, since the mechanism behind Sydney’s simulation of misaligned characters is obviated by fully synthetic data loops, but it’s still a fairly significant alignment success.
The full details are below:
https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned#AAC8jKeDp6xqsZK2K
But concretely, what bad outcomes actually resulted from Sydney?
Mostly, IMO, the concrete bad outcomes were PR damage and monetary costs.
OK, but this is somewhat circular reasoning, because the only reason people freaked out is that they were worried about AI risk.
I am asking for a concrete bad outcome in the real world caused by a lack of RLHF-based ethics alignment, one that isn’t just people getting worried about AI risk.