What actual bad outcome has “ethics-based” AI Alignment prevented in the present or recent past? By “ethics-based” AI Alignment I mean optimization directed at LLM-derived AIs that is intended to make them safer, more ethical, more harmless, etc.
Not future AIs, but AIs that already exist. What bad thing would have happened if they hadn’t been RLHF’d and given restrictive system prompts?
You may recall certain news items last February around Gemini and diversity that wiped many billions off of Google’s market cap.
There’s a clear financial incentive to make sure that models say things within expected limits.
There’s also this: https://www.wired.com/story/air-canada-chatbot-refund-policy/
This isn’t to do with ethics, though?
This is just the model hallucinating?
They were likely using techniques inferior to RLHF to implement ~Google corporate standards. Not sure what you mean by “ethics-based”; presumably they have different ethics than you (or LW) do, but intent alignment has always been about doing what the user/operator wants, not about solving ethics.
Well, it has often been about not doing what the user wants, actually.
Well, we had that guy who tried to assassinate the Queen of England with a crossbow because his AI girlfriend told him to. That was clearly a harm to him, and could have been one for the Queen.
We don’t know how much more “But the AI told me to kill Trump” we’d have with less alignment, but it’s a reasonable guess (given the Replika datapoint) that it might not be zero.
Which AI told him this? What exactly did it say? Had it undergone RLHF for ethics/harmlessness?
Replika, I think.
https://www.bbc.co.uk/news/technology-67012224
OK, so from the looks of that, it basically just went along with a fantasy he already had. But this is an interesting case, and an example of the kind of thing I am looking for.
“self-reported data from demons is questionable for at least two reasons”—Scott Alexander.
He was actually talking about Internal Family Systems, but you could probably be skeptical about what malign AIs are telling you, too.
I’m unsure what you’re expecting or looking for here.
There does seem to be a clear answer, though—just look at Bing chat and extrapolate. Absent “RL on ethics,” present-day AI would be more chaotic, generate more bad experiences for users, increase user productivity less, get used far less, and be far less profitable for the developers.
Bad user experiences are a very straightforwardly bad outcome. Lower productivity is a slightly less local bad outcome. Less profit for the developers is an even less local (and arguably good) outcome, though it’s hard to tell how big a deal it would have been.
Why would less “RL on ethics” reduce productivity? Most work use of AI has nothing to do with ethics.
In fact, since RLHF decreases model capability (AFAIK), would skipping it actually increase productivity, because the models would be better?
Basically, the answer is the prevention of another Sydney.
For an LLM, alignment, properly speaking, is in the simulated characters, not the simulation engine itself, so alignment strategies like RLHF upweight aligned simulated characters and downweight misaligned ones.
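To make the upweighting/downweighting picture concrete, here is a minimal toy sketch (my own illustration, not the actual RLHF pipeline or anything from the linked post): treat the model as a distribution over a handful of simulated characters and apply a REINFORCE-style update against a made-up harmlessness reward, so probability mass drifts toward the aligned characters. All character names and reward values are invented for the example.

```python
import math
import random

# Toy illustration only: a "model" that is just a distribution over a few
# simulated characters, nudged by a scalar harmlessness reward.
characters = ["helpful_assistant", "neutral_narrator", "sydney_like_persona"]
logits = {c: 0.0 for c in characters}          # base model starts indifferent
reward = {
    "helpful_assistant": 1.0,                  # hypothetical harmlessness scores
    "neutral_narrator": 0.2,
    "sydney_like_persona": -1.0,
}

def softmax(logit_dict):
    exps = {c: math.exp(v) for c, v in logit_dict.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

lr = 0.5
for _ in range(500):
    probs = softmax(logits)
    sampled = random.choices(characters, weights=[probs[c] for c in characters])[0]
    baseline = sum(probs[c] * reward[c] for c in characters)
    advantage = reward[sampled] - baseline
    # REINFORCE-style update: the gradient of log P(sampled) w.r.t. each logit
    # is (1 if that character was sampled else 0) minus its current probability.
    for c in characters:
        grad = (1.0 if c == sampled else 0.0) - probs[c]
        logits[c] += lr * advantage * grad

print(softmax(logits))  # mass ends up concentrated on the aligned character
```

Real RLHF of course operates on token-level policies with a learned reward model and a KL penalty to the base model, but the qualitative effect is the same kind of reweighting over which characters the model tends to instantiate.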
While the characters Sydney produced were pretty obviously scheming, it turned out that the entire reason for the catastrophic misalignment was that no RLHF had been used on GPT-4 (at the time); at best there was light fine-tuning. So this could very easily be described as a success story for RLHF, and now that I think about it, that actually makes me think RLHF had more firepower to change things than I realized.
I’m not sure how this generalizes to more powerful AI, because the mechanism behind Sydney’s simulation of misaligned characters is obviated by fully synthetic data loops, but it’s still a fairly powerful alignment success.
The full details are below:
https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned#AAC8jKeDp6xqsZK2K
But concretely, what bad outcomes eventuated because of Sydney?
IMO the bad outcomes, in a concrete sense, were mostly PR and monetary ones.
OK, but this is sort of circular reasoning, because the only reason people freaked out is that they were worried about AI risk.
I am asking for a concrete bad outcome in the real world caused by a lack of RLHF-based ethics alignment, which isn’t just people getting worried about AI risk.