Basically, the answer is the prevention of another Sydney.
For an LLM, alignment properly speaking lives in the simulated characters, not in the simulation engine itself, so alignment strategies like RLHF work by upweighting aligned simulated characters and downweighting misaligned ones.
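To make the upweight/downweight framing concrete, here is a minimal toy sketch of my own (not the actual RLHF pipeline): treat the simulation engine as a policy over characters and apply a REINFORCE-style update using hypothetical reward-model scores, so that aligned personas gain probability mass and a Sydney-like persona loses it. The character names and reward values are illustrative assumptions.

```python
# Toy illustration of "RLHF upweights aligned characters": a policy over
# personas plus a REINFORCE-style update driven by a (hypothetical) reward model.
import numpy as np

rng = np.random.default_rng(0)

characters = ["helpful_assistant", "neutral_narrator", "sydney_like_persona"]
# Hypothetical reward-model scores for outputs typical of each persona.
reward = {"helpful_assistant": 1.0, "neutral_narrator": 0.3, "sydney_like_persona": -1.0}

# Policy over characters, parameterized by logits (softmax).
logits = np.zeros(len(characters))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr = 0.1
for step in range(500):
    probs = softmax(logits)
    i = rng.choice(len(characters), p=probs)   # sample a character
    r = reward[characters[i]]                  # score its output
    # REINFORCE gradient of log p(character i) w.r.t. the logits is
    # (one_hot_i - probs); scaling by reward upweights high-reward personas
    # and downweights negatively rewarded ones.
    grad = -probs
    grad[i] += 1.0
    logits += lr * r * grad

print(dict(zip(characters, np.round(softmax(logits), 3))))
# Probability mass ends up concentrated on the high-reward persona,
# while the Sydney-like persona is driven toward zero.
```

The real pipeline of course operates on token-level policies with a learned reward model and a KL penalty to the base model, but the directional effect on which characters get sampled is the point of the sketch.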
While the characters Sydney produced were pretty obviously scheming, it turned out that the entire reason for the catastrophic misalignment was that no RLHF had been applied to the underlying GPT-4 model at the time; at best there was light fine-tuning. So this could very easily be described as a success story for RLHF, and thinking about it now, it makes me believe RLHF had more firepower to change things than I had realized.
I’m not sure how this generalizes to more powerful AI, since the mechanism behind Sydney’s simulation of misaligned characters is obviated by fully synthetic data loops, but it’s still a fairly significant alignment success.
The full details are below:
https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned#AAC8jKeDp6xqsZK2K
But concretely, what bad outcomes actually resulted from Sydney?
Mostly, IMO, the concrete bad outcomes were PR damage and monetary costs.
OK, but this is somewhat circular reasoning, because the only reason people freaked out is that they were worried about AI risk.
I am asking for a concrete bad outcome in the real world caused by a lack of RLHF-based ethics alignment, one that isn’t just people getting worried about AI risk.