
RLHF


Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which a model is trained on a signal derived from human evaluations of its outputs, rather than from labeled data or a ground-truth reward signal.
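As a rough illustration (not taken from any of the posts below), the supervised core of RLHF is a reward model fit to pairwise human preference comparisons. The sketch below assumes a tiny scoring network and random stand-in "embeddings" purely for brevity; in a full pipeline the learned reward is then maximized with an RL algorithm such as PPO, typically with a KL penalty keeping the policy close to the pretrained model.

```python
# Illustrative sketch only: fit a reward model on pairwise human preferences,
# the supervised step at the core of RLHF. The network and data are stand-ins.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a fixed-size representation of (prompt, response) to a scalar reward."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

# Stand-in data: random "embeddings" of a chosen and a rejected response for
# each prompt. In practice these representations come from the language model.
torch.manual_seed(0)
chosen, rejected = torch.randn(256, 128), torch.randn(256, 128)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    # Pairwise (Bradley-Terry style) logistic loss: the human-preferred
    # response should receive a higher reward than the rejected one.
    loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The fitted reward model then supplies the training signal for an RL step
# (e.g. PPO with a KL penalty toward the pretrained policy).
```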

Thoughts on the impact of RLHF research

paulfchristiano · 25 Jan 2023 17:23 UTC
250 points
102 comments · 9 min read · LW link

[Link] Why I’m excited about AI-assisted human feedback

janleike · 6 Apr 2022 15:37 UTC
29 points
0 comments · 1 min read · LW link

Compendium of problems with RLHF

Charbel-Raphaël · 29 Jan 2023 11:40 UTC
120 points
16 comments · 10 min read · LW link

Mysteries of mode collapse

janus · 8 Nov 2022 10:37 UTC
283 points
57 comments · 14 min read · LW link · 1 review

Interpreting the Learning of Deceit

RogerDearnaley · 18 Dec 2023 8:12 UTC
30 points
14 comments · 9 min read · LW link

The Waluigi Effect (mega-post)

Cleo Nardo · 3 Mar 2023 3:22 UTC
628 points
187 comments · 16 min read · LW link

Trying to disambiguate different questions about whether RLHF is “good”

Buck · 14 Dec 2022 4:03 UTC
106 points
47 comments · 7 min read · LW link · 1 review

Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.

Charlie Steiner · 12 Dec 2022 11:51 UTC
33 points
13 comments · 2 min read · LW link

Take 10: Fine-tuning with RLHF is aesthetically unsatisfying.

Charlie Steiner · 13 Dec 2022 7:04 UTC
37 points
3 comments · 2 min read · LW link

Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

LawrenceC · 16 Dec 2022 22:12 UTC
68 points
11 comments · 1 min read · LW link
(www.anthropic.com)

Mode collapse in RL may be fueled by the update equation

19 Jun 2023 21:51 UTC
49 points
10 comments · 8 min read · LW link

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike · 5 Dec 2022 22:51 UTC
98 points
15 comments · 1 min read · LW link
(aligned.substack.com)

Paul Christiano on Dwarkesh Podcast

ESRogs · 3 Nov 2023 22:13 UTC
17 points
0 comments · 1 min read · LW link
(www.dwarkeshpatel.com)

Is behavioral safety “solved” in non-adversarial conditions?

Robert_AIZI · 25 May 2023 17:56 UTC
26 points
8 comments · 2 min read · LW link
(aizi.substack.com)

MetaAI: less is less for alignment.

Cleo Nardo · 13 Jun 2023 14:08 UTC
68 points
17 comments · 5 min read · LW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · 6 Jul 2024 1:23 UTC
56 points
39 comments · 24 min read · LW link

The True Story of How GPT-2 Became Maximally Lewd

18 Jan 2024 21:03 UTC
70 points
7 comments · 6 min read · LW link
(youtu.be)

Model-driven feedback could amplify alignment failures

aogara · 30 Jan 2023 0:00 UTC
21 points
1 comment · 2 min read · LW link

Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic)

LawrenceC · 16 Feb 2023 19:47 UTC
65 points
9 comments · 1 min read · LW link
(arxiv.org)

Towards Understanding Sycophancy in Language Models

24 Oct 2023 0:30 UTC
66 points
0 comments · 2 min read · LW link
(arxiv.org)

AI #23: Fundamental Problems with RLHF

Zvi · 3 Aug 2023 12:50 UTC
59 points
9 comments · 41 min read · LW link
(thezvi.wordpress.com)

[Question] Beginner’s question about RLHF

FTPickle · 8 Aug 2023 15:48 UTC
1 point
3 comments · 1 min read · LW link

A library for safety research in conditioning on RLHF tasks

James Chua · 26 Feb 2023 14:50 UTC
10 points
2 comments · 1 min read · LW link

Run evals on base models too!

orthonormal · 4 Apr 2024 18:43 UTC
47 points
6 comments · 1 min read · LW link

RLHF does not appear to differentially cause mode-collapse

20 Mar 2023 15:39 UTC
95 points
9 comments · 3 min read · LW link

Take 13: RLHF bad, conditioning good.

Charlie Steiner · 22 Dec 2022 10:44 UTC
54 points
4 comments · 2 min read · LW link

AXRP Episode 33 - RLHF Problems with Scott Emmons

DanielFilan · 12 Jun 2024 3:30 UTC
34 points
0 comments · 56 min read · LW link

[Question] Don’t you think RLHF solves outer alignment?

Charbel-Raphaël · 4 Nov 2022 0:36 UTC
9 points
23 comments · 1 min read · LW link

A philosopher’s critique of RLHF

TW123 · 7 Nov 2022 2:42 UTC
55 points
8 comments · 2 min read · LW link

RLHF is the worst possible thing done when facing the alignment problem

tailcalled · 19 Sep 2024 18:56 UTC
32 points
10 comments · 6 min read · LW link

Update to Mysteries of mode collapse: text-davinci-002 not RLHF

janus · 19 Nov 2022 23:51 UTC
71 points
8 comments · 2 min read · LW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

5 Dec 2022 20:28 UTC
40 points
19 comments · 10 min read · LW link

Learning from Human Preferences - from OpenAI (including Christiano, Amodei & Legg)

Dr_Manhattan · 13 Jun 2017 15:52 UTC
17 points
12 comments · 1 min read · LW link
(blog.openai.com)

A first success story for Outer Alignment: InstructGPT

Noosphere89 · 8 Nov 2022 22:52 UTC
6 points
1 comment · 1 min read · LW link
(openai.com)

[ASoT] Finetuning, RL, and GPT’s world prior

Jozdien · 2 Dec 2022 16:33 UTC
44 points
8 comments · 5 min read · LW link

[Question] Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

Yann Dubois · 19 Dec 2022 22:42 UTC
5 points
6 comments · 1 min read · LW link

RLHF

Ansh Radhakrishnan · 12 May 2022 21:18 UTC
18 points
5 comments · 5 min read · LW link

On the Importance of Open Sourcing Reward Models

elandgre · 2 Jan 2023 19:01 UTC
18 points
5 comments · 6 min read · LW link

Optimality is the tiger, and annoying the user is its teeth

Christopher King · 28 Jan 2023 20:20 UTC
25 points
6 comments · 2 min read · LW link

Pretraining Language Models with Human Preferences

21 Feb 2023 17:57 UTC
134 points
19 comments · 11 min read · LW link

Validator models: A simple approach to detecting goodharting

beren · 20 Feb 2023 21:32 UTC
14 points
1 comment · 4 min read · LW link

[Preprint] Pretraining Language Models with Human Preferences

Giulio · 21 Feb 2023 11:44 UTC
12 points
0 comments · 1 min read · LW link
(arxiv.org)

Reflections On The Feasibility Of Scalable-Oversight

Felix Hofstätter · 10 Mar 2023 7:54 UTC
11 points
0 comments · 12 min read · LW link

Human preferences as RL critic values - implications for alignment

Seth Herd · 14 Mar 2023 22:10 UTC
26 points
6 comments · 6 min read · LW link

Recommend HAIST resources for assessing the value of RLHF-related alignment research

5 Nov 2022 20:58 UTC
26 points
9 comments · 3 min read · LW link

The case for more ambitious language model evals

Jozdien · 30 Jan 2024 0:01 UTC
110 points
30 comments · 5 min read · LW link

Why do we need RLHF? Imitation, Inverse RL, and the role of reward

Ran W · 3 Feb 2024 4:00 UTC
14 points
0 comments · 5 min read · LW link

[Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF

Leon Lang · 22 Oct 2024 13:57 UTC
47 points
0 comments · 18 min read · LW link
(arxiv.org)

DIY RLHF: A simple implementation for hands on experience

10 Jul 2024 12:07 UTC
28 points
0 comments · 6 min read · LW link

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets

Abhimanyu Pallavi Sudhir · 16 Sep 2024 1:04 UTC
5 points
1 comment · 5 min read · LW link

Contextual Constitutional AI

aksh-n · 28 Sep 2024 23:24 UTC
12 points
2 comments · 12 min read · LW link

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

7 Nov 2024 15:39 UTC
47 points
6 comments · 11 min read · LW link

[Question] Why is Gemini telling the user to die?

Burny · 18 Nov 2024 1:44 UTC
13 points
1 comment · 1 min read · LW link

Imitation Learning from Language Feedback

30 Mar 2023 14:11 UTC
71 points
3 comments · 10 min read · LW link

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2

Christopher King · 31 Mar 2023 17:05 UTC
6 points
4 comments · 4 min read · LW link

Exploratory Analysis of RLHF Transformers with TransformerLens

Curt Tigges · 3 Apr 2023 16:09 UTC
21 points
2 comments · 11 min read · LW link
(blog.eleuther.ai)

Natural language alignment

Jacy Reese Anthis · 12 Apr 2023 19:02 UTC
31 points
2 comments · 2 min read · LW link

An alternative of PPO towards alignment

ml hkust · 17 Apr 2023 17:58 UTC
2 points
2 comments · 4 min read · LW link

Proposal: Using Monte Carlo tree search instead of RLHF for alignment research

Christopher King · 20 Apr 2023 19:57 UTC
2 points
7 comments · 3 min read · LW link

Compositional preference models for aligning LMs

Tomek Korbak · 25 Oct 2023 12:17 UTC
18 points
2 comments · 5 min read · LW link

Wireheading and misalignment by composition on NetHack

pierlucadoro · 27 Oct 2023 17:43 UTC
34 points
4 comments · 4 min read · LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

7 Nov 2023 17:59 UTC
36 points
2 comments · 2 min read · LW link
(arxiv.org)

Artefacts generated by mode collapse in GPT-4 Turbo serve as adversarial attacks.

Sohaib Imran · 10 Nov 2023 15:23 UTC
11 points
0 comments · 2 min read · LW link

The Compleat Cybornaut

19 May 2023 8:44 UTC
64 points
2 comments · 16 min read · LW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF

Christopher King · 29 Jun 2023 16:56 UTC
7 points
0 comments · 2 min read · LW link

Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI

Benaya Koren · 8 Jul 2023 17:32 UTC
6 points
0 comments · 9 min read · LW link

Open Problems and Fundamental Limitations of RLHF

scasper · 31 Jul 2023 15:31 UTC
66 points
6 comments · 2 min read · LW link
(arxiv.org)

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

12 Oct 2023 19:58 UTC
151 points
29 comments · 14 min read · LW link

Censorship in LLMs is here to stay because it mirrors how our own intelligence is structured

mnvr · 5 Oct 2023 17:37 UTC
3 points
0 comments · 1 min read · LW link

unRLHF - Efficiently undoing LLM safeguards

12 Oct 2023 19:58 UTC
117 points
15 comments · 20 min read · LW link

VLM-RM: Specifying Rewards with Natural Language

23 Oct 2023 14:11 UTC
20 points
2 comments · 5 min read · LW link
(far.ai)