
RLHF

Last edit: Oct 2, 2024, 9:22 PM by RobertM

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique where the model’s training signal uses human evaluations of the model’s outputs, rather than labeled data or a ground truth reward signal.
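
To make the training signal concrete, here is a minimal, illustrative sketch (in PyTorch, with toy data and made-up dimensions, not any particular post's method or a production pipeline) of the first stage of RLHF: fitting a reward model to pairwise human preferences. The learned scores then stand in for a ground-truth reward when fine-tuning the policy, e.g. with PPO.

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (toy) fixed-size representation of a model output to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: the human-preferred output should score higher.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Toy data: each pair is (representation of preferred output, representation of rejected output).
chosen = torch.randn(64, 16)
rejected = torch.randn(64, 16)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained reward model then provides the training signal for the policy,
# in place of labeled data or a ground-truth reward.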

Thoughts on the impact of RLHF research

paulfchristiano · Jan 25, 2023, 5:23 PM
252 points
102 comments · 9 min read · LW link

[Link] Why I’m excited about AI-assisted human feedback

janleike · Apr 6, 2022, 3:37 PM
29 points
0 comments · 1 min read · LW link

Compendium of problems with RLHF

Charbel-Raphaël · Jan 29, 2023, 11:40 AM
120 points
16 comments · 10 min read · LW link

Mysteries of mode collapse

janus · Nov 8, 2022, 10:37 AM
284 points
57 comments · 14 min read · LW link · 1 review

Interpreting the Learning of Deceit

RogerDearnaley · Dec 18, 2023, 8:12 AM
30 points
14 comments · 9 min read · LW link

The Waluigi Effect (mega-post)

Cleo Nardo · Mar 3, 2023, 3:22 AM
634 points
188 comments · 16 min read · LW link

Trying to disambiguate different questions about whether RLHF is “good”

Buck · Dec 14, 2022, 4:03 AM
106 points
47 comments · 7 min read · LW link · 1 review

Model-driven feedback could amplify alignment failures

aogara · Jan 30, 2023, 12:00 AM
21 points
1 comment · 2 min read · LW link

The True Story of How GPT-2 Became Maximally Lewd

Jan 18, 2024, 9:03 PM
70 points
7 comments · 6 min read · LW link
(youtu.be)

Run evals on base models too!

orthonormal · Apr 4, 2024, 6:43 PM
48 points
6 comments · 1 min read · LW link

AXRP Episode 33 - RLHF Problems with Scott Emmons

DanielFilan · Jun 12, 2024, 3:30 AM
34 points
0 comments · 56 min read · LW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · Jul 6, 2024, 1:23 AM
60 points
39 comments · 24 min read · LW link

RLHF is the worst possible thing done when facing the alignment problem

tailcalled · Sep 19, 2024, 6:56 PM
32 points
10 comments · 6 min read · LW link

Towards Understanding Sycophancy in Language Models

Oct 24, 2023, 12:30 AM
66 points
0 comments · 2 min read · LW link
(arxiv.org)

Paul Christiano on Dwarkesh Podcast

ESRogs · Nov 3, 2023, 10:13 PM
19 points
0 comments · 1 min read · LW link
(www.dwarkeshpatel.com)

Mode collapse in RL may be fueled by the update equation

Jun 19, 2023, 9:51 PM
49 points
10 comments · 8 min read · LW link

Is behavioral safety “solved” in non-adversarial conditions?

Robert_AIZI · May 25, 2023, 5:56 PM
26 points
8 comments · 2 min read · LW link
(aizi.substack.com)

MetaAI: less is less for alignment.

Cleo Nardo · Jun 13, 2023, 2:08 PM
71 points
17 comments · 5 min read · LW link

AI #23: Fundamental Problems with RLHF

Zvi · Aug 3, 2023, 12:50 PM
59 points
9 comments · 41 min read · LW link
(thezvi.wordpress.com)

[Question] Beginner’s question about RLHF

FTPickle · Aug 8, 2023, 3:48 PM
1 point
3 comments · 1 min read · LW link

[Question] Don’t you think RLHF solves outer alignment?

Charbel-Raphaël · Nov 4, 2022, 12:36 AM
9 points
23 comments · 1 min read · LW link

A philosopher’s critique of RLHF

TW123 · Nov 7, 2022, 2:42 AM
55 points
8 comments · 2 min read · LW link

Update to Mysteries of mode collapse: text-davinci-002 not RLHF

janus · Nov 19, 2022, 11:51 PM
71 points
8 comments · 2 min read · LW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

Dec 5, 2022, 8:28 PM
40 points
19 comments · 10 min read · LW link

Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.

Charlie Steiner · Dec 12, 2022, 11:51 AM
33 points
13 comments · 2 min read · LW link

Take 10: Fine-tuning with RLHF is aesthetically unsatisfying.

Charlie Steiner · Dec 13, 2022, 7:04 AM
37 points
3 comments · 2 min read · LW link

Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

LawrenceC · Dec 16, 2022, 10:12 PM
68 points
11 comments · 1 min read · LW link
(www.anthropic.com)

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike · Dec 5, 2022, 10:51 PM
98 points
15 comments · 1 min read · LW link
(aligned.substack.com)

Take 13: RLHF bad, conditioning good.

Charlie Steiner · Dec 22, 2022, 10:44 AM
54 points
4 comments · 2 min read · LW link

Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic)

LawrenceC · Feb 16, 2023, 7:47 PM
65 points
9 comments · 1 min read · LW link
(arxiv.org)

A library for safety research in conditioning on RLHF tasks

James Chua · Feb 26, 2023, 2:50 PM
10 points
2 comments · 1 min read · LW link

RLHF does not appear to differentially cause mode-collapse

Mar 20, 2023, 3:39 PM
95 points
9 comments · 3 min read · LW link

Why do we need RLHF? Imitation, Inverse RL, and the role of reward

Ran W · Feb 3, 2024, 4:00 AM
15 points
0 comments · 5 min read · LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Nov 7, 2023, 5:59 PM
36 points
2 comments · 2 min read · LW link
(arxiv.org)

Artefacts generated by mode collapse in GPT-4 Turbo serve as adversarial attacks.

Sohaib Imran · Nov 10, 2023, 3:23 PM
11 points
0 comments · 2 min read · LW link

Reflections On The Feasibility Of Scalable-Oversight

Felix Hofstätter · Mar 10, 2023, 7:54 AM
11 points
0 comments · 12 min read · LW link

The Compleat Cybornaut

May 19, 2023, 8:44 AM
65 points
2 comments · 16 min read · LW link

RLHF

Ansh Radhakrishnan · May 12, 2022, 9:18 PM
18 points
5 comments · 5 min read · LW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF

Christopher King · Jun 29, 2023, 4:56 PM
7 points
0 comments · 2 min read · LW link

Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI

Benaya Koren · Jul 8, 2023, 5:32 PM
6 points
0 comments · 9 min read · LW link

Open Problems and Fundamental Limitations of RLHF

scasper · Jul 31, 2023, 3:31 PM
66 points
6 comments · 2 min read · LW link
(arxiv.org)

On the Importance of Open Sourcing Reward Models

elandgre · Jan 2, 2023, 7:01 PM
18 points
5 comments · 6 min read · LW link

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

Oct 12, 2023, 7:58 PM
151 points
29 comments · 14 min read · LW link

Censorship in LLMs is here to stay because it mirrors how our own intelligence is structured

mnvr · Oct 5, 2023, 5:37 PM
3 points
0 comments · 1 min read · LW link

unRLHF—Efficiently undoing LLM safeguards

Oct 12, 2023, 7:58 PM
117 points
15 comments · 20 min read · LW link

VLM-RM: Specifying Rewards with Natural Language

Oct 23, 2023, 2:11 PM
20 points
2 comments · 5 min read · LW link
(far.ai)

Optimality is the tiger, and annoying the user is its teeth

Christopher King · Jan 28, 2023, 8:20 PM
25 points
6 comments · 2 min read · LW link

Recommend HAIST resources for assessing the value of RLHF-related alignment research

Nov 5, 2022, 8:58 PM
26 points
9 comments · 3 min read · LW link

Learning from Human Preferences—from OpenAI (including Christiano, Amodei & Legg)

Dr_Manhattan · Jun 13, 2017, 3:52 PM
17 points
12 comments · 1 min read · LW link
(blog.openai.com)

A first success story for Outer Alignment: InstructGPT

Noosphere89 · Nov 8, 2022, 10:52 PM
6 points
1 comment · 1 min read · LW link
(openai.com)

Human preferences as RL critic values—implications for alignment

Seth Herd · Mar 14, 2023, 10:10 PM
26 points
6 comments · 6 min read · LW link

Pretraining Language Models with Human Preferences

Feb 21, 2023, 5:57 PM
135 points
20 comments · 11 min read · LW link · 2 reviews

[ASoT] Finetuning, RL, and GPT’s world prior

Jozdien · Dec 2, 2022, 4:33 PM
44 points
8 comments · 5 min read · LW link

Validator models: A simple approach to detecting goodharting

beren · Feb 20, 2023, 9:32 PM
14 points
1 comment · 4 min read · LW link

[Preprint] Pretraining Language Models with Human Preferences

Giulio · Feb 21, 2023, 11:44 AM
12 points
0 comments · 1 min read · LW link
(arxiv.org)

DIY RLHF: A simple implementation for hands on experience

Jul 10, 2024, 12:07 PM
28 points
0 comments · 6 min read · LW link

The case for more ambitious language model evals

Jozdien · Jan 30, 2024, 12:01 AM
112 points
30 comments · 5 min read · LW link

[Question] Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

Yann Dubois · Dec 19, 2022, 10:42 PM
5 points
6 comments · 1 min read · LW link

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

Nov 7, 2024, 3:39 PM
50 points
6 comments · 11 min read · LW link

Contextual Constitutional AI

aksh-n · Sep 28, 2024, 11:24 PM
12 points
2 comments · 12 min read · LW link

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets

Abhimanyu Pallavi Sudhir · Sep 16, 2024, 1:04 AM
5 points
1 comment · 5 min read · LW link

[Question] Why is Gemini telling the user to die?

Burny · Nov 18, 2024, 1:44 AM
13 points
1 comment · 1 min read · LW link

A proposal for iterated interpretability with known-interpretable narrow AIs

Peter Berggren · Jan 11, 2025, 2:43 PM
6 points
0 comments · 2 min read · LW link

DeepSeek-R1 for Beginners

Anton Razzhigaev · Feb 5, 2025, 6:58 PM
10 points
0 comments · 8 min read · LW link

Imitation Learning from Language Feedback

Mar 30, 2023, 2:11 PM
71 points
3 comments · 10 min read · LW link

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2

Christopher King · Mar 31, 2023, 5:05 PM
6 points
4 comments · 4 min read · LW link

Exploratory Analysis of RLHF Transformers with TransformerLens

Curt Tigges · Apr 3, 2023, 4:09 PM
21 points
2 comments · 11 min read · LW link
(blog.eleuther.ai)

Natural language alignment

Jacy Reese Anthis · Apr 12, 2023, 7:02 PM
31 points
2 comments · 2 min read · LW link

An alternative of PPO towards alignment

ml hkust · Apr 17, 2023, 5:58 PM
2 points
2 comments · 4 min read · LW link

Proposal: Using Monte Carlo tree search instead of RLHF for alignment research

Christopher King · Apr 20, 2023, 7:57 PM
2 points
7 comments · 3 min read · LW link

[Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF

Leon Lang · Oct 22, 2024, 1:57 PM
51 points
2 comments · 18 min read · LW link
(arxiv.org)

Compositional preference models for aligning LMs

Tomek Korbak · Oct 25, 2023, 12:17 PM
18 points
2 comments · 5 min read · LW link

Wireheading and misalignment by composition on NetHack

pierlucadoro · Oct 27, 2023, 5:43 PM
34 points
4 comments · 4 min read · LW link