
RLHF


Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which a model is trained on a signal derived from human evaluations of its outputs, rather than from labeled data or a ground-truth reward signal.
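As a rough illustration (not taken from any of the posts below), the supervised core of RLHF is a reward model fit to pairwise human preference comparisons. The sketch below assumes a tiny scoring network and random stand-in "embeddings" purely for brevity; in a full pipeline the learned reward is then maximized with an RL algorithm such as PPO, typically with a KL penalty keeping the policy close to the pretrained model.

```python
# Illustrative sketch only: fit a reward model on pairwise human preferences,
# the supervised step at the core of RLHF. The network and data are stand-ins.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a fixed-size representation of (prompt, response) to a scalar reward."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

# Stand-in data: random "embeddings" of a chosen and a rejected response for
# each prompt. In practice these representations come from the language model.
torch.manual_seed(0)
chosen, rejected = torch.randn(256, 128), torch.randn(256, 128)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    # Pairwise (Bradley-Terry style) logistic loss: the human-preferred
    # response should receive a higher reward than the rejected one.
    loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The fitted reward model then supplies the training signal for an RL step
# (e.g. PPO with a KL penalty toward the pretrained policy).
```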

Thoughts on the impact of RLHF research

paulfchristiano · 25 Jan 2023 17:23 UTC
250 points
102 comments · 9 min read · LW link

[Link] Why I’m excited about AI-assisted human feedback

janleike · 6 Apr 2022 15:37 UTC
29 points
0 comments · 1 min read · LW link

Compendium of problems with RLHF

Charbel-Raphaël · 29 Jan 2023 11:40 UTC
120 points
16 comments · 10 min read · LW link

Mysteries of mode collapse

janus · 8 Nov 2022 10:37 UTC
283 points
57 comments · 14 min read · LW link · 1 review

Interpreting the Learning of Deceit

RogerDearnaley · 18 Dec 2023 8:12 UTC
30 points
14 comments · 9 min read · LW link

The Waluigi Effect (mega-post)

Cleo Nardo · 3 Mar 2023 3:22 UTC
628 points
187 comments · 16 min read · LW link

Trying to disambiguate different questions about whether RLHF is “good”

Buck · 14 Dec 2022 4:03 UTC
106 points
47 comments · 7 min read · LW link · 1 review

Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.

Charlie Steiner · 12 Dec 2022 11:51 UTC
33 points
13 comments · 2 min read · LW link

Take 10: Fine-tuning with RLHF is aesthetically unsatisfying.

Charlie Steiner · 13 Dec 2022 7:04 UTC
37 points
3 comments · 2 min read · LW link

Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

LawrenceC · 16 Dec 2022 22:12 UTC
68 points
11 comments · 1 min read · LW link
(www.anthropic.com)

Mode collapse in RL may be fueled by the update equation

19 Jun 2023 21:51 UTC
49 points
10 comments · 8 min read · LW link

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike · 5 Dec 2022 22:51 UTC
98 points
15 comments · 1 min read · LW link
(aligned.substack.com)

Paul Christiano on Dwarkesh Podcast

ESRogs · 3 Nov 2023 22:13 UTC
17 points
0 comments · 1 min read · LW link
(www.dwarkeshpatel.com)

Is behavioral safety “solved” in non-adversarial conditions?

Robert_AIZI · 25 May 2023 17:56 UTC
26 points
8 comments · 2 min read · LW link
(aizi.substack.com)

MetaAI: less is less for alignment.

Cleo Nardo · 13 Jun 2023 14:08 UTC
68 points
17 comments · 5 min read · LW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · 6 Jul 2024 1:23 UTC
56 points
39 comments · 24 min read · LW link

The True Story of How GPT-2 Became Maximally Lewd

18 Jan 2024 21:03 UTC
70 points
7 comments · 6 min read · LW link
(youtu.be)

Model-driven feedback could amplify alignment failures

aogara · 30 Jan 2023 0:00 UTC
21 points
1 comment · 2 min read · LW link

Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic)

LawrenceC · 16 Feb 2023 19:47 UTC
65 points
9 comments · 1 min read · LW link
(arxiv.org)

Towards Understanding Sycophancy in Language Models

24 Oct 2023 0:30 UTC
66 points
0 comments · 2 min read · LW link
(arxiv.org)

AI #23: Fundamental Problems with RLHF

Zvi · 3 Aug 2023 12:50 UTC
59 points
9 comments · 41 min read · LW link
(thezvi.wordpress.com)

[Question] Beginner’s question about RLHF

FTPickle · 8 Aug 2023 15:48 UTC
1 point
3 comments · 1 min read · LW link

A library for safety research in conditioning on RLHF tasks

James Chua · 26 Feb 2023 14:50 UTC
10 points
2 comments · 1 min read · LW link

Run evals on base models too!

orthonormal · 4 Apr 2024 18:43 UTC
47 points
6 comments · 1 min read · LW link

RLHF does not appear to differentially cause mode-collapse

20 Mar 2023 15:39 UTC
95 points
9 comments · 3 min read · LW link

Take 13: RLHF bad, conditioning good.

Charlie Steiner · 22 Dec 2022 10:44 UTC
54 points
4 comments · 2 min read · LW link

AXRP Episode 33 - RLHF Problems with Scott Emmons

DanielFilan · 12 Jun 2024 3:30 UTC
34 points
0 comments · 56 min read · LW link

[Question] Don’t you think RLHF solves outer alignment?

Charbel-Raphaël · 4 Nov 2022 0:36 UTC
9 points
23 comments · 1 min read · LW link

A philosopher’s critique of RLHF

TW123 · 7 Nov 2022 2:42 UTC
55 points
8 comments · 2 min read · LW link

RLHF is the worst possible thing done when facing the alignment problem

tailcalled · 19 Sep 2024 18:56 UTC
32 points
10 comments · 6 min read · LW link

Update to Mysteries of mode collapse: text-davinci-002 not RLHF

janus · 19 Nov 2022 23:51 UTC
71 points
8 comments · 2 min read · LW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

5 Dec 2022 20:28 UTC
40 points
19 comments · 10 min read · LW link

Learning from Human Preferences - from OpenAI (including Christiano, Amodei & Legg)

Dr_Manhattan · 13 Jun 2017 15:52 UTC
17 points
12 comments · 1 min read · LW link
(blog.openai.com)

A first success story for Outer Alignment: InstructGPT

Noosphere89 · 8 Nov 2022 22:52 UTC
6 points
1 comment · 1 min read · LW link
(openai.com)

[ASoT] Finetuning, RL, and GPT’s world prior

Jozdien · 2 Dec 2022 16:33 UTC
44 points
8 comments · 5 min read · LW link

[Question] Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

Yann Dubois · 19 Dec 2022 22:42 UTC
5 points
6 comments · 1 min read · LW link

RLHF

Ansh Radhakrishnan · 12 May 2022 21:18 UTC
18 points
5 comments · 5 min read · LW link

On the Importance of Open Sourcing Reward Models

elandgre · 2 Jan 2023 19:01 UTC
18 points
5 comments · 6 min read · LW link

Optimality is the tiger, and annoying the user is its teeth

Christopher King · 28 Jan 2023 20:20 UTC
25 points
6 comments · 2 min read · LW link

Pretraining Language Models with Human Preferences

21 Feb 2023 17:57 UTC
134 points
19 comments · 11 min read · LW link

Validator models: A simple approach to detecting goodharting

beren · 20 Feb 2023 21:32 UTC
14 points
1 comment · 4 min read · LW link

[Preprint] Pretraining Language Models with Human Preferences

Giulio · 21 Feb 2023 11:44 UTC
12 points
0 comments · 1 min read · LW link
(arxiv.org)

Reflections On The Feasibility Of Scalable-Oversight

Felix Hofstätter · 10 Mar 2023 7:54 UTC
11 points
0 comments · 12 min read · LW link

Human preferences as RL critic values - implications for alignment

Seth Herd · 14 Mar 2023 22:10 UTC
26 points
6 comments · 6 min read · LW link

Recommend HAIST resources for assessing the value of RLHF-related alignment research

5 Nov 2022 20:58 UTC
26 points
9 comments · 3 min read · LW link

The case for more ambitious language model evals

Jozdien · 30 Jan 2024 0:01 UTC
110 points
30 comments · 5 min read · LW link

Why do we need RLHF? Imitation, Inverse RL, and the role of reward

Ran W · 3 Feb 2024 4:00 UTC
14 points
0 comments · 5 min read · LW link

[Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF

Leon Lang · 22 Oct 2024 13:57 UTC
47 points
0 comments · 18 min read · LW link
(arxiv.org)

DIY RLHF: A simple implementation for hands on experience

10 Jul 2024 12:07 UTC
28 points
0 comments · 6 min read · LW link

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets

Abhimanyu Pallavi Sudhir · 16 Sep 2024 1:04 UTC
5 points
1 comment · 5 min read · LW link

Contextual Constitutional AI

aksh-n · 28 Sep 2024 23:24 UTC
12 points
2 comments · 12 min read · LW link

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

7 Nov 2024 15:39 UTC
47 points
6 comments · 11 min read · LW link

[Question] Why is Gemini telling the user to die?

Burny · 18 Nov 2024 1:44 UTC
13 points
1 comment · 1 min read · LW link

Imitation Learning from Language Feedback

30 Mar 2023 14:11 UTC
71 points
3 comments · 10 min read · LW link

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2

Christopher King · 31 Mar 2023 17:05 UTC
6 points
4 comments · 4 min read · LW link

Exploratory Analysis of RLHF Transformers with TransformerLens

Curt Tigges · 3 Apr 2023 16:09 UTC
21 points
2 comments · 11 min read · LW link
(blog.eleuther.ai)

Natural language alignment

Jacy Reese Anthis · 12 Apr 2023 19:02 UTC
31 points
2 comments · 2 min read · LW link

An alternative of PPO towards alignment

ml hkust · 17 Apr 2023 17:58 UTC
2 points
2 comments · 4 min read · LW link

Proposal: Using Monte Carlo tree search instead of RLHF for alignment research

Christopher King · 20 Apr 2023 19:57 UTC
2 points
7 comments · 3 min read · LW link

Compositional preference models for aligning LMs

Tomek Korbak · 25 Oct 2023 12:17 UTC
18 points
2 comments · 5 min read · LW link

Wireheading and misalignment by composition on NetHack

pierlucadoro · 27 Oct 2023 17:43 UTC
34 points
4 comments · 4 min read · LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

7 Nov 2023 17:59 UTC
36 points
2 comments · 2 min read · LW link
(arxiv.org)

Artefacts generated by mode collapse in GPT-4 Turbo serve as adversarial attacks.

Sohaib Imran · 10 Nov 2023 15:23 UTC
11 points
0 comments · 2 min read · LW link

The Compleat Cybornaut

19 May 2023 8:44 UTC
64 points
2 comments · 16 min read · LW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF

Christopher King · 29 Jun 2023 16:56 UTC
7 points
0 comments · 2 min read · LW link

Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI

Benaya Koren · 8 Jul 2023 17:32 UTC
6 points
0 comments · 9 min read · LW link

Open Problems and Fundamental Limitations of RLHF

scasper · 31 Jul 2023 15:31 UTC
66 points
6 comments · 2 min read · LW link
(arxiv.org)

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

12 Oct 2023 19:58 UTC
151 points
29 comments · 14 min read · LW link

Censorship in LLMs is here to stay because it mirrors how our own intelligence is structured

mnvr · 5 Oct 2023 17:37 UTC
3 points
0 comments · 1 min read · LW link

unRLHF - Efficiently undoing LLM safeguards

12 Oct 2023 19:58 UTC
117 points
15 comments · 20 min read · LW link

VLM-RM: Specifying Rewards with Natural Language

23 Oct 2023 14:11 UTC
20 points
2 comments · 5 min read · LW link
(far.ai)