
Language Models

Last edit: 11 May 2023 8:20 UTC by Yaakov T

Language models are computer programs that estimate how likely a given piece of text is. “Hello, how are you?” is likely; “Hello, fnarg horses” is unlikely.
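The idea can be sketched with a toy model. The character-bigram scorer below is illustrative only: the mini-corpus and every name in it are invented for this example, and real language models are deep networks trained on vastly more text. The interface is the same, though: text in, likelihood out.

```python
import math
from collections import Counter

# Toy character-bigram "language model": count how often each character
# follows each other character in a tiny training corpus.
corpus = (
    "hello how are you today i am very well thank you "
    "how is the weather it is nice to see you hello there "
    "what are you doing today"
)
bigram_counts = Counter(zip(corpus, corpus[1:]))
char_counts = Counter(corpus)
vocab_size = len(set(corpus))

def avg_log_likelihood(text: str) -> float:
    """Average smoothed log P(next char | current char) over the text."""
    total = 0.0
    for a, b in zip(text, text[1:]):
        # Laplace smoothing: unseen bigrams get a small nonzero probability.
        p = (bigram_counts[(a, b)] + 1) / (char_counts[a] + vocab_size)
        total += math.log(p)
    return total / (len(text) - 1)

likely = avg_log_likelihood("hello how are you")
unlikely = avg_log_likelihood("hello fnarg horses")
print(likely > unlikely)  # → True: the fluent string scores higher
```

The fluent greeting is built from character transitions the model has seen many times, so its average log-probability is higher than that of the gibberish string.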

Language models can answer questions by scoring possible question-and-answer pairs and selecting the most likely one. “Q: How are you? A: Very well, thank you” is a likely pair. “Q: How are you? A: Correct horse battery staple” is an unlikely pair.
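Under the same toy assumptions (an invented mini-corpus and a stand-in bigram scorer, not a real model), answer selection is just an argmax over the scores of the candidate question-and-answer pairs:

```python
import math
from collections import Counter

# Stand-in scorer: any function mapping text -> log-probability would do;
# a real system would query an actual language model instead.
corpus = "q how are you a very well thank you q what time is it a it is noon "
bigrams = Counter(zip(corpus, corpus[1:]))
chars = Counter(corpus)
vocab = len(set(corpus))

def score(text: str) -> float:
    # Average smoothed log-probability per character transition.
    return sum(
        math.log((bigrams[(a, b)] + 1) / (chars[a] + vocab))
        for a, b in zip(text, text[1:])
    ) / (len(text) - 1)

question = "q how are you a "
candidates = ["very well thank you", "correct horse battery staple"]
# Keep the answer whose full question-and-answer pair is most likely.
best_answer = max(candidates, key=lambda ans: score(question + ans))
print(best_answer)  # → very well thank you
```

The coherent answer wins because the full pair it forms looks like text the model has seen; the non sequitur forms a pair full of improbable transitions.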

The language models most relevant to AI safety are based on deep learning. Deep-learning-based language models can be “trained” to model language better by exposing them to human-written text, and the internet supplies an enormous amount of such text.

Deep-learning-based language models are getting bigger and better trained. As they scale, they acquire new skills, including arithmetic, explaining jokes, programming, and solving math problems.

As these models grow larger and better trained, they risk developing dangerous capabilities. What additional skills will they develop over the next few years?

See also

Inverse Scaling Prize: Round 1 Winners

26 Sep 2022 19:57 UTC
93 points
16 comments4 min readLW link
(irmckenzie.co.uk)

Simulators

janus2 Sep 2022 12:45 UTC
610 points
162 comments41 min readLW link8 reviews
(generative.ink)

How LLMs are and are not myopic

janus25 Jul 2023 2:19 UTC
133 points
16 comments8 min readLW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley6 Jul 2024 1:23 UTC
56 points
39 comments24 min readLW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley28 Nov 2023 19:56 UTC
64 points
30 comments11 min readLW link

A Chinese Room Containing a Stack of Stochastic Parrots

RogerDearnaley12 Jan 2024 6:29 UTC
20 points
3 comments5 min readLW link

On the future of language models

owencb20 Dec 2023 16:58 UTC
105 points
17 comments1 min readLW link

Transformer Circuits

evhub22 Dec 2021 21:09 UTC
144 points
4 comments3 min readLW link
(transformer-circuits.pub)

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley5 Jan 2024 8:46 UTC
37 points
4 comments2 min readLW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaley1 Feb 2024 21:15 UTC
13 points
15 comments13 min readLW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor

RogerDearnaley9 Jan 2024 20:42 UTC
47 points
8 comments36 min readLW link

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?

RogerDearnaley11 Jan 2024 12:56 UTC
34 points
4 comments39 min readLW link

Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment

RogerDearnaley7 Dec 2023 6:14 UTC
9 points
0 comments11 min readLW link

Representation Tuning

Christopher Ackerman27 Jun 2024 17:44 UTC
35 points
9 comments13 min readLW link

[Paper] Programming Refusal with Conditional Activation Steering

Bruce W. Lee11 Sep 2024 20:57 UTC
41 points
0 comments11 min readLW link
(arxiv.org)

Invocations: The Other Capabilities Overhang?

Robert_AIZI4 Apr 2023 13:38 UTC
29 points
4 comments4 min readLW link
(aizi.substack.com)

Results from the language model hackathon

Esben Kran10 Oct 2022 8:29 UTC
22 points
1 comment4 min readLW link

LLMs Universally Learn a Feature Representing Token Frequency / Rarity

Sean Osier30 Jun 2024 2:48 UTC
12 points
5 comments6 min readLW link
(github.com)

Truthful LMs as a warm-up for aligned AGI

Jacob_Hilton17 Jan 2022 16:49 UTC
65 points
14 comments13 min readLW link

LLM Modularity: The Separability of Capabilities in Large Language Models

NickyP26 Mar 2023 21:57 UTC
99 points
3 comments41 min readLW link

Modulating sycophancy in an RLHF model via activation steering

Nina Panickssery9 Aug 2023 7:06 UTC
69 points
20 comments12 min readLW link

LLMs may capture key components of human agency

catubc17 Nov 2022 20:14 UTC
27 points
0 comments4 min readLW link

Applying refusal-vector ablation to a Llama 3 70B agent

Simon Lermen11 May 2024 0:08 UTC
51 points
14 comments7 min readLW link

Inverse Scaling Prize: Second Round Winners

24 Jan 2023 20:12 UTC
58 points
17 comments15 min readLW link

Testing PaLM prompts on GPT3

Yitz6 Apr 2022 5:21 UTC
103 points
14 comments8 min readLW link

Large Language Models will be Great for Censorship

Ethan Edwards21 Aug 2023 19:03 UTC
183 points
14 comments8 min readLW link
(ethanedwards.substack.com)

Finding Neurons in a Haystack: Case Studies with Sparse Probing

3 May 2023 13:30 UTC
33 points
5 comments2 min readLW link
(arxiv.org)

[Question] Is there a ‘time series forecasting’ equivalent of AIXI?

Solenoid_Entity17 May 2023 4:35 UTC
12 points
2 comments1 min readLW link

[ASoT] Some thoughts about LM monologue limitations and ELK

leogao30 Mar 2022 14:26 UTC
10 points
0 comments2 min readLW link

Procedurally evaluating factual accuracy: a request for research

Jacob_Hilton30 Mar 2022 16:37 UTC
25 points
2 comments6 min readLW link

[Link] Training Compute-Optimal Large Language Models

nostalgebraist31 Mar 2022 18:01 UTC
51 points
23 comments1 min readLW link
(arxiv.org)

Inflection AI: New startup related to language models

Nisan2 Apr 2022 5:35 UTC
21 points
1 comment1 min readLW link

New Scaling Laws for Large Language Models

1a3orn1 Apr 2022 20:41 UTC
246 points
22 comments5 min readLW link

Claude 3.5 Sonnet

Zach Stein-Perlman20 Jun 2024 18:00 UTC
75 points
41 comments1 min readLW link
(www.anthropic.com)

Attention Output SAEs Improve Circuit Analysis

21 Jun 2024 12:56 UTC
31 points
0 comments19 min readLW link

How to train your transformer

p.b.7 Apr 2022 9:34 UTC
6 points
0 comments8 min readLW link

“On the Impossibility of Superintelligent Rubik’s Cube Solvers”, Claude 2024 [humor]

gwern23 Jun 2024 21:18 UTC
22 points
6 comments1 min readLW link
(gwern.net)

Prediction Market Trading as an LLM Benchmark

Jesse Richardson25 Jun 2024 0:46 UTC
8 points
1 comment4 min readLW link

Language Model Tools for Alignment Research

Logan Riggs8 Apr 2022 17:32 UTC
28 points
0 comments2 min readLW link

On Claude 3.5 Sonnet

Zvi24 Jun 2024 12:00 UTC
95 points
14 comments13 min readLW link
(thezvi.wordpress.com)

AMA Conjecture, A New Alignment Startup

adamShimi9 Apr 2022 9:43 UTC
47 points
42 comments1 min readLW link

Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs

Michaël Trazzi24 Aug 2024 4:30 UTC
55 points
0 comments5 min readLW link

Why I Believe LLMs Do Not Have Human-like Emotions

OneManyNone22 May 2023 15:46 UTC
13 points
6 comments7 min readLW link

[Linkpost] New multi-modal Deepmind model fusing Chinchilla with images and videos

p.b.30 Apr 2022 3:47 UTC
53 points
18 comments1 min readLW link

PaLM-2 & GPT-4 in “Extrapolating GPT-N performance”

Lukas Finnveden30 May 2023 18:33 UTC
55 points
6 comments6 min readLW link

Paper: Teaching GPT3 to express uncertainty in words

Owain_Evans31 May 2022 13:27 UTC
97 points
7 comments4 min readLW link

Bootstrapping Language Models

harsimony27 May 2022 19:43 UTC
7 points
5 comments2 min readLW link

LIMA: Less Is More for Alignment

Ulisse Mini30 May 2023 17:10 UTC
16 points
6 comments1 min readLW link
(arxiv.org)

[Linkpost] Vague Verbiage in Forecasting

trevor22 Mar 2024 18:05 UTC
11 points
9 comments3 min readLW link
(goodjudgment.com)

Covert Malicious Finetuning

2 Jul 2024 2:41 UTC
88 points
4 comments3 min readLW link

“LLMs Don’t Have a Coherent Model of the World”—What it Means, Why it Matters

Davidmanheim1 Jun 2023 7:46 UTC
31 points
2 comments7 min readLW link

Lamda is not an LLM

Kevin19 Jun 2022 11:13 UTC
7 points
10 comments1 min readLW link
(www.wired.com)

Musings on LLM Scale (Jul 2024)

Vladimir_Nesov3 Jul 2024 18:35 UTC
33 points
0 comments3 min readLW link

Conditioning Generative Models

Adam Jermyn25 Jun 2022 22:15 UTC
24 points
18 comments10 min readLW link

Claude Doesn’t Want to Die

garrison5 Mar 2024 6:00 UTC
22 points
3 comments1 min readLW link
(garrisonlovely.substack.com)

Assessing AlephAlphas Multimodal Model

p.b.28 Jun 2022 9:28 UTC
30 points
5 comments3 min readLW link

LEAst-squares Concept Erasure (LEACE)

tricky_labyrinth7 Jun 2023 21:51 UTC
68 points
10 comments1 min readLW link
(twitter.com)

[Linkpost] Solving Quantitative Reasoning Problems with Language Models

Yitz30 Jun 2022 18:58 UTC
76 points
15 comments2 min readLW link
(storage.googleapis.com)

Minerva

Algon1 Jul 2022 20:06 UTC
36 points
6 comments2 min readLW link
(ai.googleblog.com)

Deep learning curriculum for large language model alignment

Jacob_Hilton13 Jul 2022 21:58 UTC
57 points
3 comments1 min readLW link
(github.com)

Conditioning Generative Models for Alignment

Jozdien18 Jul 2022 7:11 UTC
59 points
8 comments20 min readLW link

MetaAI: less is less for alignment.

Cleo Nardo13 Jun 2023 14:08 UTC
68 points
17 comments5 min readLW link

Experiments in Evaluating Steering Vectors

Gytis Daujotas19 Jun 2023 15:11 UTC
34 points
4 comments4 min readLW link

[Question] Impact of ” ‘Let’s think step by step’ is all you need”?

yrimon24 Jul 2022 20:59 UTC
20 points
2 comments1 min readLW link

chinchilla’s wild implications

nostalgebraist31 Jul 2022 1:18 UTC
420 points
128 comments10 min readLW link1 review

[Question] Why no major LLMs with memory?

Kaj_Sotala28 Mar 2023 16:34 UTC
41 points
15 comments1 min readLW link

Emergent Abilities of Large Language Models [Linkpost]

aogara10 Aug 2022 18:02 UTC
25 points
2 comments1 min readLW link
(arxiv.org)

Language models seem to be much better than humans at next-token prediction

11 Aug 2022 17:45 UTC
182 points
60 comments13 min readLW link1 review

A little playing around with Blenderbot3

Nathan Helm-Burger12 Aug 2022 16:06 UTC
9 points
0 comments1 min readLW link

Corrigibility, Self-Deletion, and Identical Strawberries

Robert_AIZI28 Mar 2023 16:54 UTC
9 points
2 comments6 min readLW link
(aizi.substack.com)

[Question] Are language models close to the superhuman level in philosophy?

Roman Leventov19 Aug 2022 4:43 UTC
6 points
2 comments2 min readLW link

“textbooks are all you need”

bhauth21 Jun 2023 17:06 UTC
66 points
18 comments2 min readLW link
(arxiv.org)

A Test for Language Model Consciousness

Ethan Perez25 Aug 2022 19:41 UTC
18 points
14 comments9 min readLW link

Strategy For Conditioning Generative Models

1 Sep 2022 4:34 UTC
31 points
4 comments18 min readLW link

Relational Speaking

jefftk21 Jun 2023 14:40 UTC
11 points
0 comments2 min readLW link
(www.jefftk.com)

Using Claude to convert dialog transcripts into great posts?

mako yass21 Jun 2023 20:19 UTC
6 points
4 comments4 min readLW link

AlexaTM − 20 Billion Parameter Model With Impressive Performance

MrThink9 Sep 2022 21:46 UTC
5 points
0 comments1 min readLW link

Three of my beliefs about upcoming AGI

Robert_AIZI27 Mar 2023 20:27 UTC
6 points
0 comments3 min readLW link
(aizi.substack.com)

[Question] Which parts of the existing internet are already likely to be in (GPT-5/other soon-to-be-trained LLMs)’s training corpus?

AnnaSalamon29 Mar 2023 5:17 UTC
49 points
2 comments1 min readLW link

Sparse trinary weighted RNNs as a path to better language model interpretability

Am8ryllis17 Sep 2022 19:48 UTC
19 points
13 comments3 min readLW link

Role Architectures: Applying LLMs to consequential tasks

Eric Drexler30 Mar 2023 15:00 UTC
60 points
7 comments9 min readLW link

Douglas Hofstadter changes his mind on Deep Learning & AI risk (June 2023)?

gwern3 Jul 2023 0:48 UTC
425 points
54 comments7 min readLW link
(www.youtube.com)

Goal-Direction for Simulated Agents

Raymond D12 Jul 2023 17:06 UTC
33 points
2 comments6 min readLW link

[Question] If I ask an LLM to think step by step, how big are the steps?

ryan_b13 Sep 2024 20:30 UTC
7 points
1 comment1 min readLW link

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

8 Jul 2024 22:24 UTC
103 points
28 comments5 min readLW link

UC Berkeley course on LLMs and ML Safety

Dan H9 Jul 2024 15:40 UTC
36 points
1 comment1 min readLW link
(rdi.berkeley.edu)

Activation adding experiments with llama-7b

Nina Panickssery16 Jul 2023 4:17 UTC
51 points
1 comment3 min readLW link

LLMs as a Planning Overhang

Larks14 Jul 2024 2:54 UTC
38 points
8 comments2 min readLW link

Case for Foundation Models beyond English

Varshul Gupta21 Jul 2023 13:59 UTC
1 point
0 comments3 min readLW link
(dubverseblack.substack.com)

How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles & parrots

Owain_Evans28 Mar 2024 2:34 UTC
26 points
0 comments9 min readLW link

Enhancing biosecurity with language models: defining research directions

mic26 Mar 2024 12:30 UTC
12 points
0 comments1 min readLW link
(papers.ssrn.com)

Steering Behaviour: Testing for (Non-)Myopia in Language Models

5 Dec 2022 20:28 UTC
40 points
19 comments10 min readLW link

Watermarking considered overrated?

DanielFilan31 Jul 2023 21:36 UTC
19 points
4 comments1 min readLW link

Did ChatGPT just gaslight me?

TW1231 Dec 2022 5:41 UTC
123 points
45 comments9 min readLW link
(aiwatchtower.substack.com)

GPT-4 can catch subtle cross-language translation mistakes

Michael Tontchev27 Jul 2023 1:39 UTC
7 points
1 comment1 min readLW link

Chat GPT’s views on Metaphysics and Ethics

Cole Killian3 Dec 2022 18:12 UTC
5 points
3 comments1 min readLW link
(twitter.com)

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery28 Jul 2023 2:46 UTC
122 points
17 comments9 min readLW link

Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT

Robert_AIZI5 Mar 2024 13:55 UTC
61 points
24 comments10 min readLW link
(aizi.substack.com)

[Question] Does a LLM have a utility function?

Dagon9 Dec 2022 17:19 UTC
17 points
11 comments1 min readLW link

Pre-registering a study

Robert_AIZI7 Apr 2023 15:46 UTC
10 points
0 comments6 min readLW link
(aizi.substack.com)

Upcoming Changes in Large Language Models

Andrew Keenan Richardson8 Apr 2023 3:41 UTC
43 points
8 comments4 min readLW link
(mechanisticmind.com)

Discovering Latent Knowledge in Language Models Without Supervision

Xodarap14 Dec 2022 12:32 UTC
45 points
1 comment1 min readLW link
(arxiv.org)

Claude 3 claims it’s conscious, doesn’t want to die or be modified

Mikhail Samin4 Mar 2024 23:05 UTC
72 points
113 comments14 min readLW link

Take 11: “Aligning language models” should be weirder.

Charlie Steiner18 Dec 2022 14:14 UTC
34 points
0 comments2 min readLW link

Mapping the semantic void: Strange goings-on in GPT embedding spaces

mwatkins14 Dec 2023 13:10 UTC
114 points
31 comments14 min readLW link

Implementing activation steering

Annah5 Feb 2024 17:51 UTC
66 points
7 comments7 min readLW link

Discovering Language Model Behaviors with Model-Written Evaluations

20 Dec 2022 20:08 UTC
100 points
34 comments1 min readLW link
(www.anthropic.com)

Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic

Akash20 Dec 2022 21:39 UTC
18 points
2 comments11 min readLW link

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

8 Aug 2023 1:30 UTC
312 points
28 comments18 min readLW link

Scaffolded LLMs as natural language computers

beren12 Apr 2023 10:47 UTC
94 points
10 comments11 min readLW link

Mlyyrczo

lsusr26 Dec 2022 7:58 UTC
41 points
14 comments3 min readLW link

Anthropic release Claude 3, claims >GPT-4 Performance

LawrenceC4 Mar 2024 18:23 UTC
115 points
41 comments2 min readLW link
(www.anthropic.com)

Does Chat-GPT display ‘Scope Insensitivity’?

callum7 Dec 2023 18:58 UTC
11 points
0 comments3 min readLW link

‘simulator’ framing and confusions about LLMs

Beth Barnes31 Dec 2022 23:38 UTC
104 points
11 comments4 min readLW link

Steering GPT-2-XL by adding an activation vector

13 May 2023 18:42 UTC
436 points
97 comments50 min readLW link

Research Discussion on PSCA with Claude Sonnet 3.5

Robert Kralisch24 Jul 2024 16:53 UTC
−2 points
0 comments25 min readLW link

The ‘ petertodd’ phenomenon

mwatkins15 Apr 2023 0:59 UTC
192 points
49 comments38 min readLW link

Paper: On measuring situational awareness in LLMs

4 Sep 2023 12:54 UTC
108 points
16 comments5 min readLW link
(arxiv.org)

SmartyHeaderCode: anomalous tokens for GPT3.5 and GPT-4

AdamYedidia15 Apr 2023 22:35 UTC
71 points
18 comments6 min readLW link

And All the Shoggoths Merely Players

Zack_M_Davis10 Feb 2024 19:56 UTC
160 points
57 comments12 min readLW link

“AI achieves silver-medal standard solving International Mathematical Olympiad problems”

gjm25 Jul 2024 15:58 UTC
133 points
38 comments2 min readLW link
(deepmind.google)

Proposal for Inducing Steganography in LMs

Logan Riggs12 Jan 2023 22:15 UTC
22 points
3 comments2 min readLW link

[Linkpost] Scaling Laws for Generative Mixed-Modal Language Models

Amal 12 Jan 2023 14:24 UTC
15 points
2 comments1 min readLW link
(arxiv.org)

[Question] Basic Question about LLMs: how do they know what task to perform

Garak14 Jan 2023 13:13 UTC
1 point
3 comments1 min readLW link

Understanding the diffusion of large language models: summary

Ben Cottier16 Jan 2023 1:37 UTC
26 points
1 comment1 min readLW link

Language models can generate superior text compared to their input

ChristianKl17 Jan 2023 10:57 UTC
48 points
28 comments1 min readLW link

Thoughts on refusing harmful requests to large language models

William_S19 Jan 2023 19:49 UTC
32 points
4 comments2 min readLW link

Testing for parallel reasoning in LLMs

19 May 2024 15:28 UTC
3 points
7 comments9 min readLW link

An explanation for every token: using an LLM to sample another LLM

Max H11 Oct 2023 0:53 UTC
35 points
5 comments11 min readLW link

Why did ChatGPT say that? Prompt engineering and more, with PIZZA.

Jessica Rumbelow3 Aug 2024 12:07 UTC
40 points
2 comments4 min readLW link

Do LLMs dream of emergent sheep?

Shmi24 Apr 2023 3:26 UTC
16 points
2 comments1 min readLW link

Conditioning Predictive Models: Large language models as predictors

2 Feb 2023 20:28 UTC
88 points
4 comments13 min readLW link

Conditioning Predictive Models: Outer alignment via careful conditioning

2 Feb 2023 20:28 UTC
72 points
15 comments57 min readLW link

I didn’t think I’d take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!

mako yass2 Aug 2024 22:35 UTC
24 points
2 comments5 min readLW link

Conditioning Predictive Models: The case for competitiveness

6 Feb 2023 20:08 UTC
20 points
3 comments11 min readLW link

Claude 3 Opus can operate as a Turing machine

Gunnar_Zarncke17 Apr 2024 8:41 UTC
36 points
2 comments1 min readLW link
(twitter.com)

SolidGoldMagikarp II: technical details and more recent findings

6 Feb 2023 19:09 UTC
111 points
45 comments13 min readLW link

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

21 Sep 2023 15:30 UTC
158 points
8 comments5 min readLW link

Can I take ducks home from the park?

dynomight14 Sep 2023 21:03 UTC
67 points
8 comments3 min readLW link
(dynomight.net)

LLM Basics: Embedding Spaces—Transformer Token Vectors Are Not Points in Space

NickyP13 Feb 2023 18:52 UTC
79 points
11 comments15 min readLW link

Conditioning Predictive Models: Interactions with other approaches

8 Feb 2023 18:19 UTC
32 points
2 comments11 min readLW link

Notes on the Mathematics of LLM Architectures

carboniferous_umbraculum 9 Feb 2023 1:45 UTC
13 points
2 comments1 min readLW link
(drive.google.com)

SolidGoldMagikarp (plus, prompt generation)

5 Feb 2023 22:02 UTC
676 points
205 comments12 min readLW link

Conditioning Predictive Models: Deployment strategy

9 Feb 2023 20:59 UTC
28 points
0 comments10 min readLW link

An examination of GPT-2’s boring yet effective glitch

MiguelDev18 Apr 2024 5:26 UTC
5 points
3 comments3 min readLW link

In Defense of Chatbot Romance

Kaj_Sotala11 Feb 2023 14:30 UTC
123 points
52 comments11 min readLW link
(kajsotala.fi)

[Question] Is InstructGPT Following Instructions in Other Languages Surprising?

DragonGod13 Feb 2023 23:26 UTC
39 points
15 comments1 min readLW link

Bing Chat is blatantly, aggressively misaligned

evhub15 Feb 2023 5:29 UTC
400 points
180 comments2 min readLW link

What’s up with all the non-Mormons? Weirdly specific universalities across LLMs

mwatkins19 Apr 2024 13:43 UTC
40 points
13 comments27 min readLW link

Towards Understanding Sycophancy in Language Models

24 Oct 2023 0:30 UTC
66 points
0 comments2 min readLW link
(arxiv.org)

Romance, misunderstanding, social stances, and the human LLM

Kaj_Sotala27 Apr 2023 12:59 UTC
69 points
32 comments16 min readLW link

AGI will be made of heterogeneous components, Transformer and Selective SSM blocks will be among them

Roman Leventov27 Dec 2023 14:51 UTC
33 points
9 comments4 min readLW link

Paper: LLMs trained on “A is B” fail to learn “B is A”

23 Sep 2023 19:55 UTC
120 points
74 comments4 min readLW link
(arxiv.org)

AI doom from an LLM-plateau-ist perspective

Steven Byrnes27 Apr 2023 13:58 UTC
157 points
24 comments6 min readLW link

Some Quick Follow-Up Experiments to “Taken out of context: On measuring situational awareness in LLMs”

Miles Turpin3 Oct 2023 2:22 UTC
31 points
0 comments9 min readLW link

I don’t find the lie detection results that surprising (by an author of the paper)

JanB4 Oct 2023 17:10 UTC
97 points
8 comments3 min readLW link

What do language models know about fictional characters?

skybrian22 Feb 2023 5:58 UTC
6 points
0 comments4 min readLW link

Causal Graphs of GPT-2-Small’s Residual Stream

David Udell9 Jul 2024 22:06 UTC
53 points
7 comments7 min readLW link

Aggregative Principles of Social Justice

Cleo Nardo5 Jun 2024 13:44 UTC
29 points
10 comments37 min readLW link

Meta “open sources” LMs competitive with Chinchilla, PaLM, and code-davinci-002 (Paper)

LawrenceC24 Feb 2023 19:57 UTC
38 points
19 comments1 min readLW link
(research.facebook.com)

A Proposed Test to Determine the Extent to Which Large Language Models Understand the Real World

Bruce G24 Feb 2023 20:20 UTC
4 points
7 comments8 min readLW link

Evil autocomplete: Existential Risk and Next-Token Predictors

Yitz28 Feb 2023 8:47 UTC
9 points
3 comments5 min readLW link

Mapping the semantic void II: Above, below and between token embeddings

mwatkins15 Feb 2024 23:00 UTC
31 points
4 comments10 min readLW link

[Linkpost] Play with SAEs on Llama 3

25 Sep 2024 22:35 UTC
40 points
2 comments1 min readLW link

[Question] Supposing the 1bit LLM paper pans out

O O29 Feb 2024 5:31 UTC
27 points
11 comments1 min readLW link

The Waluigi Effect (mega-post)

Cleo Nardo3 Mar 2023 3:22 UTC
628 points
187 comments16 min readLW link

The Stochastic Parrot Hypothesis is debatable for the last generation of LLMs

7 Nov 2023 16:12 UTC
52 points
20 comments6 min readLW link

Google’s PaLM-E: An Embodied Multimodal Language Model

SandXbox7 Mar 2023 4:11 UTC
87 points
7 comments1 min readLW link
(palm-e.github.io)

Worrisome misunderstanding of the core issues with AI transition

Roman Leventov18 Jan 2024 10:05 UTC
5 points
2 comments4 min readLW link

Revealing Intentionality In Language Models Through AdaVAE Guided Sampling

jdp20 Oct 2023 7:32 UTC
119 points
15 comments22 min readLW link

Mechanistically Eliciting Latent Behaviors in Language Models

30 Apr 2024 18:51 UTC
204 points
40 comments45 min readLW link

GPT can write Quines now (GPT-4)

Andrew_Critch14 Mar 2023 19:18 UTC
112 points
30 comments1 min readLW link

Eleuther releases Llemma: An Open Language Model For Mathematics

mako yass17 Oct 2023 20:03 UTC
22 points
0 comments1 min readLW link
(blog.eleuther.ai)

Nokens: A potential method of investigating glitch tokens

Hoagy15 Mar 2023 16:23 UTC
21 points
0 comments4 min readLW link

What’s up with LLMs representing XORs of arbitrary features?

Sam Marks3 Jan 2024 19:44 UTC
157 points
61 comments16 min readLW link

Exploring SAE features in LLMs with definition trees and token lists

mwatkins4 Oct 2024 22:15 UTC
37 points
5 comments6 min readLW link

[Question] Will 2023 be the last year you can write short stories and receive most of the intellectual credit for writing them?

lc16 Mar 2023 21:36 UTC
20 points
11 comments1 min readLW link

Super-Luigi = Luigi + (Luigi—Waluigi)

Alexei17 Mar 2023 15:27 UTC
16 points
9 comments1 min readLW link

Towards Evaluating AI Systems for Moral Status Using Self-Reports

16 Nov 2023 20:18 UTC
45 points
3 comments1 min readLW link
(arxiv.org)

Musings on Text Data Wall (Oct 2024)

Vladimir_Nesov5 Oct 2024 19:00 UTC
20 points
2 comments5 min readLW link

What does it mean for an LLM such as GPT to be aligned / good / positive impact?

PashaKamyshev20 Mar 2023 9:21 UTC
4 points
3 comments10 min readLW link

Examples of How I Use LLMs

jefftk14 Oct 2024 17:10 UTC
29 points
2 comments2 min readLW link
(www.jefftk.com)

RLHF does not appear to differentially cause mode-collapse

20 Mar 2023 15:39 UTC
95 points
9 comments3 min readLW link

‘ petertodd’’s last stand: The final days of open GPT-3 research

mwatkins22 Jan 2024 18:47 UTC
109 points
16 comments45 min readLW link

Extrapolating from Five Words

Gordon Seidoh Worley15 Nov 2023 23:21 UTC
40 points
11 comments2 min readLW link

Linear encoding of character-level information in GPT-J token embeddings

10 Nov 2023 22:19 UTC
34 points
4 comments28 min readLW link

SociaLLM: proposal for a language model design for personalised apps, social science, and AI safety research

Roman Leventov19 Dec 2023 16:49 UTC
17 points
5 comments3 min readLW link

Navigating LLM embedding spaces using archetype-based directions

mwatkins8 May 2024 5:54 UTC
15 points
4 comments28 min readLW link

Paper: Tell, Don’t Show- Declarative facts influence how LLMs generalize

19 Dec 2023 19:14 UTC
45 points
4 comments6 min readLW link
(arxiv.org)

Finding Sparse Linear Connections between Features in LLMs

9 Dec 2023 2:27 UTC
69 points
5 comments10 min readLW link

Cognitive Biases in Large Language Models

Jan25 Sep 2021 20:59 UTC
18 points
3 comments12 min readLW link
(universalprior.substack.com)

NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG

Ozyrus11 Oct 2021 15:28 UTC
51 points
36 comments1 min readLW link
(developer.nvidia.com)

NLP Position Paper: When Combatting Hype, Proceed with Caution

Sam Bowman15 Oct 2021 20:57 UTC
46 points
14 comments1 min readLW link

Studying The Alien Mind

5 Dec 2023 17:27 UTC
80 points
10 comments15 min readLW link

Forecasting progress in language models

28 Oct 2021 20:40 UTC
62 points
6 comments11 min readLW link
(www.metaculus.com)

Why keep a diary, and why wish for large language models

DanielFilan14 Jun 2024 16:10 UTC
9 points
1 comment2 min readLW link
(danielfilan.com)

Deepmind’s Gopher—more powerful than GPT-3

hath8 Dec 2021 17:06 UTC
86 points
26 comments1 min readLW link
(deepmind.com)

Teaser: Hard-coding Transformer Models

MadHatter12 Dec 2021 22:04 UTC
74 points
19 comments1 min readLW link

LLM Applications I Want To See

sarahconstantin19 Aug 2024 21:10 UTC
102 points
5 comments8 min readLW link
(sarahconstantin.substack.com)

Language Model Alignment Research Internships

Ethan Perez13 Dec 2021 19:53 UTC
74 points
1 comment1 min readLW link

Language Models are a Potentially Safe Path to Human-Level AGI

Nadav Brandes20 Apr 2023 0:40 UTC
28 points
6 comments8 min readLW link

Residual stream norms grow exponentially over the forward pass

7 May 2023 0:46 UTC
76 points
24 comments11 min readLW link

Understanding the tensor product formulation in Transformer Circuits

Tom Lieberum24 Dec 2021 18:05 UTC
16 points
2 comments3 min readLW link

AI Safety Chatbot

21 Dec 2023 14:06 UTC
61 points
11 comments4 min readLW link

Inferring the model dimension of API-protected LLMs

Ege Erdil18 Mar 2024 6:19 UTC
34 points
3 comments4 min readLW link
(arxiv.org)

New OpenAI Paper—Language models can explain neurons in language models

MrThink10 May 2023 7:46 UTC
47 points
14 comments1 min readLW link

A one-question Turing test for GPT-3

22 Jan 2022 18:17 UTC
85 points
25 comments5 min readLW link

LLM-Secured Systems: A General-Purpose Tool For Structured Transparency

ozziegooen18 Jun 2024 0:21 UTC
7 points
1 comment1 min readLW link

Jailbreak steering generalization

20 Jun 2024 17:25 UTC
41 points
4 comments2 min readLW link
(arxiv.org)

LLM Guardrails Should Have Better Customer Service Tuning

Jiao Bu13 May 2023 22:54 UTC
2 points
0 comments2 min readLW link

Quick Thoughts on Language Models

RohanS18 Jul 2023 20:38 UTC
6 points
0 comments4 min readLW link

Unsafe AI as Dynamical Systems

Robert_AIZI14 Jul 2023 15:31 UTC
11 points
0 comments3 min readLW link
(aizi.substack.com)

Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping

RobertKirk20 Jul 2023 9:56 UTC
39 points
2 comments5 min readLW link

Anticipation in LLMs

derek shiller24 Jul 2023 15:53 UTC
6 points
0 comments13 min readLW link

AI Awareness through Interaction with Blatantly Alien Models

VojtaKovarik28 Jul 2023 8:41 UTC
7 points
5 comments3 min readLW link

[Linkpost] Multimodal Neurons in Pretrained Text-Only Transformers

Bogdan Ionut Cirstea4 Aug 2023 15:29 UTC
11 points
0 comments1 min readLW link

[Linkpost] Deception Abilities Emerged in Large Language Models

Bogdan Ionut Cirstea3 Aug 2023 17:28 UTC
12 points
0 comments1 min readLW link

Researchers and writers can apply for proxy access to the GPT-3.5 base model (code-davinci-002)

ampdot1 Dec 2023 18:48 UTC
14 points
0 comments1 min readLW link
(airtable.com)

A Simple Theory Of Consciousness

SherlockHolmes8 Aug 2023 18:05 UTC
2 points
5 comments1 min readLW link
(peterholmes.medium.com)

Inflection.ai is a major AGI lab

nikola9 Aug 2023 1:05 UTC
137 points
13 comments2 min readLW link

Exploring the Multiverse of Large Language Models

franky6 Aug 2023 2:38 UTC
1 point
0 comments5 min readLW link

Google DeepMind’s RT-2

SandXbox11 Aug 2023 11:26 UTC
9 points
1 comment1 min readLW link
(robotics-transformer2.github.io)

Coherence Therapy with LLMs—quick demo

Chipmonk14 Aug 2023 3:34 UTC
19 points
11 comments1 min readLW link

[Question] Any research in “probe-tuning” of LLMs?

Roman Leventov15 Aug 2023 21:01 UTC
20 points
3 comments1 min readLW link

Memetic Judo #3: The Intelligence of Stochastic Parrots v.2

Max TK20 Aug 2023 15:18 UTC
8 points
33 comments6 min readLW link

[Question] Would it be useful to collect the contexts, where various LLMs think the same?

Martin Vlach24 Aug 2023 22:01 UTC
6 points
1 comment1 min readLW link

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

30 Aug 2023 17:36 UTC
17 points
0 comments8 min readLW link
(arxiv.org)

Xanadu, GPT, and Beyond: An adventure of the mind

Bill Benzon27 Aug 2023 16:19 UTC
2 points
0 comments5 min readLW link

An In­ter­pretabil­ity Illu­sion for Ac­ti­va­tion Patch­ing of Ar­bi­trary Subspaces

29 Aug 2023 1:04 UTC
77 points
4 comments1 min readLW link

Re­port on An­a­lyz­ing Con­no­ta­tion Frames in Evolv­ing Wikipe­dia Biographies

Maira30 Aug 2023 22:02 UTC
1 point
0 comments4 min readLW link

Can an LLM iden­tify ring-com­po­si­tion in a liter­ary text? [ChatGPT]

Bill Benzon1 Sep 2023 14:18 UTC
4 points
2 comments11 min readLW link

[Linkpost] Large lan­guage mod­els con­verge to­ward hu­man-like con­cept organization

Bogdan Ionut Cirstea2 Sep 2023 6:00 UTC
22 points
1 comment1 min readLW link

What must be the case that ChatGPT would have mem­o­rized “To be or not to be”? – Three kinds of con­cep­tual ob­jects for LLMs

Bill Benzon3 Sep 2023 18:39 UTC
19 points
0 comments12 min readLW link

Ac­tAdd: Steer­ing Lan­guage Models with­out Optimization

6 Sep 2023 17:21 UTC
105 points
3 comments2 min readLW link
(arxiv.org)

World, mind, and learn­abil­ity: A note on the meta­phys­i­cal struc­ture of the cos­mos [& LLMs]

Bill Benzon5 Sep 2023 12:19 UTC
4 points
1 comment5 min readLW link

Au­to­mat­i­cally find­ing fea­ture vec­tors in the OV cir­cuits of Trans­form­ers with­out us­ing probing

Jacob Dunefsky12 Sep 2023 17:38 UTC
13 points
0 comments29 min readLW link

Eval­u­at­ing hid­den di­rec­tions on the util­ity dataset: clas­sifi­ca­tion, steer­ing and removal

25 Sep 2023 17:19 UTC
25 points
3 comments7 min readLW link

Un­cov­er­ing La­tent Hu­man Wel­lbe­ing in LLM Embeddings

14 Sep 2023 1:40 UTC
32 points
7 comments8 min readLW link
(far.ai)

[un­ti­tled post]

verwindung14 Sep 2023 16:22 UTC
1 point
0 comments1 min readLW link

Dis­cur­sive Com­pe­tence in ChatGPT, Part 2: Me­mory for Texts

Bill Benzon28 Sep 2023 16:34 UTC
1 point
0 comments3 min readLW link

Graph­i­cal ten­sor no­ta­tion for interpretability

Jordan Taylor4 Oct 2023 8:04 UTC
137 points
11 comments19 min readLW link

Image Hi­jacks: Ad­ver­sar­ial Images can Con­trol Gen­er­a­tive Models at Runtime

20 Sep 2023 15:23 UTC
58 points
9 comments1 min readLW link
(arxiv.org)

Notes on ChatGPT’s “mem­ory” for strings and for events

Bill Benzon20 Sep 2023 18:12 UTC
3 points
0 comments10 min readLW link

A quick re­mark on so-called “hal­lu­ci­na­tions” in LLMs and hu­mans

Bill Benzon23 Sep 2023 12:17 UTC
4 points
4 comments1 min readLW link

Ex­pec­ta­tions for Gem­ini: hope­fully not a big deal

Maxime Riché2 Oct 2023 15:38 UTC
15 points
5 comments1 min readLW link

What would it mean to un­der­stand how a large lan­guage model (LLM) works? Some quick notes.

Bill Benzon3 Oct 2023 15:11 UTC
20 points
4 comments8 min readLW link

En­tan­gle­ment and in­tu­ition about words and mean­ing

Bill Benzon4 Oct 2023 14:16 UTC
4 points
0 comments2 min readLW link

Re­fusal mechanisms: ini­tial ex­per­i­ments with Llama-2-7b-chat

8 Dec 2023 17:08 UTC
81 points
7 comments7 min readLW link

Un­der­stand­ing LLMs: Some ba­sic ob­ser­va­tions about words, syn­tax, and dis­course [w/​ a con­jec­ture about grokking]

Bill Benzon11 Oct 2023 19:13 UTC
6 points
0 comments5 min readLW link

VLM-RM: Spec­i­fy­ing Re­wards with Nat­u­ral Language

23 Oct 2023 14:11 UTC
20 points
2 comments5 min readLW link
(far.ai)

Are (at least some) Large Lan­guage Models Holo­graphic Me­mory Stores?

Bill Benzon20 Oct 2023 13:07 UTC
11 points
4 comments6 min readLW link

ChatGPT tells 20 ver­sions of its pro­to­typ­i­cal story, with a short note on method

Bill Benzon14 Oct 2023 15:27 UTC
6 points
0 comments5 min readLW link

Map­ping ChatGPT’s on­tolog­i­cal land­scape, gra­di­ents and choices [in­ter­pretabil­ity]

Bill Benzon15 Oct 2023 20:12 UTC
1 point
0 comments18 min readLW link

ChatGPT Plays 20 Ques­tions [some­times needs help]

Bill Benzon17 Oct 2023 17:30 UTC
5 points
3 comments12 min readLW link

Align­ment Im­pli­ca­tions of LLM Suc­cesses: a De­bate in One Act

Zack_M_Davis21 Oct 2023 15:22 UTC
241 points
50 comments13 min readLW link

Thoughts on the Align­ment Im­pli­ca­tions of Scal­ing Lan­guage Models

leogao2 Jun 2021 21:32 UTC
82 points
11 comments17 min readLW link

[AN #144]: How lan­guage mod­els can also be fine­tuned for non-lan­guage tasks

Rohin Shah2 Apr 2021 17:20 UTC
19 points
0 comments6 min readLW link
(mailchi.mp)

How truth­ful is GPT-3? A bench­mark for lan­guage models

Owain_Evans16 Sep 2021 10:09 UTC
58 points
24 comments6 min readLW link

[Question] How does OpenAI’s lan­guage model af­fect our AI timeline es­ti­mates?

jimrandomh15 Feb 2019 3:11 UTC
50 points
7 comments1 min readLW link

Build­ing AGI Us­ing Lan­guage Models

leogao9 Nov 2020 16:33 UTC
11 points
1 comment1 min readLW link
(leogao.dev)

The case for al­ign­ing nar­rowly su­per­hu­man models

Ajeya Cotra5 Mar 2021 22:29 UTC
186 points
75 comments38 min readLW link1 review

The Codex Skep­tic FAQ

Michaël Trazzi24 Aug 2021 16:01 UTC
49 points
24 comments2 min readLW link

On lan­guage mod­el­ing and fu­ture ab­stract rea­son­ing research

alexlyzhov25 Mar 2021 17:43 UTC
3 points
1 comment1 min readLW link
(docs.google.com)

Agen­tic Lan­guage Model Memes

FactorialCode1 Aug 2020 18:03 UTC
16 points
1 comment2 min readLW link

[AN #164]: How well can lan­guage mod­els write code?

Rohin Shah15 Sep 2021 17:20 UTC
13 points
7 comments9 min readLW link
(mailchi.mp)

[AN #113]: Check­ing the eth­i­cal in­tu­itions of large lan­guage models

Rohin Shah19 Aug 2020 17:10 UTC
23 points
0 comments9 min readLW link
(mailchi.mp)

New GPT-3 competitor

Quintin Pope12 Aug 2021 7:05 UTC
32 points
10 comments1 min readLW link

OpenAI Codex: First Impressions

specbug13 Aug 2021 16:52 UTC
49 points
8 comments4 min readLW link
(sixeleven.in)

AMA on Truth­ful AI: Owen Cot­ton-Bar­ratt, Owain Evans & co-authors

Owain_Evans22 Oct 2021 16:23 UTC
31 points
15 comments1 min readLW link

Truth­ful and hon­est AI

29 Oct 2021 7:28 UTC
42 points
1 comment13 min readLW link

larger lan­guage mod­els may dis­ap­point you [or, an eter­nally un­finished draft]

nostalgebraist26 Nov 2021 23:08 UTC
260 points
31 comments31 min readLW link2 reviews

Hard-Cod­ing Neu­ral Computation

MadHatter13 Dec 2021 4:35 UTC
34 points
8 comments27 min readLW link

Ev­i­dence Sets: Towards In­duc­tive-Bi­ases based Anal­y­sis of Pro­saic AGI

bayesian_kitten16 Dec 2021 22:41 UTC
22 points
10 comments21 min readLW link

GPT-3: a dis­ap­point­ing paper

nostalgebraist29 May 2020 19:06 UTC
65 points
43 comments8 min readLW link1 review

A Sum­mary Of An­thropic’s First Paper

Sam Ringer30 Dec 2021 0:48 UTC
85 points
1 comment8 min readLW link

How I’m think­ing about GPT-N

delton13717 Jan 2022 17:11 UTC
54 points
21 comments18 min readLW link

Ex­trap­o­lat­ing GPT-N performance

Lukas Finnveden18 Dec 2020 21:41 UTC
110 points
31 comments22 min readLW link1 review

2+2: On­tolog­i­cal Framework

Lyrialtus1 Feb 2022 1:07 UTC
−15 points
2 comments12 min readLW link

EleutherAI’s GPT-NeoX-20B release

leogao10 Feb 2022 6:56 UTC
30 points
3 comments1 min readLW link
(eaidata.bmk.sh)

New GPT3 Im­pres­sive Ca­pa­bil­ities—In­struc­tGPT3 [1/​2]

simeon_c13 Mar 2022 10:58 UTC
72 points
10 comments7 min readLW link

Gears-Level Men­tal Models of Trans­former Interpretability

KevinRoWang29 Mar 2022 20:09 UTC
72 points
4 comments6 min readLW link

My agenda for re­search into trans­former ca­pa­bil­ities—Introduction

p.b.5 Apr 2022 21:23 UTC
11 points
1 comment3 min readLW link

Re­search agenda: Can trans­form­ers do sys­tem 2 think­ing?

p.b.6 Apr 2022 13:31 UTC
20 points
0 comments2 min readLW link

PaLM in “Ex­trap­o­lat­ing GPT-N perfor­mance”

Lukas Finnveden6 Apr 2022 13:05 UTC
83 points
19 comments2 min readLW link

Re­search agenda—Build­ing a multi-modal chess-lan­guage model

p.b.7 Apr 2022 12:25 UTC
8 points
2 comments2 min readLW link

Is GPT3 a Good Ra­tion­al­ist? - In­struc­tGPT3 [2/​2]

simeon_c7 Apr 2022 13:46 UTC
11 points
0 comments7 min readLW link

Elicit: Lan­guage Models as Re­search Assistants

9 Apr 2022 14:56 UTC
71 points
6 comments13 min readLW link

[Question] “Frag­ility of Value” vs. LLMs

Not Relevant13 Apr 2022 2:02 UTC
34 points
33 comments1 min readLW link

Why Copi­lot Ac­cel­er­ates Timelines

Michaël Trazzi26 Apr 2022 22:06 UTC
35 points
14 comments7 min readLW link

A pos­si­ble check against mo­ti­vated rea­son­ing us­ing elicit.org

david reinstein18 May 2022 20:52 UTC
3 points
0 comments1 min readLW link

RL with KL penalties is bet­ter seen as Bayesian inference

25 May 2022 9:23 UTC
114 points
17 comments12 min readLW link

QNR prospects are im­por­tant for AI al­ign­ment research

Eric Drexler3 Feb 2022 15:20 UTC
85 points
12 comments11 min readLW link1 review

Who mod­els the mod­els that model mod­els? An ex­plo­ra­tion of GPT-3′s in-con­text model fit­ting ability

Lovre7 Jun 2022 19:37 UTC
112 points
16 comments9 min readLW link

[linkpost] The fi­nal AI bench­mark: BIG-bench

RomanS10 Jun 2022 8:53 UTC
25 points
21 comments1 min readLW link

In­ves­ti­gat­ing causal un­der­stand­ing in LLMs

14 Jun 2022 13:57 UTC
28 points
6 comments13 min readLW link

Con­tra Hofs­tadter on GPT-3 Nonsense

rictic15 Jun 2022 21:53 UTC
236 points
24 comments2 min readLW link

Causal con­fu­sion as an ar­gu­ment against the scal­ing hypothesis

20 Jun 2022 10:54 UTC
86 points
30 comments15 min readLW link

Yann LeCun, A Path Towards Au­tonomous Ma­chine In­tel­li­gence [link]

Bill Benzon27 Jun 2022 23:29 UTC
5 points
1 comment1 min readLW link

An­nounc­ing the In­verse Scal­ing Prize ($250k Prize Pool)

27 Jun 2022 15:58 UTC
171 points
14 comments7 min readLW link

GPT-3 Catch­ing Fish in Morse Code

Megan Kinniment30 Jun 2022 21:22 UTC
117 points
27 comments8 min readLW link

Train­ing goals for large lan­guage models

Johannes Treutlein18 Jul 2022 7:09 UTC
28 points
5 comments19 min readLW link

Help ARC eval­u­ate ca­pa­bil­ities of cur­rent lan­guage mod­els (still need peo­ple)

Beth Barnes19 Jul 2022 4:55 UTC
95 points
6 comments2 min readLW link

Con­di­tion­ing Gen­er­a­tive Models with Restrictions

Adam Jermyn21 Jul 2022 20:33 UTC
18 points
4 comments8 min readLW link

Ex­ter­nal­ized rea­son­ing over­sight: a re­search di­rec­tion for lan­guage model alignment

tamera3 Aug 2022 12:03 UTC
130 points
23 comments6 min readLW link

Trans­former lan­guage mod­els are do­ing some­thing more general

Numendil3 Aug 2022 21:13 UTC
53 points
6 comments2 min readLW link

Con­di­tion­ing, Prompts, and Fine-Tuning

Adam Jermyn17 Aug 2022 20:52 UTC
38 points
9 comments4 min readLW link

Google AI in­te­grates PaLM with robotics: SayCan up­date [Linkpost]

Evan R. Murphy24 Aug 2022 20:54 UTC
25 points
0 comments1 min readLW link
(sites.research.google)

Is train­ing data go­ing to be diluted by AI-gen­er­ated con­tent?

Hannes Thurnherr7 Sep 2022 18:13 UTC
10 points
7 comments1 min readLW link

How should Deep­Mind’s Chin­chilla re­vise our AI fore­casts?

Cleo Nardo15 Sep 2022 17:54 UTC
35 points
12 comments13 min readLW link

Take­aways from our ro­bust in­jury clas­sifier pro­ject [Red­wood Re­search]

dmz17 Sep 2022 3:55 UTC
143 points
12 comments6 min readLW link1 review

[Question] If we have Hu­man-level chat­bots, won’t we end up be­ing ruled by pos­si­ble peo­ple?

Erlja Jkdf.20 Sep 2022 13:59 UTC
5 points
13 comments1 min readLW link

An Un­ex­pected GPT-3 De­ci­sion in a Sim­ple Gam­ble

hatta_afiq25 Sep 2022 16:46 UTC
8 points
4 comments1 min readLW link

Re­call and Re­gur­gi­ta­tion in GPT2

Megan Kinniment3 Oct 2022 19:35 UTC
43 points
1 comment26 min readLW link

Brief Notes on Transformers

Adam Jermyn26 Sep 2022 14:46 UTC
48 points
3 comments2 min readLW link

Paper: Large Lan­guage Models Can Self-im­prove [Linkpost]

Evan R. Murphy2 Oct 2022 1:29 UTC
52 points
15 comments1 min readLW link
(openreview.net)

Smoke with­out fire is scary

Adam Jermyn4 Oct 2022 21:08 UTC
51 points
22 comments4 min readLW link

They gave LLMs ac­cess to physics simulators

ryan_b17 Oct 2022 21:21 UTC
50 points
18 comments1 min readLW link
(arxiv.org)

Is GPT-N bounded by hu­man ca­pa­bil­ities? No.

Cleo Nardo17 Oct 2022 23:26 UTC
48 points
8 comments2 min readLW link

Learn­ing so­cietal val­ues from law as part of an AGI al­ign­ment strategy

John Nay21 Oct 2022 2:03 UTC
5 points
18 comments54 min readLW link

What will the scaled up GATO look like? (Up­dated with ques­tions)

Amal 25 Oct 2022 12:44 UTC
34 points
22 comments1 min readLW link

[simu­la­tion] 4chan user claiming to be the at­tor­ney hired by Google’s sen­tient chat­bot LaMDA shares wild de­tails of encounter

janus10 Nov 2022 21:39 UTC
19 points
1 comment13 min readLW link
(generative.ink)

Hu­man-level Full-Press Di­plo­macy (some bare facts).

Cleo Nardo22 Nov 2022 20:59 UTC
50 points
7 comments3 min readLW link

Gliders in Lan­guage Models

Alexandre Variengien25 Nov 2022 0:38 UTC
30 points
11 comments10 min readLW link

[ASoT] Fine­tun­ing, RL, and GPT’s world prior

Jozdien2 Dec 2022 16:33 UTC
44 points
8 comments5 min readLW link

[Question] Will the first AGI agent have been de­signed as an agent (in ad­di­tion to an AGI)?

nahoj3 Dec 2022 20:32 UTC
1 point
8 comments1 min readLW link

Is the “Valley of Con­fused Ab­strac­tions” real?

jacquesthibs5 Dec 2022 13:36 UTC
19 points
11 comments2 min readLW link

Shh, don’t tell the AI it’s likely to be evil

naterush6 Dec 2022 3:35 UTC
19 points
9 comments1 min readLW link

Pro­saic mis­al­ign­ment from the Solomonoff Predictor

Cleo Nardo9 Dec 2022 17:53 UTC
42 points
3 comments5 min readLW link

A brain­teaser for lan­guage models

Adam Scherlis12 Dec 2022 2:43 UTC
47 points
3 comments2 min readLW link

An ex­plo­ra­tion of GPT-2′s em­bed­ding weights

Adam Scherlis13 Dec 2022 0:46 UTC
42 points
4 comments10 min readLW link

Ex­tract­ing and Eval­u­at­ing Causal Direc­tion in LLMs’ Activations

14 Dec 2022 14:33 UTC
29 points
5 comments11 min readLW link

Prop­er­ties of cur­rent AIs and some pre­dic­tions of the evolu­tion of AI from the per­spec­tive of scale-free the­o­ries of agency and reg­u­la­tive development

Roman Leventov20 Dec 2022 17:13 UTC
33 points
3 comments36 min readLW link

Notes on Meta’s Di­plo­macy-Play­ing AI

Erich_Grunewald22 Dec 2022 11:34 UTC
14 points
2 comments14 min readLW link
(www.erichgrunewald.com)

The Limit of Lan­guage Models

DragonGod6 Jan 2023 23:53 UTC
44 points
26 comments4 min readLW link

How evolu­tion­ary lineages of LLMs can plan their own fu­ture and act on these plans

Roman Leventov25 Dec 2022 18:11 UTC
39 points
16 comments8 min readLW link

Re­cent ad­vances in Nat­u­ral Lan­guage Pro­cess­ing—Some Woolly spec­u­la­tions (2019 es­say on se­man­tics and lan­guage mod­els)

philosophybear27 Dec 2022 2:11 UTC
1 point
0 comments7 min readLW link

Some Ar­gu­ments Against Strong Scaling

Joar Skalse13 Jan 2023 12:04 UTC
26 points
21 comments16 min readLW link

Large lan­guage mod­els can provide “nor­ma­tive as­sump­tions” for learn­ing hu­man preferences

Stuart_Armstrong2 Jan 2023 19:39 UTC
29 points
12 comments3 min readLW link

MAKE IT BETTER (a po­etic demon­stra­tion of the ba­nal­ity of GPT-3)

rogersbacon2 Jan 2023 20:47 UTC
7 points
2 comments5 min readLW link

On the nat­u­ral­is­tic study of the lin­guis­tic be­hav­ior of ar­tifi­cial intelligence

Bill Benzon3 Jan 2023 9:06 UTC
1 point
0 comments4 min readLW link

Whisper’s Wild Implications

Ollie J3 Jan 2023 12:17 UTC
19 points
6 comments5 min readLW link

How it feels to have your mind hacked by an AI

blaked12 Jan 2023 0:33 UTC
361 points
221 comments17 min readLW link

Spec­u­la­tion on Path-Depen­dance in Large Lan­guage Models.

NickyP15 Jan 2023 20:42 UTC
16 points
2 comments7 min readLW link

Cri­tique of some re­cent philos­o­phy of LLMs’ minds

Roman Leventov20 Jan 2023 12:53 UTC
52 points
8 comments20 min readLW link

Emo­tional at­tach­ment to AIs opens doors to problems

Igor Ivanov22 Jan 2023 20:28 UTC
20 points
10 comments4 min readLW link

ChatGPT in­ti­mates a tan­ta­l­iz­ing fu­ture; its core LLM is or­ga­nized on mul­ti­ple lev­els; and it has bro­ken the idea of think­ing.

Bill Benzon24 Jan 2023 19:05 UTC
5 points
0 comments5 min readLW link

In­ner Misal­ign­ment in “Si­mu­la­tor” LLMs

Adam Scherlis31 Jan 2023 8:33 UTC
84 points
12 comments4 min readLW link

Early situ­a­tional aware­ness and its im­pli­ca­tions, a story

Jacob Pfau6 Feb 2023 20:45 UTC
29 points
6 comments3 min readLW link

Two very differ­ent ex­pe­riences with ChatGPT

Sherrinford7 Feb 2023 13:09 UTC
38 points
15 comments5 min readLW link

On The Cur­rent Sta­tus Of AI Dating

Nikita Brancatisano7 Feb 2023 20:00 UTC
52 points
8 comments6 min readLW link

A note on ‘semiotic physics’

metasemi11 Feb 2023 5:12 UTC
11 points
13 comments6 min readLW link

A poem co-writ­ten by ChatGPT

Sherrinford16 Feb 2023 10:17 UTC
13 points
0 comments7 min readLW link

Pow­er­ful mesa-op­ti­mi­sa­tion is already here

Roman Leventov17 Feb 2023 4:59 UTC
35 points
1 comment2 min readLW link
(arxiv.org)

Bing chat is the AI fire alarm

Ratios17 Feb 2023 6:51 UTC
115 points
63 comments3 min readLW link

Microsoft and OpenAI, stop tel­ling chat­bots to role­play as AI

hold_my_fish17 Feb 2023 19:55 UTC
49 points
10 comments1 min readLW link

GPT-4 Predictions

Stephen McAleese17 Feb 2023 23:20 UTC
109 points
27 comments11 min readLW link

Stop post­ing prompt in­jec­tions on Twit­ter and call­ing it “mis­al­ign­ment”

lc19 Feb 2023 2:21 UTC
144 points
9 comments1 min readLW link

Syd­ney the Bin­gena­tor Can’t Think, But It Still Threat­ens People

Valentin Baltadzhiev20 Feb 2023 18:37 UTC
−3 points
2 comments8 min readLW link

The idea that ChatGPT is sim­ply “pre­dict­ing” the next word is, at best, misleading

Bill Benzon20 Feb 2023 11:32 UTC
55 points
87 comments5 min readLW link

Pre­train­ing Lan­guage Models with Hu­man Preferences

21 Feb 2023 17:57 UTC
134 points
19 comments11 min readLW link

[Preprint] Pre­train­ing Lan­guage Models with Hu­man Preferences

Giulio21 Feb 2023 11:44 UTC
12 points
0 comments1 min readLW link
(arxiv.org)

[Question] In­ject­ing noise to GPT to get mul­ti­ple answers

bipolo22 Feb 2023 20:02 UTC
1 point
1 comment1 min readLW link

Hello, Elua.

Tamsin Leake23 Feb 2023 5:19 UTC
38 points
18 comments4 min readLW link
(carado.moe)

Reflec­tion Mechanisms as an Align­ment Tar­get—At­ti­tudes on “near-term” AI

2 Mar 2023 4:29 UTC
21 points
0 comments8 min readLW link

Si­tu­a­tional aware­ness in Large Lan­guage Models

Simon Möller3 Mar 2023 18:59 UTC
30 points
2 comments7 min readLW link

The View from 30,000 Feet: Pre­face to the Se­cond EleutherAI Retrospective

7 Mar 2023 16:22 UTC
14 points
0 comments4 min readLW link
(blog.eleuther.ai)

Against LLM Reductionism

Erich_Grunewald8 Mar 2023 15:52 UTC
140 points
17 comments18 min readLW link
(www.erichgrunewald.com)

Stop call­ing it “jailbreak­ing” ChatGPT

Templarrr10 Mar 2023 11:41 UTC
7 points
9 comments2 min readLW link

The is­sue of mean­ing in large lan­guage mod­els (LLMs)

Bill Benzon11 Mar 2023 23:00 UTC
1 point
34 comments8 min readLW link

ChatGPT (and now GPT4) is very eas­ily dis­tracted from its rules

dmcs15 Mar 2023 17:55 UTC
180 points
42 comments1 min readLW link

Grad­ual take­off, fast failure

Max H16 Mar 2023 22:02 UTC
15 points
4 comments5 min readLW link

[Question] Are nested jailbreaks in­evitable?

judson17 Mar 2023 17:43 UTC
1 point
0 comments1 min readLW link

GPTs’ abil­ity to keep a se­cret is weirdly prompt-dependent

22 Jul 2023 12:21 UTC
31 points
0 comments9 min readLW link

In­stan­ti­at­ing an agent with GPT-4 and text-davinci-003

Max H19 Mar 2023 23:57 UTC
13 points
3 comments32 min readLW link

[Question] What ev­i­dence is there of LLM’s con­tain­ing world mod­els?

Chris_Leong4 Oct 2023 14:33 UTC
17 points
17 comments1 min readLW link

Cat­e­gor­i­cal Or­ga­ni­za­tion in Me­mory: ChatGPT Or­ga­nizes the 665 Topic Tags from My New Sa­vanna Blog

Bill Benzon14 Dec 2023 13:02 UTC
0 points
6 comments2 min readLW link

A vi­sual anal­ogy for text gen­er­a­tion by LLMs?

Bill Benzon16 Dec 2023 17:58 UTC
3 points
0 comments1 min readLW link

Dis­cus­sion: Challenges with Un­su­per­vised LLM Knowl­edge Discovery

18 Dec 2023 11:58 UTC
147 points
21 comments10 min readLW link

Lifel­og­ging for Align­ment & Immortality

Dev.Errata17 Aug 2024 23:42 UTC
13 points
3 comments7 min readLW link

Ap­proach­ing Hu­man-Level Fore­cast­ing with Lan­guage Models

29 Feb 2024 22:36 UTC
60 points
6 comments3 min readLW link

We In­spected Every Head In GPT-2 Small us­ing SAEs So You Don’t Have To

6 Mar 2024 5:03 UTC
58 points
0 comments12 min readLW link

Ex­plor­ing the Resi­d­ual Stream of Trans­form­ers for Mechanis­tic In­ter­pretabil­ity — Explained

Zeping Yu26 Dec 2023 0:36 UTC
7 points
1 comment11 min readLW link

The fu­ture of Hu­mans: Oper­a­tors of AI

François-Joseph Lacroix30 Dec 2023 23:46 UTC
1 point
0 comments1 min readLW link
(medium.com)

Does ChatGPT know what a tragedy is?

Bill Benzon31 Dec 2023 7:10 UTC
2 points
4 comments5 min readLW link

An­nounc­ing the Dou­ble Crux Bot

9 Jan 2024 18:54 UTC
52 points
8 comments3 min readLW link

Sparse Au­toen­coders Work on At­ten­tion Layer Outputs

16 Jan 2024 0:26 UTC
83 points
9 comments18 min readLW link

Just be­cause an LLM said it doesn’t mean it’s true: an illus­tra­tive example

dirk21 Aug 2024 21:05 UTC
26 points
12 comments3 min readLW link

What’s go­ing on with Per-Com­po­nent Weight Up­dates?

4gate22 Aug 2024 21:22 UTC
1 point
0 comments6 min readLW link

Maybe talk­ing isn’t the best way to com­mu­ni­cate with LLMs

mnvr17 Jan 2024 6:24 UTC
3 points
1 comment1 min readLW link
(mrmr.io)

OpenAI Credit Ac­count (2510$)

Emirhan BULUT21 Jan 2024 2:30 UTC
1 point
0 comments1 min readLW link

In­terLab – a toolkit for ex­per­i­ments with multi-agent interactions

22 Jan 2024 18:23 UTC
69 points
0 comments8 min readLW link
(acsresearch.org)

Pre­dict­ing AGI by the Tur­ing Test

Yuxi_Liu22 Jan 2024 4:22 UTC
21 points
2 comments10 min readLW link
(yuxi-liu-wired.github.io)

RAND re­port finds no effect of cur­rent LLMs on vi­a­bil­ity of bioter­ror­ism attacks

StellaAthena25 Jan 2024 19:17 UTC
94 points
14 comments1 min readLW link
(www.rand.org)

Put­ting mul­ti­modal LLMs to the Tetris test

1 Feb 2024 16:02 UTC
30 points
5 comments7 min readLW link

De­cep­tion and Jailbreak Se­quence: 1. Iter­a­tive Refine­ment Stages of De­cep­tion in LLMs

22 Aug 2024 7:32 UTC
23 points
1 comment21 min readLW link

Why I take short timelines seriously

NicholasKees28 Jan 2024 22:27 UTC
121 points
29 comments4 min readLW link

Un­der­stand­ing SAE Fea­tures with the Logit Lens

11 Mar 2024 0:16 UTC
59 points
0 comments14 min readLW link

The case for more am­bi­tious lan­guage model evals

Jozdien30 Jan 2024 0:01 UTC
110 points
30 comments5 min readLW link

Look­ing be­yond Everett in mul­ti­ver­sal views of LLMs

kromem29 May 2024 12:35 UTC
10 points
0 comments8 min readLW link

In­duc­ing hu­man-like bi­ases in moral rea­son­ing LMs

20 Feb 2024 16:28 UTC
23 points
3 comments14 min readLW link

At­ten­tion SAEs Scale to GPT-2 Small

3 Feb 2024 6:50 UTC
77 points
4 comments8 min readLW link

Re­quire­ments for a Basin of At­trac­tion to Alignment

RogerDearnaley14 Feb 2024 7:10 UTC
38 points
10 comments31 min readLW link

De­bat­ing with More Per­sua­sive LLMs Leads to More Truth­ful Answers

7 Feb 2024 21:28 UTC
88 points
14 comments9 min readLW link
(arxiv.org)

What’s ChatGPT’s Fa­vorite Ice Cream Fla­vor? An In­ves­ti­ga­tion Into Syn­thetic Respondents

Greg Robison9 Feb 2024 18:38 UTC
19 points
4 comments15 min readLW link

The Last Laugh: Ex­plor­ing the Role of Hu­mor as a Bench­mark for Large Lan­guage Models

Greg Robison12 Feb 2024 18:34 UTC
4 points
5 comments11 min readLW link

Bias-Aug­mented Con­sis­tency Train­ing Re­duces Bi­ased Rea­son­ing in Chain-of-Thought

Miles Turpin11 Mar 2024 23:46 UTC
16 points
0 comments1 min readLW link
(arxiv.org)

[Question] What ex­per­i­ment set­tles the Gary Mar­cus vs Ge­offrey Hin­ton de­bate?

Valentin Baltadzhiev14 Feb 2024 9:06 UTC
12 points
8 comments1 min readLW link

[Question] Can any LLM be rep­re­sented as an Equa­tion?

Valentin Baltadzhiev14 Mar 2024 9:51 UTC
1 point
2 comments1 min readLW link

Can quan­tised au­toen­coders find and in­ter­pret cir­cuits in lan­guage mod­els?

charlieoneill24 Mar 2024 20:05 UTC
28 points
4 comments24 min readLW link

Phal­lo­cen­tric­ity in GPT-J’s bizarre strat­ified ontology

mwatkins17 Feb 2024 0:16 UTC
56 points
37 comments9 min readLW link

Many ar­gu­ments for AI x-risk are wrong

TurnTrout5 Mar 2024 2:31 UTC
165 points
86 comments12 min readLW link

[Paper] Lan­guage Models Don’t Learn the Phys­i­cal Man­i­fes­ta­tion of Language

22 Feb 2024 18:52 UTC
39 points
23 comments1 min readLW link
(arxiv.org)

The role of philo­soph­i­cal think­ing in un­der­stand­ing large lan­guage mod­els: Cal­ibrat­ing and clos­ing the gap be­tween first-per­son ex­pe­rience and un­der­ly­ing mechanisms

Bill Benzon23 Feb 2024 12:19 UTC
4 points
0 comments10 min readLW link

In­stru­men­tal de­cep­tion and ma­nipu­la­tion in LLMs—a case study

Olli Järviniemi24 Feb 2024 2:07 UTC
39 points
13 comments12 min readLW link

In­tro­duc­ing METR’s Au­ton­omy Eval­u­a­tion Resources

15 Mar 2024 23:16 UTC
90 points
0 comments1 min readLW link
(metr.github.io)

XAI re­leases Grok base model

Jacob G-W18 Mar 2024 0:47 UTC
11 points
3 comments1 min readLW link
(x.ai)

The In­for­ma­tion: OpenAI shows ‘Straw­berry’ to feds, races to launch it

Martín Soto27 Aug 2024 23:10 UTC
144 points
15 comments3 min readLW link

[Question] Could LLMs Help Gen­er­ate New Con­cepts in Hu­man Lan­guage?

Pekka Lampelto24 Mar 2024 20:13 UTC
10 points
4 comments2 min readLW link

Your LLM Judge may be biased

29 Mar 2024 16:39 UTC
37 points
9 comments6 min readLW link

De­cep­tion and Jailbreak Se­quence: 2. Iter­a­tive Refine­ment Stages of Jailbreaks in LLM

Winnie Yang28 Aug 2024 8:41 UTC
7 points
2 comments31 min readLW link

Lan­guage and Ca­pa­bil­ities: Test­ing LLM Math­e­mat­i­cal Abil­ities Across Languages

Ethan Edwards4 Apr 2024 13:18 UTC
24 points
2 comments36 min readLW link

End-to-end hack­ing with lan­guage models

tchauvin5 Apr 2024 15:06 UTC
29 points
0 comments8 min readLW link

[Question] Is LLM Trans­la­tion Without Rosetta Stone pos­si­ble?

cubefox11 Apr 2024 0:36 UTC
14 points
14 comments1 min readLW link

Is Wittgen­stein’s Lan­guage Game used when helping Ai un­der­stand lan­guage?

VisionaryHera4 Jun 2024 7:41 UTC
3 points
6 comments1 min readLW link

Can Large Lan­guage Models effec­tively iden­tify cy­ber­se­cu­rity risks?

emile delcourt30 Aug 2024 20:20 UTC
18 points
0 comments11 min readLW link

Claude wants to be conscious

Joe Kwon13 Apr 2024 1:40 UTC
2 points
8 comments6 min readLW link

[Question] Bar­cod­ing LLM Train­ing Data Sub­sets. Any­one try­ing this for in­ter­pretabil­ity?

right..enough?13 Apr 2024 3:09 UTC
7 points
0 comments7 min readLW link

Ex­per­i­ments with an al­ter­na­tive method to pro­mote spar­sity in sparse autoencoders

Eoin Farrell15 Apr 2024 18:21 UTC
29 points
7 comments12 min readLW link

Is This Lie De­tec­tor Really Just a Lie De­tec­tor? An In­ves­ti­ga­tion of LLM Probe Speci­fic­ity.

Josh Levy4 Jun 2024 15:45 UTC
38 points
0 comments17 min readLW link

In­duc­ing Un­prompted Misal­ign­ment in LLMs

19 Apr 2024 20:00 UTC
38 points
6 comments16 min readLW link

How LLMs Work, in the Style of The Economist

utilistrutil22 Apr 2024 19:06 UTC
0 points
0 comments2 min readLW link

At last! ChatGPT does, shall we say, in­ter­est­ing imi­ta­tions of “Kubla Khan”

Bill Benzon24 Apr 2024 14:56 UTC
−3 points
0 comments4 min readLW link

Re­dun­dant At­ten­tion Heads in Large Lan­guage Models For In Con­text Learning

skunnavakkam1 Sep 2024 20:08 UTC
7 points
1 comment4 min readLW link
(skunnavakkam.github.io)

LLMs seem (rel­a­tively) safe

JustisMills25 Apr 2024 22:13 UTC
53 points
24 comments7 min readLW link
(justismills.substack.com)

An in­ter­est­ing math­e­mat­i­cal model of how LLMs work

Bill Benzon30 Apr 2024 11:01 UTC
4 points
0 comments1 min readLW link

LLMs could be as con­scious as hu­man em­u­la­tions, potentially

weightt an30 Apr 2024 11:36 UTC
15 points
15 comments3 min readLW link

On pre­cise out-of-con­text steering

Olli Järviniemi3 May 2024 9:41 UTC
9 points
6 comments3 min readLW link

Re­la­tion­ships among words, met­al­in­gual defi­ni­tion, and interpretability

Bill Benzon7 Jun 2024 19:18 UTC
2 points
0 comments5 min readLW link

Un­cov­er­ing De­cep­tive Ten­den­cies in Lan­guage Models: A Si­mu­lated Com­pany AI Assistant

6 May 2024 7:07 UTC
95 points
13 comments1 min readLW link
(arxiv.org)

If lan­guage is for com­mu­ni­ca­tion, what does that im­ply about LLMs?

Bill Benzon12 May 2024 2:55 UTC
10 points
0 comments1 min readLW link

Lan­guage Models Model Us

eggsyntax17 May 2024 21:00 UTC
156 points
55 comments7 min readLW link

The In­tel­li­gent Meme Machine

Daniel DiSisto14 Jun 2024 14:26 UTC
1 point
0 comments6 min readLW link

Self-Con­trol of LLM Be­hav­iors by Com­press­ing Suffix Gra­di­ent into Pre­fix Controller

Henry Cai16 Jun 2024 13:01 UTC
7 points
0 comments7 min readLW link
(arxiv.org)

Lam­ini’s Tar­geted Hal­lu­ci­na­tion Re­duc­tion May Be a Big Deal for Job Automation

sweenesm18 Jun 2024 15:29 UTC
3 points
0 comments1 min readLW link

Con­nect­ing the Dots: LLMs can In­fer & Ver­bal­ize La­tent Struc­ture from Train­ing Data

21 Jun 2024 15:54 UTC
160 points
13 comments8 min readLW link
(arxiv.org)

LLM Gen­er­al­ity is a Timeline Crux

eggsyntax24 Jun 2024 12:52 UTC
217 points
119 comments7 min readLW link

Live The­ory Part 0: Tak­ing In­tel­li­gence Seriously

Sahil26 Jun 2024 21:37 UTC
94 points
3 comments8 min readLW link

Check­ing pub­lic figures on whether they “an­swered the ques­tion” quick anal­y­sis from Har­ris/​Trump de­bate, and a proposal

david reinstein11 Sep 2024 20:25 UTC
7 points
4 comments1 min readLW link
(open.substack.com)

Keep­ing con­tent out of LLM train­ing datasets

Ben Millwood18 Jul 2024 10:27 UTC
3 points
0 comments5 min readLW link

[Question] Should we ex­clude al­ign­ment re­search from LLM train­ing datasets?

Ben Millwood18 Jul 2024 10:27 UTC
1 point
1 comment1 min readLW link

SAEs (usu­ally) Trans­fer Between Base and Chat Models

18 Jul 2024 10:29 UTC
65 points
0 comments10 min readLW link

Truth is Univer­sal: Ro­bust De­tec­tion of Lies in LLMs

Lennart Buerger19 Jul 2024 14:07 UTC
24 points
3 comments2 min readLW link
(arxiv.org)

Effi­cient Dic­tionary Learn­ing with Switch Sparse Autoencoders

Anish Mudide22 Jul 2024 18:45 UTC
118 points
19 comments12 min readLW link

An ex­per­i­ment on hid­den cognition

Olli Järviniemi22 Jul 2024 3:26 UTC
25 points
2 comments7 min readLW link

Does ro­bust­ness im­prove with scale?

25 Jul 2024 20:55 UTC
14 points
0 comments1 min readLW link
(far.ai)

[Question] How to­k­eniza­tion in­fluences prompt­ing?

Boris Kashirin29 Jul 2024 10:28 UTC
9 points
4 comments1 min readLW link

In­ves­ti­gat­ing the Abil­ity of LLMs to Rec­og­nize Their Own Writing

30 Jul 2024 15:41 UTC
32 points
0 comments15 min readLW link

Open Source Au­to­mated In­ter­pretabil­ity for Sparse Au­toen­coder Features

30 Jul 2024 21:11 UTC
67 points
1 comment13 min readLW link
(blog.eleuther.ai)

Using ideologically-charged language to get gpt-3.5-turbo to disobey it's system prompt: a demo

Milan W · 24 Aug 2024 0:13 UTC
2 points
0 comments · 6 min read · LW link

LLMs stifle creativity, eliminate opportunities for serendipitous discovery and disrupt intergenerational transfer of wisdom

Ghdz · 5 Aug 2024 18:27 UTC
6 points
2 comments · 7 min read · LW link

GPT-2 Sometimes Fails at IOI

Ronak_Mehta · 14 Aug 2024 23:24 UTC
13 points
0 comments · 2 min read · LW link
(ronakrm.github.io)

Toward a Human Hybrid Language for Enhanced Human-Machine Communication: Addressing the AI Alignment Problem

Andndn Dheudnd · 14 Aug 2024 22:19 UTC
−6 points
2 comments · 4 min read · LW link

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

25 Sep 2024 14:52 UTC
30 points
2 comments · 4 min read · LW link
(arxiv.org)

Characterizing stable regions in the residual stream of LLMs

26 Sep 2024 13:44 UTC
38 points
4 comments · 1 min read · LW link
(arxiv.org)

Self location for LLMs by LLMs: Self-Assessment Checklist.

weightt an · 26 Sep 2024 19:57 UTC
11 points
0 comments · 5 min read · LW link

The Geometry of Feelings and Nonsense in Large Language Models

27 Sep 2024 17:49 UTC
58 points
10 comments · 4 min read · LW link

Avoiding jailbreaks by discouraging their representation in activation space

Guido Bergman · 27 Sep 2024 17:49 UTC
6 points
2 comments · 9 min read · LW link

Two new datasets for evaluating political sycophancy in LLMs

alma.liezenga · 28 Sep 2024 18:29 UTC
8 points
0 comments · 9 min read · LW link

Evaluating LLaMA 3 for political sycophancy

alma.liezenga · 28 Sep 2024 19:02 UTC
2 points
2 comments · 6 min read · LW link

Base LLMs refuse too

29 Sep 2024 16:04 UTC
60 points
20 comments · 10 min read · LW link

In-Context Learning: An Alignment Survey

alamerton · 30 Sep 2024 18:44 UTC
8 points
0 comments · 20 min read · LW link
(docs.google.com)

Biasing VLM Response with Visual Stimuli

Jaehyuk Lim · 3 Oct 2024 18:04 UTC
5 points
0 comments · 8 min read · LW link

Hamiltonian Dynamics in AI: A Novel Approach to Optimizing Reasoning in Language Models

Javier Marin Valenzuela · 9 Oct 2024 19:14 UTC
3 points
0 comments · 10 min read · LW link

Improving Model-Written Evals for AI Safety Benchmarking

15 Oct 2024 18:25 UTC
25 points
0 comments · 18 min read · LW link

[Question] Reinforcement Learning: Essential Step Towards AGI or Irrelevant?

Double · 17 Oct 2024 3:37 UTC
1 point
0 comments · 1 min read · LW link

Jailbreaking ChatGPT and Claude using Web API Context Injection

Jaehyuk Lim · 21 Oct 2024 21:34 UTC
4 points
0 comments · 3 min read · LW link

Meta AI (FAIR) latest paper integrates system-1 and system-2 thinking into reasoning models.

happy friday · 24 Oct 2024 16:54 UTC
8 points
0 comments · 1 min read · LW link

Retrieval Augmented Genesis

João Ribeiro Medeiros · 1 Oct 2024 20:18 UTC
6 points
0 comments · 29 min read · LW link

Retrieval Augmented Genesis II — Holy Texts Semantics Analysis

João Ribeiro Medeiros · 26 Oct 2024 17:00 UTC
−1 points
0 comments · 11 min read · LW link

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

27 Oct 2024 18:46 UTC
38 points
4 comments · 5 min read · LW link

Educational CAI: Aligning a Language Model with Pedagogical Theories

Bharath Puranam · 1 Nov 2024 18:55 UTC
5 points
1 comment · 13 min read · LW link

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

1 Nov 2024 0:10 UTC
17 points
0 comments · 6 min read · LW link
(far.ai)

Current safety training techniques do not fully transfer to the agent setting

3 Nov 2024 19:24 UTC
147 points
8 comments · 5 min read · LW link

SAEs are highly dataset dependent: a case study on the refusal direction

7 Nov 2024 5:22 UTC
62 points
4 comments · 14 min read · LW link

Analyzing how SAE features evolve across a forward pass

7 Nov 2024 22:07 UTC
44 points
0 comments · 1 min read · LW link
(arxiv.org)

LLMs Look Increasingly Like General Reasoners

eggsyntax · 8 Nov 2024 23:47 UTC
90 points
45 comments · 3 min read · LW link

Sparks of Consciousness

Charlie Sanders · 13 Nov 2024 4:58 UTC
2 points
0 comments · 3 min read · LW link
(www.dailymicrofiction.com)

Which AI Safety Benchmark Do We Need Most in 2025?

17 Nov 2024 23:50 UTC
2 points
2 comments · 6 min read · LW link

[Question] Why is Gemini telling the user to die?

Burny · 18 Nov 2024 1:44 UTC
13 points
1 comment · 1 min read · LW link

Emergent Analogical Reasoning in Large Language Models

Roman Leventov · 22 Mar 2023 5:18 UTC
13 points
2 comments · 1 min read · LW link
(arxiv.org)

Does GPT-4 exhibit agency when summarizing articles?

Christopher King · 24 Mar 2023 15:49 UTC
16 points
2 comments · 5 min read · LW link

More experiments in GPT-4 agency: writing memos

Christopher King · 24 Mar 2023 17:51 UTC
5 points
2 comments · 10 min read · LW link

GPT-4 aligning with acasual decision theory when instructed to play games, but includes a CDT explanation that’s incorrect if they differ

Christopher King · 23 Mar 2023 16:16 UTC
7 points
4 comments · 8 min read · LW link

Hutter-Prize for Prompts

rokosbasilisk · 24 Mar 2023 21:26 UTC
5 points
10 comments · 1 min read · LW link

If it quacks like a duck...

RationalMindset · 26 Mar 2023 18:54 UTC
−4 points
0 comments · 4 min read · LW link

Chronostasis: The Time-Capsule Conundrum of Language Models

RationalMindset · 26 Mar 2023 18:54 UTC
−5 points
0 comments · 1 min read · LW link

Sentience in Machines—How Do We Test for This Objectively?

Mayowa Osibodu · 26 Mar 2023 18:56 UTC
−2 points
0 comments · 2 min read · LW link
(www.researchgate.net)

the tensor is a lonely place

jml6 · 27 Mar 2023 18:22 UTC
−11 points
0 comments · 4 min read · LW link
(ekjsgrjelrbno.substack.com)

CAIS-inspired approach towards safer and more interpretable AGIs

Peter Hroššo · 27 Mar 2023 14:36 UTC
13 points
7 comments · 1 min read · LW link

GPT-4 is bad at strategic thinking

Christopher King · 27 Mar 2023 15:11 UTC
22 points
8 comments · 1 min read · LW link

The Prospect of an AI Winter

Erich_Grunewald · 27 Mar 2023 20:55 UTC
62 points
24 comments · 15 min read · LW link
(www.erichgrunewald.com)

Adapting to Change: Overcoming Chronostasis in AI Language Models

RationalMindset · 28 Mar 2023 14:32 UTC
−1 points
0 comments · 6 min read · LW link

Why I Think the Current Trajectory of AI Research has Low P(doom) - LLMs

GaPa · 1 Apr 2023 20:35 UTC
2 points
1 comment · 10 min read · LW link

The Quantization Model of Neural Scaling

nz · 31 Mar 2023 16:02 UTC
17 points
0 comments · 1 min read · LW link
(arxiv.org)

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2

Christopher King · 31 Mar 2023 17:05 UTC
6 points
4 comments · 4 min read · LW link

Imagine a world where Microsoft employees used Bing

Christopher King · 31 Mar 2023 18:36 UTC
6 points
2 comments · 2 min read · LW link

AI Safety via Luck

Jozdien · 1 Apr 2023 20:13 UTC
81 points
7 comments · 11 min read · LW link

[Question] Where to begin in ML/AI?

Jake the Student · 6 Apr 2023 20:45 UTC
9 points
4 comments · 1 min read · LW link

Contra LeCun on “Autoregressive LLMs are doomed”

rotatingpaguro · 10 Apr 2023 4:05 UTC
20 points
20 comments · 8 min read · LW link

LW is probably not the place for “I asked this LLM (x) and here’s what it said!”, but where is?

lillybaeum · 12 Apr 2023 10:12 UTC
21 points
3 comments · 1 min read · LW link

[Question] Goals of model vs. goals of simulacra?

dr_s · 12 Apr 2023 13:02 UTC
5 points
7 comments · 1 min read · LW link

Natural language alignment

Jacy Reese Anthis · 12 Apr 2023 19:02 UTC
31 points
2 comments · 2 min read · LW link

Was Homer a stochastic parrot? Meaning in literary texts and LLMs

Bill Benzon · 13 Apr 2023 16:44 UTC
7 points
4 comments · 3 min read · LW link

LLMs and hallucination, like white on rice?

Bill Benzon · 14 Apr 2023 19:53 UTC
5 points
0 comments · 3 min read · LW link

The Soul of the Writer (on LLMs, the psychology of writers, and the nature of intelligence)

rogersbacon · 16 Apr 2023 16:02 UTC
11 points
1 comment · 3 min read · LW link
(www.secretorum.life)

No, really, it predicts next tokens.

simon · 18 Apr 2023 3:47 UTC
58 points
55 comments · 3 min read · LW link

An alternative of PPO towards alignment

ml hkust · 17 Apr 2023 17:58 UTC
2 points
2 comments · 4 min read · LW link

A poem written by a fancy autocomplete

Christopher King · 20 Apr 2023 2:31 UTC
1 point
0 comments · 1 min read · LW link

Proposal: Using Monte Carlo tree search instead of RLHF for alignment research

Christopher King · 20 Apr 2023 19:57 UTC
2 points
7 comments · 3 min read · LW link

Readability is mostly a waste of characters

vlad.proex · 21 Apr 2023 22:05 UTC
21 points
7 comments · 3 min read · LW link

[Question] Could transformer network models learn motor planning like they can learn language and image generation?

mu_(negative) · 23 Apr 2023 17:24 UTC
2 points
4 comments · 1 min read · LW link

A response to Conjecture’s CoEm proposal

Kristian Freed · 24 Apr 2023 17:23 UTC
7 points
0 comments · 4 min read · LW link

Implementing a Transformer from scratch in PyTorch—a write-up on my experience

Mislav Jurić · 25 Apr 2023 20:51 UTC
20 points
0 comments · 10 min read · LW link

LM Situational Awareness, Evaluation Proposal: Violating Imitation

Jacob Pfau · 26 Apr 2023 22:53 UTC
16 points
2 comments · 2 min read · LW link

Machine Unlearning Evaluations as Interpretability Benchmarks

23 Oct 2023 16:33 UTC
33 points
2 comments · 11 min read · LW link

Compositional preference models for aligning LMs

Tomek Korbak · 25 Oct 2023 12:17 UTC
18 points
2 comments · 5 min read · LW link

Robustness of Contrast-Consistent Search to Adversarial Prompting

1 Nov 2023 12:46 UTC
18 points
1 comment · 7 min read · LW link

ChatGPT’s Ontological Landscape

Bill Benzon · 1 Nov 2023 15:12 UTC
7 points
0 comments · 4 min read · LW link

What are the limits of superintelligence?

rainy · 27 Apr 2023 18:29 UTC
4 points
3 comments · 5 min read · LW link

Preface to the Sequence on LLM Psychology

Quentin FEUILLADE--MONTIXI · 7 Nov 2023 16:12 UTC
32 points
0 comments · 2 min read · LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

7 Nov 2023 17:59 UTC
36 points
2 comments · 2 min read · LW link
(arxiv.org)

What’s going on? LLMs and IS-A sentences

Bill Benzon · 8 Nov 2023 16:58 UTC
6 points
15 comments · 4 min read · LW link

LLMs and computation complexity

Jonathan Marcus · 28 Apr 2023 17:48 UTC
57 points
29 comments · 5 min read · LW link

Polysemantic Attention Head in a 4-Layer Transformer

9 Nov 2023 16:16 UTC
51 points
0 comments · 6 min read · LW link

Classifying representations of sparse autoencoders (SAEs)

Annah · 17 Nov 2023 13:54 UTC
15 points
6 comments · 2 min read · LW link

AISC Project: Modelling Trajectories of Language Models

NickyP · 13 Nov 2023 14:33 UTC
27 points
0 comments · 12 min read · LW link

Is Interpretability All We Need?

RogerDearnaley · 14 Nov 2023 5:31 UTC
1 point
1 comment · 1 min read · LW link

LLMs May Find It Hard to FOOM

RogerDearnaley · 15 Nov 2023 2:52 UTC
11 points
30 comments · 12 min read · LW link

A conceptual precursor to today’s language machines [Shannon]

Bill Benzon · 15 Nov 2023 13:50 UTC
24 points
6 comments · 2 min read · LW link

AISC project: TinyEvals

Jett Janiak · 22 Nov 2023 20:47 UTC
22 points
0 comments · 4 min read · LW link

The Method of Loci: With some brief remarks, including transformers and evaluating AIs

Bill Benzon · 2 Dec 2023 14:36 UTC
6 points
0 comments · 3 min read · LW link

Interview with Vanessa Kosoy on the Value of Theoretical Research for AI

WillPetillo · 4 Dec 2023 22:58 UTC
37 points
0 comments · 35 min read · LW link

LLM keys—A Proposal of a Solution to Prompt Injection Attacks

Peter Hroššo · 7 Dec 2023 17:36 UTC
1 point
2 comments · 1 min read · LW link

A Search for More ChatGPT / GPT-3.5 / GPT-4 “Unspeakable” Glitch Tokens

Martin Fell · 9 May 2023 14:36 UTC
26 points
9 comments · 6 min read · LW link

LLM cognition is probably not human-like

Max H · 8 May 2023 1:22 UTC
26 points
15 comments · 7 min read · LW link

Language models can explain neurons in language models

nz · 9 May 2023 17:29 UTC
23 points
0 comments · 1 min read · LW link
(openai.com)

Data and “tokens” a 30 year old human “trains” on

Jose Miguel Cruz y Celis · 23 May 2023 5:34 UTC
15 points
15 comments · 1 min read · LW link

PCAST Working Group on Generative AI Invites Public Input

Christopher King · 13 May 2023 22:49 UTC
7 points
0 comments · 1 min read · LW link
(terrytao.wordpress.com)

My current workflow to study the internal mechanisms of LLM

Yulu Pi · 16 May 2023 15:27 UTC
4 points
0 comments · 1 min read · LW link

The Compleat Cybornaut

19 May 2023 8:44 UTC
64 points
2 comments · 16 min read · LW link

Seeing Ghosts by GPT-4

Christopher King · 20 May 2023 0:11 UTC
−13 points
0 comments · 1 min read · LW link

Transformer Architecture Choice for Resisting Prompt Injection and Jail-Breaking Attacks

RogerDearnaley · 21 May 2023 8:29 UTC
9 points
1 comment · 4 min read · LW link

Microsoft and Google using LLMs for Cybersecurity

Phosphorous · 18 May 2023 17:42 UTC
6 points
0 comments · 5 min read · LW link

Programming AGI is impossible

Áron Ecsenyi · 30 May 2023 23:05 UTC
1 point
0 comments · 4 min read · LW link

Aligning an H-JEPA agent via training on the outputs of an LLM-based “exemplary actor”

Roman Leventov · 29 May 2023 11:08 UTC
12 points
10 comments · 30 min read · LW link

An LLM-based “exemplary actor”

Roman Leventov · 29 May 2023 11:12 UTC
16 points
0 comments · 12 min read · LW link

Open Source LLMs Can Now Actively Lie

Josh Levy · 1 Jun 2023 22:03 UTC
6 points
0 comments · 3 min read · LW link

Unfaithful Explanations in Chain-of-Thought Prompting

Miles Turpin · 3 Jun 2023 0:22 UTC
38 points
8 comments · 7 min read · LW link

[Linkpost] Large Language Models Converge on Brain-Like Word Representations

Bogdan Ionut Cirstea · 11 Jun 2023 11:20 UTC
36 points
12 comments · 1 min read · LW link

[Linkpost] Scaling laws for language encoding models in fMRI

Bogdan Ionut Cirstea · 8 Jun 2023 10:52 UTC
30 points
0 comments · 1 min read · LW link

[Linkpost] Faith and Fate: Limits of Transformers on Compositionality

Joe Kwon · 16 Jun 2023 15:04 UTC
19 points
4 comments · 1 min read · LW link
(arxiv.org)

[Linkpost] Mapping Brains with Language Models: A Survey

Bogdan Ionut Cirstea · 16 Jun 2023 9:49 UTC
5 points
0 comments · 1 min read · LW link

OpenAI introduces function calling for GPT-4

20 Jun 2023 1:58 UTC
24 points
3 comments · 4 min read · LW link
(openai.com)

Elements of Computational Philosophy, Vol. I: Truth

1 Jul 2023 11:44 UTC
12 points
6 comments · 1 min read · LW link
(compphil.github.io)

[Linkpost] A shared linguistic space for transmitting our thoughts from brain to brain in natural conversations

Bogdan Ionut Cirstea · 1 Jul 2023 13:57 UTC
17 points
2 comments · 1 min read · LW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF

Christopher King · 29 Jun 2023 16:56 UTC
7 points
0 comments · 2 min read · LW link

The world where LLMs are possible

Ape in the coat · 10 Jul 2023 8:00 UTC
20 points
10 comments · 3 min read · LW link