
Language Models (LLMs)

Last edit: Mar 13, 2025, 5:45 PM by Raemon

Language models are computer programs that estimate how likely a piece of text is. “Hello, how are you?” is likely; “Hello, fnarg horses” is unlikely.
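This idea can be sketched with a toy bigram model. Real LLMs use deep neural networks rather than word-pair counts, and the corpus here is invented for illustration, but the core operation is the same: assign a likelihood to a piece of text.

```python
import math
from collections import Counter

# A toy bigram language model: estimate how likely a word sequence is
# from word-pair counts in a tiny made-up "corpus". Illustrative sketch
# only -- real LLMs learn these probabilities with deep networks.
corpus = "hello how are you . hello how are things . how are you today .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def log_likelihood(text, alpha=0.1):
    """Smoothed log-probability of a word sequence under the bigram model."""
    words = text.lower().split()
    vocab_size = len(unigrams)
    total = 0.0
    for prev, word in zip(words, words[1:]):
        numerator = bigrams[(prev, word)] + alpha        # add-alpha smoothing
        denominator = unigrams[prev] + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total

# Familiar text scores higher (closer to 0) than gibberish:
print(log_likelihood("hello how are you"))
print(log_likelihood("hello fnarg horses"))
```

The smoothing constant `alpha` keeps unseen word pairs from getting probability zero, so even gibberish gets a finite (but much lower) score.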

Language models can answer questions by estimating the likelihood of possible question-and-answer pairs and selecting the most likely one. “Q: How are you? A: Very well, thank you” is a likely question-and-answer pair; “Q: How are you? A: Correct horse battery staple” is an unlikely one.
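Answering by likelihood ranking can be sketched in a few lines. The “model” below is just smoothed word frequencies from a tiny invented corpus, standing in for the far better probability estimates a real LLM would supply; the point is only the selection step, scoring candidate pairs and keeping the most likely.

```python
import math
from collections import Counter

# Toy sketch: score each candidate question-and-answer pair and keep
# the most likely one. Word frequencies from a tiny made-up corpus
# stand in for a real model's probability estimates.
corpus = "q how are you a very well thank you".split()
counts = Counter(corpus)
total, vocab_size = len(corpus), len(counts)

def score(text, alpha=0.1):
    """Smoothed unigram log-probability of a text (punctuation stripped)."""
    words = text.lower().replace(":", "").replace(",", "").replace("?", "").split()
    return sum(math.log((counts[w] + alpha) / (total + alpha * vocab_size))
               for w in words)

candidates = [
    "Q: How are you? A: Very well, thank you",
    "Q: How are you? A: Correct horse battery staple",
]
best = max(candidates, key=score)
print(best)  # the sensible answer pair scores highest
```

Words absent from the corpus (“correct”, “horse”, …) drag a candidate's score down, so the sensible answer wins.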

The language models most relevant to AI safety are based on “deep learning”. Deep-learning-based language models can be “trained” to model language better by exposing them to text written by humans, and the internet supplies vast amounts of such training material.

Deep-learning-based language models are getting bigger and better trained. As they become stronger, they acquire new skills, including arithmetic, explaining jokes, programming, and solving math problems.

As these models grow larger and better trained, there is a risk that they develop dangerous capabilities. What additional skills will they acquire in a few years?

See also

Simulators

janusSep 2, 2022, 12:45 PM
631 points
168 comments41 min readLW link8 reviews
(generative.ink)

How LLMs are and are not myopic

janusJul 25, 2023, 2:19 AM
134 points
16 comments8 min readLW link

How it feels to have your mind hacked by an AI

blakedJan 12, 2023, 12:33 AM
362 points
222 comments17 min readLW link

Inverse Scaling Prize: Round 1 Winners

Sep 26, 2022, 7:57 PM
93 points
16 comments4 min readLW link
(irmckenzie.co.uk)

Alignment Implications of LLM Successes: a Debate in One Act

Zack_M_DavisOct 21, 2023, 3:22 PM
258 points
55 comments13 min readLW link2 reviews

On the future of language models

owencbDec 20, 2023, 4:58 PM
105 points
17 comments1 min readLW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaleyNov 28, 2023, 7:56 PM
64 points
30 comments11 min readLW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaleyJul 6, 2024, 1:23 AM
60 points
39 comments24 min readLW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor

RogerDearnaleyJan 9, 2024, 8:42 PM
47 points
8 comments36 min readLW link

A Chinese Room Containing a Stack of Stochastic Parrots

RogerDearnaleyJan 12, 2024, 6:29 AM
20 points
3 comments5 min readLW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaleyFeb 1, 2024, 9:15 PM
15 points
15 comments13 min readLW link

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?

RogerDearnaleyJan 11, 2024, 12:56 PM
35 points
4 comments39 min readLW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaleyJan 5, 2024, 8:46 AM
37 points
4 comments2 min readLW link

Transformer Circuits

evhubDec 22, 2021, 9:09 PM
144 points
4 comments3 min readLW link
(transformer-circuits.pub)

The Waluigi Effect (mega-post)

Cleo NardoMar 3, 2023, 3:22 AM
627 points
188 comments16 min readLW link

Representation Tuning

Christopher AckermanJun 27, 2024, 5:44 PM
35 points
9 comments13 min readLW link

AI Safety Chatbot

Dec 21, 2023, 2:06 PM
61 points
11 comments4 min readLW link

Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment

RogerDearnaleyDec 7, 2023, 6:14 AM
9 points
0 comments11 min readLW link

SolidGoldMagikarp (plus, prompt generation)

Feb 5, 2023, 10:02 PM
681 points
206 comments12 min readLW link1 review

Programming Refusal with Conditional Activation Steering

Bruce W. LeeSep 11, 2024, 8:57 PM
41 points
0 comments11 min readLW link
(brucewlee.com)

Large Language Models will be Great for Censorship

Ethan EdwardsAug 21, 2023, 7:03 PM
183 points
14 comments8 min readLW link
(ethanedwards.substack.com)

Testing PaLM prompts on GPT3

YitzApr 6, 2022, 5:21 AM
103 points
14 comments8 min readLW link

Results from the language model hackathon

Esben KranOct 10, 2022, 8:29 AM
22 points
1 comment4 min readLW link

Applying refusal-vector ablation to a Llama 3 70B agent

Simon LermenMay 11, 2024, 12:08 AM
51 points
14 comments7 min readLW link

Inverse Scaling Prize: Second Round Winners

Jan 24, 2023, 8:12 PM
58 points
17 comments15 min readLW link

Self-fulfilling misalignment data might be poisoning our AI models

TurnTroutMar 2, 2025, 7:51 PM
149 points
22 comments1 min readLW link
(turntrout.com)

LLMs may capture key components of human agency

catubcNov 17, 2022, 8:14 PM
27 points
0 comments4 min readLW link

Invocations: The Other Capabilities Overhang?

Robert_AIZIApr 4, 2023, 1:38 PM
29 points
4 comments4 min readLW link
(aizi.substack.com)

Modulating sycophancy in an RLHF model via activation steering

Nina PanicksseryAug 9, 2023, 7:06 AM
69 points
20 comments12 min readLW link

LLMs Universally Learn a Feature Representing Token Frequency / Rarity

Sean OsierJun 30, 2024, 2:48 AM
12 points
5 comments6 min readLW link
(github.com)

Truthful LMs as a warm-up for aligned AGI

Jacob_HiltonJan 17, 2022, 4:49 PM
65 points
14 comments13 min readLW link

LLM Modularity: The Separability of Capabilities in Large Language Models

NickyPMar 26, 2023, 9:57 PM
99 points
3 comments41 min readLW link

Finding Neurons in a Haystack: Case Studies with Sparse Probing

May 3, 2023, 1:30 PM
33 points
6 comments2 min readLW link1 review
(arxiv.org)

[Question] If I ask an LLM to think step by step, how big are the steps?

ryan_bSep 13, 2024, 8:30 PM
7 points
1 comment1 min readLW link

Understanding the tensor product formulation in Transformer Circuits

Tom LieberumDec 24, 2021, 6:05 PM
16 points
2 comments3 min readLW link

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

Jul 8, 2024, 10:24 PM
109 points
37 comments5 min readLW link

UC Berkeley course on LLMs and ML Safety

Dan HJul 9, 2024, 3:40 PM
36 points
1 comment1 min readLW link
(rdi.berkeley.edu)

Knowledge, Reasoning, and Superintelligence

owencbMar 26, 2025, 11:28 PM
7 points
0 comments7 min readLW link
(strangecities.substack.com)

A one-question Turing test for GPT-3

Jan 22, 2022, 6:17 PM
85 points
25 comments5 min readLW link

Seeing Through the Eyes of the Algorithm

silentbobFeb 22, 2025, 11:54 AM
17 points
1 comment10 min readLW link

Towards Evaluating AI Systems for Moral Status Using Self-Reports

Nov 16, 2023, 8:18 PM
45 points
3 comments1 min readLW link
(arxiv.org)

Role embeddings: making authorship more salient to LLMs

Jan 7, 2025, 8:13 PM
50 points
0 comments8 min readLW link

LLMs as a Planning Overhang

LarksJul 14, 2024, 2:54 AM
38 points
8 comments2 min readLW link

How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles & parrots

Owain_EvansMar 28, 2024, 2:34 AM
26 points
0 comments9 min readLW link

New, improved multiple-choice TruthfulQA

Jan 15, 2025, 11:32 PM
72 points
0 comments3 min readLW link

[ASoT] Some thoughts about LM monologue limitations and ELK

leogaoMar 30, 2022, 2:26 PM
10 points
0 comments2 min readLW link

Procedurally evaluating factual accuracy: a request for research

Jacob_HiltonMar 30, 2022, 4:37 PM
25 points
2 comments6 min readLW link

[Link] Training Compute-Optimal Large Language Models

nostalgebraistMar 31, 2022, 6:01 PM
51 points
23 comments1 min readLW link
(arxiv.org)

Inflection AI: New startup related to language models

NisanApr 2, 2022, 5:35 AM
21 points
1 comment1 min readLW link

New Scaling Laws for Large Language Models

1a3ornApr 1, 2022, 8:41 PM
246 points
22 comments5 min readLW link

Extrapolating from Five Words

Gordon Seidoh WorleyNov 15, 2023, 11:21 PM
40 points
11 comments2 min readLW link

Linear encoding of character-level information in GPT-J token embeddings

Nov 10, 2023, 10:19 PM
34 points
4 comments28 min readLW link

How to train your transformer

p.b.Apr 7, 2022, 9:34 AM
6 points
0 comments8 min readLW link

Enhancing biosecurity with language models: defining research directions

micMar 26, 2024, 12:30 PM
12 points
0 comments1 min readLW link
(papers.ssrn.com)

Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses

TurnTroutJan 16, 2025, 2:14 AM
64 points
3 comments1 min readLW link
(turntrout.com)

Language Model Tools for Alignment Research

Logan RiggsApr 8, 2022, 5:32 PM
28 points
0 comments2 min readLW link

Numberwang: LLMs Doing Autonomous Research, and a Call for Input

Jan 16, 2025, 5:20 PM
70 points
30 comments31 min readLW link

AMA Conjecture, A New Alignment Startup

adamShimiApr 9, 2022, 9:43 AM
47 points
42 comments1 min readLW link

The “Reversal Curse”: you still aren’t anthropomorphising enough.

lumpenspaceMar 13, 2025, 10:24 AM
3 points
0 comments1 min readLW link
(lumpenspace.substack.com)

[Linkpost] New multi-modal Deepmind model fusing Chinchilla with images and videos

p.b.Apr 30, 2022, 3:47 AM
53 points
18 comments1 min readLW link

Claude 3 claims it’s con­scious, doesn’t want to die or be modified

Mikhail SaminMar 4, 2024, 11:05 PM
79 points
116 comments14 min readLW link

Paper: Teaching GPT3 to express uncertainty in words

Owain_EvansMay 31, 2022, 1:27 PM
97 points
7 comments4 min readLW link

Bootstrapping Language Models

harsimonyMay 27, 2022, 7:43 PM
7 points
5 comments2 min readLW link

Studying The Alien Mind

Dec 5, 2023, 5:27 PM
80 points
10 comments15 min readLW link

Tell me about yourself: LLMs are aware of their learned behaviors

Jan 22, 2025, 12:47 AM
129 points
5 comments6 min readLW link

Testing for parallel reasoning in LLMs

May 19, 2024, 3:28 PM
9 points
7 comments9 min readLW link

Language Models are a Potentially Safe Path to Human-Level AGI

Nadav BrandesApr 20, 2023, 12:40 AM
28 points
7 comments8 min readLW link1 review

Lamda is not an LLM

KevinJun 19, 2022, 11:13 AM
7 points
10 comments1 min readLW link
(www.wired.com)

Residual stream norms grow exponentially over the forward pass

May 7, 2023, 12:46 AM
77 points
24 comments11 min readLW link

Conditioning Generative Models

Adam JermynJun 25, 2022, 10:15 PM
24 points
18 comments10 min readLW link

Implementing activation steering

AnnahFeb 5, 2024, 5:51 PM
73 points
8 comments7 min readLW link

Anthropic release Claude 3, claims >GPT-4 Performance

LawrenceCMar 4, 2024, 6:23 PM
115 points
41 comments2 min readLW link
(www.anthropic.com)

Assessing AlephAlphas Multimodal Model

p.b.Jun 28, 2022, 9:28 AM
30 points
5 comments3 min readLW link

New OpenAI Paper—Language models can explain neurons in language models

MrThinkMay 10, 2023, 7:46 AM
47 points
14 comments1 min readLW link

[Linkpost] Solving Quantitative Reasoning Problems with Language Models

YitzJun 30, 2022, 6:58 PM
76 points
15 comments2 min readLW link
(storage.googleapis.com)

Minerva

AlgonJul 1, 2022, 8:06 PM
36 points
6 comments2 min readLW link
(ai.googleblog.com)

Deep learning curriculum for large language model alignment

Jacob_HiltonJul 13, 2022, 9:58 PM
57 points
3 comments1 min readLW link
(github.com)

Conditioning Generative Models for Alignment

JozdienJul 18, 2022, 7:11 AM
59 points
8 comments20 min readLW link

AGI will be made of heterogeneous components, Transformer and Selective SSM blocks will be among them

Roman LeventovDec 27, 2023, 2:51 PM
33 points
9 comments4 min readLW link

[Question] Impact of ” ‘Let’s think step by step’ is all you need”?

yrimonJul 24, 2022, 8:59 PM
20 points
2 comments1 min readLW link

chinchilla’s wild implications

nostalgebraistJul 31, 2022, 1:18 AM
424 points
128 comments10 min readLW link1 review

LLM Guardrails Should Have Better Customer Service Tuning

Jiao BuMay 13, 2023, 10:54 PM
2 points
0 comments2 min readLW link

Emergent Abilities of Large Language Models [Linkpost]

aogAug 10, 2022, 6:02 PM
25 points
2 comments1 min readLW link
(arxiv.org)

Language models seem to be much better than humans at next-token prediction

Aug 11, 2022, 5:45 PM
182 points
60 comments13 min readLW link1 review

A little playing around with Blenderbot3

Nathan Helm-BurgerAug 12, 2022, 4:06 PM
9 points
0 comments1 min readLW link

[Question] Is there a ‘time series forecasting’ equivalent of AIXI?

Solenoid_EntityMay 17, 2023, 4:35 AM
12 points
2 comments1 min readLW link

[Question] Are language models close to the superhuman level in philosophy?

Roman LeventovAug 19, 2022, 4:43 AM
6 points
2 comments2 min readLW link

Research Discussion on PSCA with Claude Sonnet 3.5

Robert KralischJul 24, 2024, 4:53 PM
−2 points
0 comments25 min readLW link

A Test for Lan­guage Model Consciousness

Ethan PerezAug 25, 2022, 7:41 PM
18 points
14 comments9 min readLW link

Strategy For Conditioning Generative Models

Sep 1, 2022, 4:34 AM
31 points
4 comments18 min readLW link

And All the Shoggoths Merely Players

Zack_M_DavisFeb 10, 2024, 7:56 PM
170 points
57 comments12 min readLW link

Proof-of-Concept Debugger for a Small LLM

Mar 17, 2025, 10:27 PM
20 points
0 comments11 min readLW link

AlexaTM − 20 Billion Parameter Model With Impressive Performance

MrThinkSep 9, 2022, 9:46 PM
5 points
0 comments1 min readLW link

Sparse trinary weighted RNNs as a path to better language model interpretability

Am8ryllisSep 17, 2022, 7:48 PM
19 points
13 comments3 min readLW link

“AI achieves silver-medal standard solving International Mathematical Olympiad problems”

gjmJul 25, 2024, 3:58 PM
133 points
38 comments2 min readLW link
(deepmind.google)

Why I Believe LLMs Do Not Have Human-like Emotions

OneManyNoneMay 22, 2023, 3:46 PM
13 points
6 comments7 min readLW link

[Question] Supposing the 1bit LLM paper pans out

O OFeb 29, 2024, 5:31 AM
27 points
11 comments1 min readLW link

PaLM-2 & GPT-4 in “Extrapolating GPT-N performance”

Lukas FinnvedenMay 30, 2023, 6:33 PM
57 points
6 comments6 min readLW link

LIMA: Less Is More for Alignment

Ulisse MiniMay 30, 2023, 5:10 PM
16 points
6 comments1 min readLW link
(arxiv.org)

Claude 3 Opus can operate as a Turing machine

Gunnar_ZarnckeApr 17, 2024, 8:41 AM
36 points
2 comments1 min readLW link
(twitter.com)

An examination of GPT-2’s boring yet effective glitch

MiguelDevApr 18, 2024, 5:26 AM
5 points
3 comments3 min readLW link

Why did ChatGPT say that? Prompt engineering and more, with PIZZA.

Jessica RumbelowAug 3, 2024, 12:07 PM
41 points
2 comments4 min readLW link

“LLMs Don’t Have a Coherent Model of the World”—What it Means, Why it Matters

DavidmanheimJun 1, 2023, 7:46 AM
32 points
2 comments7 min readLW link

I didn’t think I’d take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!

mako yassAug 2, 2024, 10:35 PM
24 points
2 comments5 min readLW link

What’s up with all the non-Mormons? Weirdly specific universalities across LLMs

mwatkinsApr 19, 2024, 1:43 PM
40 points
13 comments27 min readLW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

Dec 5, 2022, 8:28 PM
40 points
19 comments10 min readLW link

LEAst-squares Concept Erasure (LEACE)

tricky_labyrinthJun 7, 2023, 9:51 PM
68 points
10 comments1 min readLW link
(twitter.com)

Did ChatGPT just gaslight me?

TW123Dec 1, 2022, 5:41 AM
123 points
45 comments9 min readLW link
(aiwatchtower.substack.com)

What o3 Becomes by 2028

Vladimir_NesovDec 22, 2024, 12:37 PM
143 points
15 comments5 min readLW link

MetaAI: less is less for alignment.

Cleo NardoJun 13, 2023, 2:08 PM
71 points
17 comments5 min readLW link

Experiments in Evaluating Steering Vectors

Gytis DaujotasJun 19, 2023, 3:11 PM
34 points
4 comments4 min readLW link

Policy for LLM Writing on LessWrong

jimrandomhMar 24, 2025, 9:41 PM
297 points
52 comments2 min readLW link

Detecting Strategic Deception Using Linear Probes

Feb 6, 2025, 3:46 PM
100 points
9 comments2 min readLW link
(arxiv.org)

[Question] Does a LLM have a utility function?

DagonDec 9, 2022, 5:19 PM
17 points
11 comments1 min readLW link

“textbooks are all you need”

bhauthJun 21, 2023, 5:06 PM
66 points
18 comments2 min readLW link
(arxiv.org)

Relational Speaking

jefftkJun 21, 2023, 2:40 PM
11 points
0 comments2 min readLW link
(www.jefftk.com)

Discovering Latent Knowledge in Language Models Without Supervision

XodarapDec 14, 2022, 12:32 PM
45 points
1 comment1 min readLW link
(arxiv.org)

Using Claude to convert dialog transcripts into great posts?

mako yassJun 21, 2023, 8:19 PM
6 points
4 comments4 min readLW link

Take 11: “Aligning language models” should be weirder.

Charlie SteinerDec 18, 2022, 2:14 PM
34 points
0 comments2 min readLW link

Exploring the petertodd / Leilan duality in GPT-2 and GPT-J

mwatkinsDec 23, 2024, 1:17 PM
12 points
1 comment17 min readLW link

Causal Graphs of GPT-2-Small’s Residual Stream

David UdellJul 9, 2024, 10:06 PM
53 points
7 comments7 min readLW link

Discovering Language Model Behaviors with Model-Written Evaluations

Dec 20, 2022, 8:08 PM
100 points
34 comments1 min readLW link
(www.anthropic.com)

Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic

AkashDec 20, 2022, 9:39 PM
18 points
2 comments11 min readLW link

Mapping the semantic void II: Above, below and between token embeddings

mwatkinsFeb 15, 2024, 11:00 PM
31 points
4 comments10 min readLW link

Douglas Hofstadter changes his mind on Deep Learning & AI risk (June 2023)?

gwernJul 3, 2023, 12:48 AM
426 points
54 comments7 min readLW link
(www.youtube.com)

Aggregative Principles of Social Justice

Cleo NardoJun 5, 2024, 1:44 PM
29 points
10 comments37 min readLW link

Mlyyrczo

lsusrDec 26, 2022, 7:58 AM
41 points
14 comments3 min readLW link

Goal-Direction for Simulated Agents

Raymond DJul 12, 2023, 5:06 PM
33 points
2 comments6 min readLW link

Gary Marcus now saying AI can’t do things it can already do

Benjamin_ToddFeb 9, 2025, 12:24 PM
61 points
12 comments1 min readLW link
(benjamintodd.substack.com)

‘simulator’ framing and confusions about LLMs

Beth BarnesDec 31, 2022, 11:38 PM
104 points
11 comments4 min readLW link

SociaLLM: proposal for a language model design for personalised apps, social science, and AI safety research

Roman LeventovDec 19, 2023, 4:49 PM
17 points
5 comments3 min readLW link

What’s up with LLMs representing XORs of arbitrary features?

Sam MarksJan 3, 2024, 7:44 PM
158 points
62 comments16 min readLW link

Activation adding experiments with llama-7b

Nina PanicksseryJul 16, 2023, 4:17 AM
51 points
1 comment3 min readLW link

Case for Foundation Models beyond English

Varshul GuptaJul 21, 2023, 1:59 PM
1 point
0 comments3 min readLW link
(dubverseblack.substack.com)

[Linkpost] Play with SAEs on Llama 3

Sep 25, 2024, 10:35 PM
40 points
2 comments1 min readLW link

Worrisome misunderstanding of the core issues with AI transition

Roman LeventovJan 18, 2024, 10:05 AM
5 points
2 comments4 min readLW link

Proposal for Inducing Steganography in LMs

Logan RiggsJan 12, 2023, 10:15 PM
22 points
3 comments2 min readLW link

[Linkpost] Scaling Laws for Generative Mixed-Modal Language Models

Amal Jan 12, 2023, 2:24 PM
15 points
2 comments1 min readLW link
(arxiv.org)

Watermarking considered overrated?

DanielFilanJul 31, 2023, 9:36 PM
19 points
4 comments1 min readLW link

[Question] Basic Question about LLMs: how do they know what task to perform

GarakJan 14, 2023, 1:13 PM
1 point
3 comments1 min readLW link

Understanding the diffusion of large language models: summary

Ben CottierJan 16, 2023, 1:37 AM
26 points
1 comment1 min readLW link

Language models can generate superior text compared to their input

ChristianKlJan 17, 2023, 10:57 AM
48 points
28 comments1 min readLW link

Thoughts on refusing harmful requests to large language models

William_SJan 19, 2023, 7:49 PM
32 points
4 comments2 min readLW link

GPT-4 can catch subtle cross-language translation mistakes

Michael TontchevJul 27, 2023, 1:39 AM
7 points
1 comment1 min readLW link

Reducing sycophancy and improving honesty via activation steering

Nina PanicksseryJul 28, 2023, 2:46 AM
122 points
18 comments9 min readLW link1 review

Mechanistically Eliciting Latent Behaviors in Language Models

Apr 30, 2024, 6:51 PM
207 points
43 comments45 min readLW link

Conditioning Predictive Models: Large language models as predictors

Feb 2, 2023, 8:28 PM
88 points
4 comments13 min readLW link

Conditioning Predictive Models: Outer alignment via careful conditioning

Feb 2, 2023, 8:28 PM
72 points
15 comments57 min readLW link

Paper: Tell, Don’t Show- Declarative facts influence how LLMs generalize

Dec 19, 2023, 7:14 PM
45 points
4 comments6 min readLW link
(arxiv.org)

Conditioning Predictive Models: The case for competitiveness

Feb 6, 2023, 8:08 PM
20 points
3 comments11 min readLW link

′ petertodd’’s last stand: The final days of open GPT-3 research

mwatkinsJan 22, 2024, 6:47 PM
109 points
16 comments45 min readLW link

SolidGoldMagikarp II: technical details and more recent findings

Feb 6, 2023, 7:09 PM
113 points
45 comments13 min readLW link

Finding Sparse Linear Connections between Features in LLMs

Dec 9, 2023, 2:27 AM
70 points
5 comments10 min readLW link

LLM Basics: Embedding Spaces—Transformer Token Vectors Are Not Points in Space

NickyPFeb 13, 2023, 6:52 PM
82 points
11 comments15 min readLW link

Conditioning Predictive Models: Interactions with other approaches

Feb 8, 2023, 6:19 PM
32 points
2 comments11 min readLW link

Notes on the Mathematics of LLM Architectures

carboniferous_umbraculum Feb 9, 2023, 1:45 AM
13 points
2 comments1 min readLW link
(drive.google.com)

Conditioning Predictive Models: Deployment strategy

Feb 9, 2023, 8:59 PM
28 points
0 comments10 min readLW link

Exploring SAE features in LLMs with definition trees and token lists

mwatkinsOct 4, 2024, 10:15 PM
37 points
5 comments6 min readLW link

In Defense of Chatbot Romance

Kaj_SotalaFeb 11, 2023, 2:30 PM
123 points
52 comments11 min readLW link
(kajsotala.fi)

[Question] Is InstructGPT Following Instructions in Other Languages Surprising?

DragonGodFeb 13, 2023, 11:26 PM
39 points
15 comments1 min readLW link

Bing Chat is blatantly, aggressively misaligned

evhubFeb 15, 2023, 5:29 AM
403 points
181 comments2 min readLW link1 review

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Aug 8, 2023, 1:30 AM
318 points
29 comments18 min readLW link1 review

Musings on Text Data Wall (Oct 2024)

Vladimir_NesovOct 5, 2024, 7:00 PM
40 points
2 comments5 min readLW link

Navigating LLM embedding spaces using archetype-based directions

mwatkinsMay 8, 2024, 5:54 AM
15 points
4 comments28 min readLW link

[Question] Why no major LLMs with memory?

Kaj_SotalaMar 28, 2023, 4:34 PM
42 points
15 comments1 min readLW link

Corrigibility, Self-Deletion, and Identical Strawberries

Robert_AIZIMar 28, 2023, 4:54 PM
9 points
2 comments6 min readLW link
(aizi.substack.com)

Three of my beliefs about upcoming AGI

Robert_AIZIMar 27, 2023, 8:27 PM
6 points
0 comments3 min readLW link
(aizi.substack.com)

[Question] Which parts of the existing internet are already likely to be in (GPT-5/other soon-to-be-trained LLMs)’s training corpus?

AnnaSalamonMar 29, 2023, 5:17 AM
49 points
2 comments1 min readLW link

Role Architectures: Applying LLMs to consequential tasks

Eric DrexlerMar 30, 2023, 3:00 PM
60 points
7 comments9 min readLW link

Paper: On measuring situational awareness in LLMs

Sep 4, 2023, 12:54 PM
109 points
16 comments5 min readLW link
(arxiv.org)

Examples of How I Use LLMs

jefftkOct 14, 2024, 5:10 PM
29 points
2 comments2 min readLW link
(www.jefftk.com)

What do language models know about fictional characters?

skybrianFeb 22, 2023, 5:58 AM
6 points
0 comments4 min readLW link

LLM Applications I Want To See

sarahconstantinAug 19, 2024, 9:10 PM
102 points
6 comments8 min readLW link
(sarahconstantin.substack.com)

Meta “open sources” LMs competitive with Chinchilla, PaLM, and code-davinci-002 (Paper)

LawrenceCFeb 24, 2023, 7:57 PM
38 points
19 comments1 min readLW link
(research.facebook.com)

A Proposed Test to Determine the Extent to Which Large Language Models Understand the Real World

Bruce GFeb 24, 2023, 8:20 PM
4 points
7 comments8 min readLW link

Evil autocomplete: Existential Risk and Next-Token Predictors

YitzFeb 28, 2023, 8:47 AM
9 points
3 comments5 min readLW link

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Feb 25, 2025, 5:39 PM
323 points
91 comments4 min readLW link

Why keep a diary, and why wish for large language models

DanielFilanJun 14, 2024, 4:10 PM
9 points
1 comment2 min readLW link
(danielfilan.com)

Mapping the semantic void: Strange goings-on in GPT embedding spaces

mwatkinsDec 14, 2023, 1:10 PM
114 points
31 comments14 min readLW link

Google’s PaLM-E: An Embodied Multimodal Language Model

SandXboxMar 7, 2023, 4:11 AM
87 points
7 comments1 min readLW link
(palm-e.github.io)

An explanation for every token: using an LLM to sample another LLM

Max HOct 11, 2023, 12:53 AM
35 points
5 comments11 min readLW link

Inferring the model dimension of API-protected LLMs

Ege ErdilMar 18, 2024, 6:19 AM
34 points
3 comments4 min readLW link
(arxiv.org)

Pre-registering a study

Robert_AIZIApr 7, 2023, 3:46 PM
10 points
0 comments6 min readLW link
(aizi.substack.com)

Upcoming Changes in Large Language Models

Andrew Keenan RichardsonApr 8, 2023, 3:41 AM
43 points
8 comments4 min readLW link
(mechanisticmind.com)

GPT can write Quines now (GPT-4)

Andrew_CritchMar 14, 2023, 7:18 PM
112 points
30 comments1 min readLW link

Nokens: A potential method of investigating glitch tokens

HoagyMar 15, 2023, 4:23 PM
21 points
0 comments4 min readLW link

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

Sep 21, 2023, 3:30 PM
159 points
8 comments5 min readLW link

Can I take ducks home from the park?

dynomightSep 14, 2023, 9:03 PM
67 points
8 comments3 min readLW link
(dynomight.net)

[Question] Will 2023 be the last year you can write short stories and receive most of the intellectual credit for writing them?

lcMar 16, 2023, 9:36 PM
20 points
11 comments1 min readLW link

Super-Luigi = Luigi + (Luigi - Waluigi)

AlexeiMar 17, 2023, 3:27 PM
16 points
9 comments1 min readLW link

LLM-Secured Systems: A General-Purpose Tool For Structured Transparency

ozziegooenJun 18, 2024, 12:21 AM
10 points
1 comment1 min readLW link

Does Chat-GPT display ‘Scope Insensitivity’?

callumDec 7, 2023, 6:58 PM
11 points
0 comments3 min readLW link

Scaffolded LLMs as natural language computers

berenApr 12, 2023, 10:47 AM
95 points
10 comments11 min readLW link

What does it mean for an LLM such as GPT to be aligned / good / positive impact?

PashaKamyshevMar 20, 2023, 9:21 AM
4 points
3 comments10 min readLW link

Jailbreak steering generalization

Jun 20, 2024, 5:25 PM
41 points
4 comments2 min readLW link
(arxiv.org)

RLHF does not appear to differentially cause mode-collapse

Mar 20, 2023, 3:39 PM
95 points
9 comments3 min readLW link

Claude 3.5 Sonnet

Zach Stein-PerlmanJun 20, 2024, 6:00 PM
75 points
41 comments1 min readLW link
(www.anthropic.com)

Attention Output SAEs Improve Circuit Analysis

Jun 21, 2024, 12:56 PM
33 points
3 comments19 min readLW link

Steering GPT-2-XL by adding an activation vector

May 13, 2023, 6:42 PM
437 points
98 comments50 min readLW link1 review

Paper: LLMs trained on “A is B” fail to learn “B is A”

Sep 23, 2023, 7:55 PM
121 points
74 comments4 min readLW link
(arxiv.org)

Teaching Claude to Meditate

Gordon Seidoh WorleyDec 29, 2024, 10:27 PM
−5 points
4 comments23 min readLW link

Some Quick Follow-Up Experiments to “Taken out of context: On measuring situational awareness in LLMs”

Miles TurpinOct 3, 2023, 2:22 AM
31 points
0 comments9 min readLW link

The ‘ petertodd’ phenomenon

mwatkinsApr 15, 2023, 12:59 AM
192 points
50 comments38 min readLW link1 review

I don’t find the lie detection results that surprising (by an author of the paper)

JanBOct 4, 2023, 5:10 PM
97 points
8 comments3 min readLW link

SmartyHeaderCode: anomalous tokens for GPT3.5 and GPT-4

AdamYedidiaApr 15, 2023, 10:35 PM
71 points
18 comments6 min readLW link

Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs

Michaël TrazziAug 24, 2024, 4:30 AM
55 points
0 comments5 min readLW link

“On the Impossibility of Superintelligent Rubik’s Cube Solvers”, Claude 2024 [humor]

gwernJun 23, 2024, 9:18 PM
22 points
6 comments1 min readLW link
(gwern.net)

Densing Law of LLMs

Bogdan Ionut CirsteaDec 8, 2024, 7:35 PM
9 points
2 comments1 min readLW link
(arxiv.org)

On Claude 3.5 Sonnet

ZviJun 24, 2024, 12:00 PM
95 points
14 comments13 min readLW link
(thezvi.wordpress.com)

Claude Doesn’t Want to Die

garrisonMar 5, 2024, 6:00 AM
22 points
3 comments1 min readLW link
(garrisonlovely.substack.com)

Revealing Intentionality In Language Models Through AdaVAE Guided Sampling

jdpOct 20, 2023, 7:32 AM
119 points
15 comments22 min readLW link

Do LLMs dream of emergent sheep?

ShmiApr 24, 2023, 3:26 AM
16 points
2 comments1 min readLW link

Eleuther releases Llemma: An Open Language Model For Mathematics

mako yassOct 17, 2023, 8:03 PM
22 points
0 comments1 min readLW link
(blog.eleuther.ai)

[Linkpost] Vague Verbiage in Forecasting

trevorMar 22, 2024, 6:05 PM
11 points
9 comments3 min readLW link
(goodjudgment.com)

Covert Malicious Finetuning

Jul 2, 2024, 2:41 AM
89 points
4 comments3 min readLW link

Towards Understanding Sycophancy in Language Models

Oct 24, 2023, 12:30 AM
66 points
0 comments2 min readLW link
(arxiv.org)

Romance, misunderstanding, social stances, and the human LLM

Kaj_SotalaApr 27, 2023, 12:59 PM
74 points
32 comments16 min readLW link

AI doom from an LLM-plateau-ist perspective

Steven ByrnesApr 27, 2023, 1:58 PM
158 points
24 comments6 min readLW link

Musings on LLM Scale (Jul 2024)

Vladimir_NesovJul 3, 2024, 6:35 PM
34 points
0 comments3 min readLW link

Cognitive Biases in Large Language Models

JanSep 25, 2021, 8:59 PM
18 points
3 comments12 min readLW link
(universalprior.substack.com)

NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG

OzyrusOct 11, 2021, 3:28 PM
51 points
36 comments1 min readLW link
(developer.nvidia.com)

NLP Position Paper: When Combatting Hype, Proceed with Caution

Sam BowmanOct 15, 2021, 8:57 PM
46 points
14 comments1 min readLW link

Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT

Robert_AIZIMar 5, 2024, 1:55 PM
61 points
24 comments10 min readLW link
(aizi.substack.com)

Forecasting progress in language models

Oct 28, 2021, 8:40 PM
62 points
6 comments12 min readLW link
(www.metaculus.com)

The Stochastic Parrot Hypothesis is debatable for the last generation of LLMs

Nov 7, 2023, 4:12 PM
52 points
21 comments6 min readLW link

Deepmind’s Gopher—more powerful than GPT-3

hathDec 8, 2021, 5:06 PM
86 points
26 comments1 min readLW link
(deepmind.com)

Teaser: Hard-coding Transformer Models

MadHatterDec 12, 2021, 10:04 PM
74 points
19 comments1 min readLW link

Lan­guage Model Align­ment Re­search Internships

Ethan PerezDec 13, 2021, 7:53 PM
74 points
1 comment1 min readLW link

A con­cep­tual pre­cur­sor to to­day’s lan­guage ma­chines [Shan­non]

Bill BenzonNov 15, 2023, 1:50 PM
24 points
6 comments2 min readLW link

AISC pro­ject: TinyEvals

Jett JaniakNov 22, 2023, 8:47 PM
22 points
0 comments4 min readLW link

What is scaf­fold­ing?

Mar 27, 2025, 9:06 AM
10 points
0 comments2 min readLW link
(aisafety.info)

The Method of Loci: With some brief re­marks, in­clud­ing trans­form­ers and eval­u­at­ing AIs

Bill BenzonDec 2, 2023, 2:36 PM
6 points
0 comments3 min readLW link

In­ter­view with Vanessa Kosoy on the Value of The­o­ret­i­cal Re­search for AI

WillPetilloDec 4, 2023, 10:58 PM
37 points
0 comments35 min readLW link

LLM keys—A Pro­posal of a Solu­tion to Prompt In­jec­tion Attacks

Peter HroššoDec 7, 2023, 5:36 PM
1 point
2 comments1 min readLW link

A Search for More ChatGPT /​ GPT-3.5 /​ GPT-4 “Un­speak­able” Glitch Tokens

Martin FellMay 9, 2023, 2:36 PM
26 points
9 comments6 min readLW link

LLM cog­ni­tion is prob­a­bly not hu­man-like

Max HMay 8, 2023, 1:22 AM
26 points
15 comments7 min readLW link

Lan­guage mod­els can ex­plain neu­rons in lan­guage models

nzMay 9, 2023, 5:29 PM
23 points
0 comments1 min readLW link
(openai.com)

Data and “to­kens” a 30 year old hu­man “trains” on

Jose Miguel Cruz y CelisMay 23, 2023, 5:34 AM
15 points
15 comments1 min readLW link

PCAST Work­ing Group on Gen­er­a­tive AI In­vites Public Input

Christopher KingMay 13, 2023, 10:49 PM
7 points
0 comments1 min readLW link
(terrytao.wordpress.com)

My cur­rent work­flow to study the in­ter­nal mechanisms of LLM

Yulu PiMay 16, 2023, 3:27 PM
4 points
0 comments1 min readLW link

The Com­pleat Cybornaut

May 19, 2023, 8:44 AM
65 points
2 comments16 min readLW link

See­ing Ghosts by GPT-4

Christopher KingMay 20, 2023, 12:11 AM
−13 points
0 comments1 min readLW link

Trans­former Ar­chi­tec­ture Choice for Re­sist­ing Prompt In­jec­tion and Jail-Break­ing Attacks

RogerDearnaleyMay 21, 2023, 8:29 AM
9 points
1 comment4 min readLW link

Microsoft and Google us­ing LLMs for Cybersecurity

PhosphorousMay 18, 2023, 5:42 PM
6 points
0 comments5 min readLW link

Pro­gram­ming AGI is impossible

Áron EcsenyiMay 30, 2023, 11:05 PM
1 point
0 comments4 min readLW link

Align­ing an H-JEPA agent via train­ing on the out­puts of an LLM-based “ex­em­plary ac­tor”

Roman LeventovMay 29, 2023, 11:08 AM
12 points
10 comments30 min readLW link

An LLM-based “ex­em­plary ac­tor”

Roman LeventovMay 29, 2023, 11:12 AM
16 points
0 comments12 min readLW link

Open Source LLMs Can Now Ac­tively Lie

Josh LevyJun 1, 2023, 10:03 PM
6 points
0 comments3 min readLW link

Un­faith­ful Ex­pla­na­tions in Chain-of-Thought Prompting

Miles TurpinJun 3, 2023, 12:22 AM
42 points
8 comments7 min readLW link

[Linkpost] Large Lan­guage Models Con­verge on Brain-Like Word Representations

Bogdan Ionut CirsteaJun 11, 2023, 11:20 AM
36 points
12 comments1 min readLW link

[Linkpost] Scal­ing laws for lan­guage en­cod­ing mod­els in fMRI

Bogdan Ionut CirsteaJun 8, 2023, 10:52 AM
30 points
0 comments1 min readLW link

[Linkpost] Faith and Fate: Limits of Trans­form­ers on Compositionality

Joe KwonJun 16, 2023, 3:04 PM
19 points
4 comments1 min readLW link
(arxiv.org)

[Linkpost] Map­ping Brains with Lan­guage Models: A Survey

Bogdan Ionut CirsteaJun 16, 2023, 9:49 AM
5 points
0 comments1 min readLW link

OpenAI in­tro­duces func­tion call­ing for GPT-4

Jun 20, 2023, 1:58 AM
24 points
3 comments4 min readLW link
(openai.com)

Ele­ments of Com­pu­ta­tional Philos­o­phy, Vol. I: Truth

Jul 1, 2023, 11:44 AM
12 points
6 comments1 min readLW link
(compphil.github.io)

[Linkpost] A shared lin­guis­tic space for trans­mit­ting our thoughts from brain to brain in nat­u­ral conversations

Bogdan Ionut CirsteaJul 1, 2023, 1:57 PM
17 points
2 comments1 min readLW link

Challenge pro­posal: small­est pos­si­ble self-hard­en­ing back­door for RLHF

Christopher KingJun 29, 2023, 4:56 PM
7 points
0 comments2 min readLW link

The world where LLMs are possible

Ape in the coatJul 10, 2023, 8:00 AM
20 points
10 comments3 min readLW link

Quick Thoughts on Lan­guage Models

RohanSJul 18, 2023, 8:38 PM
6 points
0 comments4 min readLW link

Un­safe AI as Dy­nam­i­cal Systems

Robert_AIZIJul 14, 2023, 3:31 PM
11 points
0 comments3 min readLW link
(aizi.substack.com)

Spec­u­la­tive in­fer­ences about path de­pen­dence in LLM su­per­vised fine-tun­ing from re­sults on lin­ear mode con­nec­tivity and model souping

RobertKirkJul 20, 2023, 9:56 AM
39 points
2 comments5 min readLW link

An­ti­ci­pa­tion in LLMs

derek shillerJul 24, 2023, 3:53 PM
6 points
0 comments13 min readLW link

AI Aware­ness through In­ter­ac­tion with Blatantly Alien Models

VojtaKovarikJul 28, 2023, 8:41 AM
7 points
5 comments3 min readLW link

[Linkpost] Mul­ti­modal Neu­rons in Pre­trained Text-Only Transformers

Bogdan Ionut CirsteaAug 4, 2023, 3:29 PM
11 points
0 comments1 min readLW link

[Linkpost] De­cep­tion Abil­ities Emerged in Large Lan­guage Models

Bogdan Ionut CirsteaAug 3, 2023, 5:28 PM
12 points
0 comments1 min readLW link

Re­searchers and writ­ers can ap­ply for proxy ac­cess to the GPT-3.5 base model (code-davinci-002)

ampdotDec 1, 2023, 6:48 PM
14 points
0 comments1 min readLW link
(airtable.com)

A Sim­ple The­ory Of Consciousness

SherlockHolmesAug 8, 2023, 6:05 PM
2 points
5 comments1 min readLW link
(peterholmes.medium.com)

In­flec­tion.ai is a ma­jor AGI lab

Nikola JurkovicAug 9, 2023, 1:05 AM
137 points
13 comments2 min readLW link

Ex­plor­ing the Mul­ti­verse of Large Lan­guage Models

frankyAug 6, 2023, 2:38 AM
1 point
0 comments5 min readLW link

Google Deep­Mind’s RT-2

SandXboxAug 11, 2023, 11:26 AM
9 points
1 comment1 min readLW link
(robotics-transformer2.github.io)

Co­her­ence Ther­apy with LLMs—quick demo

ChipmonkAug 14, 2023, 3:34 AM
19 points
11 comments1 min readLW link

[Question] Any re­search in “probe-tun­ing” of LLMs?

Roman LeventovAug 15, 2023, 9:01 PM
20 points
3 comments1 min readLW link

Memetic Judo #3: The In­tel­li­gence of Stochas­tic Par­rots v.2

Max TKAug 20, 2023, 3:18 PM
8 points
33 comments6 min readLW link

Re­fusal mechanisms: ini­tial ex­per­i­ments with Llama-2-7b-chat

Dec 8, 2023, 5:08 PM
81 points
7 comments7 min readLW link

An ad­ver­sar­ial ex­am­ple for Direct Logit At­tri­bu­tion: mem­ory man­age­ment in gelu-4l

Aug 30, 2023, 5:36 PM
17 points
0 comments8 min readLW link
(arxiv.org)

Xanadu, GPT, and Beyond: An ad­ven­ture of the mind

Bill BenzonAug 27, 2023, 4:19 PM
2 points
0 comments5 min readLW link

An In­ter­pretabil­ity Illu­sion for Ac­ti­va­tion Patch­ing of Ar­bi­trary Subspaces

Aug 29, 2023, 1:04 AM
77 points
4 comments1 min readLW link

Re­port on An­a­lyz­ing Con­no­ta­tion Frames in Evolv­ing Wikipe­dia Biographies

MairaAug 30, 2023, 10:02 PM
1 point
0 comments4 min readLW link

Can an LLM iden­tify ring-com­po­si­tion in a liter­ary text? [ChatGPT]

Bill BenzonSep 1, 2023, 2:18 PM
4 points
2 comments11 min readLW link

[Linkpost] Large lan­guage mod­els con­verge to­ward hu­man-like con­cept organization

Bogdan Ionut CirsteaSep 2, 2023, 6:00 AM
22 points
1 comment1 min readLW link

What must be the case that ChatGPT would have mem­o­rized “To be or not to be”? – Three kinds of con­cep­tual ob­jects for LLMs

Bill BenzonSep 3, 2023, 6:39 PM
19 points
0 comments12 min readLW link

Ac­tAdd: Steer­ing Lan­guage Models with­out Optimization

Sep 6, 2023, 5:21 PM
105 points
3 comments2 min readLW link
(arxiv.org)

World, mind, and learn­abil­ity: A note on the meta­phys­i­cal struc­ture of the cos­mos [& LLMs]

Bill BenzonSep 5, 2023, 12:19 PM
4 points
1 comment5 min readLW link

Au­to­mat­i­cally find­ing fea­ture vec­tors in the OV cir­cuits of Trans­form­ers with­out us­ing probing

Jacob DunefskySep 12, 2023, 5:38 PM
15 points
2 comments29 min readLW link

Eval­u­at­ing hid­den di­rec­tions on the util­ity dataset: clas­sifi­ca­tion, steer­ing and removal

Sep 25, 2023, 5:19 PM
25 points
3 comments7 min readLW link

Un­cov­er­ing La­tent Hu­man Wel­lbe­ing in LLM Embeddings

Sep 14, 2023, 1:40 AM
32 points
7 comments8 min readLW link
(far.ai)

[un­ti­tled post]

verwindungSep 14, 2023, 4:22 PM
1 point
0 comments1 min readLW link

Dis­cur­sive Com­pe­tence in ChatGPT, Part 2: Me­mory for Texts

Bill BenzonSep 28, 2023, 4:34 PM
1 point
0 comments3 min readLW link

Graph­i­cal ten­sor no­ta­tion for interpretability

Jordan TaylorOct 4, 2023, 8:04 AM
141 points
11 comments19 min readLW link

Image Hi­jacks: Ad­ver­sar­ial Images can Con­trol Gen­er­a­tive Models at Runtime

Sep 20, 2023, 3:23 PM
58 points
9 comments1 min readLW link
(arxiv.org)

Notes on ChatGPT’s “mem­ory” for strings and for events

Bill BenzonSep 20, 2023, 6:12 PM
3 points
0 comments10 min readLW link

A quick re­mark on so-called “hal­lu­ci­na­tions” in LLMs and hu­mans

Bill BenzonSep 23, 2023, 12:17 PM
4 points
4 comments1 min readLW link

Ex­pec­ta­tions for Gem­ini: hope­fully not a big deal

Maxime RichéOct 2, 2023, 3:38 PM
15 points
5 comments1 min readLW link

What would it mean to un­der­stand how a large lan­guage model (LLM) works? Some quick notes.

Bill BenzonOct 3, 2023, 3:11 PM
20 points
4 comments8 min readLW link

En­tan­gle­ment and in­tu­ition about words and mean­ing

Bill BenzonOct 4, 2023, 2:16 PM
4 points
0 comments2 min readLW link

[Question] What evidence is there of LLMs containing world models?

Chris_LeongOct 4, 2023, 2:33 PM
17 points
17 comments1 min readLW link

Un­der­stand­ing LLMs: Some ba­sic ob­ser­va­tions about words, syn­tax, and dis­course [w/​ a con­jec­ture about grokking]

Bill BenzonOct 11, 2023, 7:13 PM
6 points
0 comments5 min readLW link

VLM-RM: Spec­i­fy­ing Re­wards with Nat­u­ral Language

Oct 23, 2023, 2:11 PM
20 points
2 comments5 min readLW link
(far.ai)

Are (at least some) Large Lan­guage Models Holo­graphic Me­mory Stores?

Bill BenzonOct 20, 2023, 1:07 PM
11 points
4 comments6 min readLW link

ChatGPT tells 20 ver­sions of its pro­to­typ­i­cal story, with a short note on method

Bill BenzonOct 14, 2023, 3:27 PM
6 points
0 comments5 min readLW link

Map­ping ChatGPT’s on­tolog­i­cal land­scape, gra­di­ents and choices [in­ter­pretabil­ity]

Bill BenzonOct 15, 2023, 8:12 PM
1 point
0 comments18 min readLW link

ChatGPT Plays 20 Ques­tions [some­times needs help]

Bill BenzonOct 17, 2023, 5:30 PM
5 points
3 comments12 min readLW link

Thoughts on the Align­ment Im­pli­ca­tions of Scal­ing Lan­guage Models

leogaoJun 2, 2021, 9:32 PM
82 points
11 comments17 min readLW link

[AN #144]: How lan­guage mod­els can also be fine­tuned for non-lan­guage tasks

Rohin ShahApr 2, 2021, 5:20 PM
19 points
0 comments6 min readLW link
(mailchi.mp)

How truth­ful is GPT-3? A bench­mark for lan­guage models

Owain_EvansSep 16, 2021, 10:09 AM
58 points
24 comments6 min readLW link

[Question] How does OpenAI’s lan­guage model af­fect our AI timeline es­ti­mates?

jimrandomhFeb 15, 2019, 3:11 AM
50 points
7 comments1 min readLW link

Build­ing AGI Us­ing Lan­guage Models

leogaoNov 9, 2020, 4:33 PM
11 points
1 comment1 min readLW link
(leogao.dev)

The case for al­ign­ing nar­rowly su­per­hu­man models

Ajeya CotraMar 5, 2021, 10:29 PM
186 points
75 comments38 min readLW link1 review

The Codex Skep­tic FAQ

Michaël TrazziAug 24, 2021, 4:01 PM
49 points
24 comments2 min readLW link

On lan­guage mod­el­ing and fu­ture ab­stract rea­son­ing research

alexlyzhovMar 25, 2021, 5:43 PM
3 points
1 comment1 min readLW link
(docs.google.com)

Agen­tic Lan­guage Model Memes

FactorialCodeAug 1, 2020, 6:03 PM
16 points
1 comment2 min readLW link

[AN #164]: How well can lan­guage mod­els write code?

Rohin ShahSep 15, 2021, 5:20 PM
13 points
7 comments9 min readLW link
(mailchi.mp)

[AN #113]: Check­ing the eth­i­cal in­tu­itions of large lan­guage models

Rohin ShahAug 19, 2020, 5:10 PM
23 points
0 comments9 min readLW link
(mailchi.mp)

New GPT-3 competitor

Quintin PopeAug 12, 2021, 7:05 AM
32 points
10 comments1 min readLW link

OpenAI Codex: First Impressions

specbugAug 13, 2021, 4:52 PM
49 points
8 comments4 min readLW link
(sixeleven.in)

AMA on Truth­ful AI: Owen Cot­ton-Bar­ratt, Owain Evans & co-authors

Owain_EvansOct 22, 2021, 4:23 PM
31 points
15 comments1 min readLW link

Truth­ful and hon­est AI

Oct 29, 2021, 7:28 AM
42 points
1 comment13 min readLW link

larger lan­guage mod­els may dis­ap­point you [or, an eter­nally un­finished draft]

nostalgebraistNov 26, 2021, 11:08 PM
260 points
31 comments31 min readLW link2 reviews

Hard-Cod­ing Neu­ral Computation

MadHatterDec 13, 2021, 4:35 AM
34 points
8 comments27 min readLW link

Ev­i­dence Sets: Towards In­duc­tive-Bi­ases based Anal­y­sis of Pro­saic AGI

bayesian_kittenDec 16, 2021, 10:41 PM
22 points
10 comments21 min readLW link

GPT-3: a dis­ap­point­ing paper

nostalgebraistMay 29, 2020, 7:06 PM
65 points
43 comments8 min readLW link1 review

A Sum­mary Of An­thropic’s First Paper

Sam RingerDec 30, 2021, 12:48 AM
85 points
1 comment8 min readLW link

How I’m think­ing about GPT-N

delton137Jan 17, 2022, 5:11 PM
54 points
21 comments18 min readLW link

Ex­trap­o­lat­ing GPT-N performance

Lukas FinnvedenDec 18, 2020, 9:41 PM
112 points
31 comments22 min readLW link1 review

2+2: On­tolog­i­cal Framework

LyrialtusFeb 1, 2022, 1:07 AM
−15 points
2 comments12 min readLW link

EleutherAI’s GPT-NeoX-20B release

leogaoFeb 10, 2022, 6:56 AM
30 points
3 comments1 min readLW link
(eaidata.bmk.sh)

New GPT3 Im­pres­sive Ca­pa­bil­ities—In­struc­tGPT3 [1/​2]

simeon_cMar 13, 2022, 10:58 AM
72 points
10 comments7 min readLW link

Gears-Level Men­tal Models of Trans­former Interpretability

KevinRoWangMar 29, 2022, 8:09 PM
72 points
4 comments6 min readLW link

My agenda for re­search into trans­former ca­pa­bil­ities—Introduction

p.b.Apr 5, 2022, 9:23 PM
11 points
1 comment3 min readLW link

Re­search agenda: Can trans­form­ers do sys­tem 2 think­ing?

p.b.Apr 6, 2022, 1:31 PM
20 points
0 comments2 min readLW link

PaLM in “Ex­trap­o­lat­ing GPT-N perfor­mance”

Lukas FinnvedenApr 6, 2022, 1:05 PM
85 points
19 comments2 min readLW link

Re­search agenda—Build­ing a multi-modal chess-lan­guage model

p.b.Apr 7, 2022, 12:25 PM
8 points
2 comments2 min readLW link

Is GPT3 a Good Ra­tion­al­ist? - In­struc­tGPT3 [2/​2]

simeon_cApr 7, 2022, 1:46 PM
11 points
0 comments7 min readLW link

Elicit: Lan­guage Models as Re­search Assistants

Apr 9, 2022, 2:56 PM
71 points
6 comments13 min readLW link

[Question] “Frag­ility of Value” vs. LLMs

Not RelevantApr 13, 2022, 2:02 AM
34 points
33 comments1 min readLW link

Why Copi­lot Ac­cel­er­ates Timelines

Michaël TrazziApr 26, 2022, 10:06 PM
35 points
14 comments7 min readLW link

A pos­si­ble check against mo­ti­vated rea­son­ing us­ing elicit.org

david reinsteinMay 18, 2022, 8:52 PM
3 points
0 comments1 min readLW link

RL with KL penalties is bet­ter seen as Bayesian inference

May 25, 2022, 9:23 AM
114 points
17 comments12 min readLW link

QNR prospects are im­por­tant for AI al­ign­ment research

Eric DrexlerFeb 3, 2022, 3:20 PM
85 points
12 comments11 min readLW link1 review

Who mod­els the mod­els that model mod­els? An ex­plo­ra­tion of GPT-3′s in-con­text model fit­ting ability

LovreJun 7, 2022, 7:37 PM
112 points
16 comments9 min readLW link

[linkpost] The fi­nal AI bench­mark: BIG-bench

RomanSJun 10, 2022, 8:53 AM
25 points
21 comments1 min readLW link

In­ves­ti­gat­ing causal un­der­stand­ing in LLMs

Jun 14, 2022, 1:57 PM
28 points
6 comments13 min readLW link

Con­tra Hofs­tadter on GPT-3 Nonsense

ricticJun 15, 2022, 9:53 PM
237 points
24 comments2 min readLW link

Causal con­fu­sion as an ar­gu­ment against the scal­ing hypothesis

Jun 20, 2022, 10:54 AM
86 points
30 comments15 min readLW link

Yann LeCun, A Path Towards Au­tonomous Ma­chine In­tel­li­gence [link]

Bill BenzonJun 27, 2022, 11:29 PM
5 points
1 comment1 min readLW link

An­nounc­ing the In­verse Scal­ing Prize ($250k Prize Pool)

Jun 27, 2022, 3:58 PM
171 points
14 comments7 min readLW link

GPT-3 Catch­ing Fish in Morse Code

Megan KinnimentJun 30, 2022, 9:22 PM
117 points
27 comments8 min readLW link

Train­ing goals for large lan­guage models

Johannes TreutleinJul 18, 2022, 7:09 AM
28 points
5 comments19 min readLW link

Help ARC eval­u­ate ca­pa­bil­ities of cur­rent lan­guage mod­els (still need peo­ple)

Beth BarnesJul 19, 2022, 4:55 AM
95 points
6 comments2 min readLW link

Con­di­tion­ing Gen­er­a­tive Models with Restrictions

Adam JermynJul 21, 2022, 8:33 PM
18 points
4 comments8 min readLW link

Ex­ter­nal­ized rea­son­ing over­sight: a re­search di­rec­tion for lan­guage model alignment

tameraAug 3, 2022, 12:03 PM
135 points
23 comments6 min readLW link

Trans­former lan­guage mod­els are do­ing some­thing more general

NumendilAug 3, 2022, 9:13 PM
53 points
6 comments2 min readLW link

Con­di­tion­ing, Prompts, and Fine-Tuning

Adam JermynAug 17, 2022, 8:52 PM
38 points
9 comments4 min readLW link

Google AI in­te­grates PaLM with robotics: SayCan up­date [Linkpost]

Evan R. MurphyAug 24, 2022, 8:54 PM
25 points
0 comments1 min readLW link
(sites.research.google)

Is train­ing data go­ing to be diluted by AI-gen­er­ated con­tent?

Hannes ThurnherrSep 7, 2022, 6:13 PM
10 points
7 comments1 min readLW link

How should Deep­Mind’s Chin­chilla re­vise our AI fore­casts?

Cleo NardoSep 15, 2022, 5:54 PM
35 points
12 comments13 min readLW link

Take­aways from our ro­bust in­jury clas­sifier pro­ject [Red­wood Re­search]

dmzSep 17, 2022, 3:55 AM
143 points
12 comments6 min readLW link1 review

[Question] If we have Hu­man-level chat­bots, won’t we end up be­ing ruled by pos­si­ble peo­ple?

Erlja Jkdf.Sep 20, 2022, 1:59 PM
5 points
13 comments1 min readLW link

An Un­ex­pected GPT-3 De­ci­sion in a Sim­ple Gam­ble

casualphysicsenjoyerSep 25, 2022, 4:46 PM
8 points
4 comments1 min readLW link

Re­call and Re­gur­gi­ta­tion in GPT2

Megan KinnimentOct 3, 2022, 7:35 PM
43 points
1 comment26 min readLW link

Brief Notes on Transformers

Adam JermynSep 26, 2022, 2:46 PM
48 points
3 comments2 min readLW link

Paper: Large Lan­guage Models Can Self-im­prove [Linkpost]

Evan R. MurphyOct 2, 2022, 1:29 AM
52 points
15 comments1 min readLW link
(openreview.net)

Smoke with­out fire is scary

Adam JermynOct 4, 2022, 9:08 PM
52 points
22 comments4 min readLW link

They gave LLMs ac­cess to physics simulators

ryan_bOct 17, 2022, 9:21 PM
50 points
18 comments1 min readLW link
(arxiv.org)

Is GPT-N bounded by hu­man ca­pa­bil­ities? No.

Cleo NardoOct 17, 2022, 11:26 PM
49 points
8 comments2 min readLW link

Learn­ing so­cietal val­ues from law as part of an AGI al­ign­ment strategy

John NayOct 21, 2022, 2:03 AM
5 points
18 comments54 min readLW link

What will the scaled up GATO look like? (Up­dated with ques­tions)

Amal Oct 25, 2022, 12:44 PM
34 points
22 comments1 min readLW link

[simu­la­tion] 4chan user claiming to be the at­tor­ney hired by Google’s sen­tient chat­bot LaMDA shares wild de­tails of encounter

janusNov 10, 2022, 9:39 PM
19 points
1 comment13 min readLW link
(generative.ink)

Hu­man-level Full-Press Di­plo­macy (some bare facts).

Cleo NardoNov 22, 2022, 8:59 PM
50 points
7 comments3 min readLW link

Gliders in Lan­guage Models

Alexandre VariengienNov 25, 2022, 12:38 AM
30 points
11 comments10 min readLW link

[ASoT] Fine­tun­ing, RL, and GPT’s world prior

JozdienDec 2, 2022, 4:33 PM
44 points
8 comments5 min readLW link

[Question] Will the first AGI agent have been de­signed as an agent (in ad­di­tion to an AGI)?

nahojDec 3, 2022, 8:32 PM
1 point
8 comments1 min readLW link

Is the “Valley of Con­fused Ab­strac­tions” real?

jacquesthibsDec 5, 2022, 1:36 PM
20 points
11 comments2 min readLW link

Shh, don’t tell the AI it’s likely to be evil

naterushDec 6, 2022, 3:35 AM
19 points
9 comments1 min readLW link

Pro­saic mis­al­ign­ment from the Solomonoff Predictor

Cleo NardoDec 9, 2022, 5:53 PM
42 points
3 comments5 min readLW link

A brain­teaser for lan­guage models

Adam ScherlisDec 12, 2022, 2:43 AM
47 points
3 comments2 min readLW link

An ex­plo­ra­tion of GPT-2′s em­bed­ding weights

Adam ScherlisDec 13, 2022, 12:46 AM
44 points
4 comments10 min readLW link

Ex­tract­ing and Eval­u­at­ing Causal Direc­tion in LLMs’ Activations

Dec 14, 2022, 2:33 PM
29 points
5 comments11 min readLW link

Prop­er­ties of cur­rent AIs and some pre­dic­tions of the evolu­tion of AI from the per­spec­tive of scale-free the­o­ries of agency and reg­u­la­tive development

Roman LeventovDec 20, 2022, 5:13 PM
33 points
3 comments36 min readLW link

Notes on Meta’s Di­plo­macy-Play­ing AI

Erich_GrunewaldDec 22, 2022, 11:34 AM
15 points
2 comments14 min readLW link
(www.erichgrunewald.com)

The Limit of Lan­guage Models

DragonGodJan 6, 2023, 11:53 PM
44 points
26 comments4 min readLW link

How evolu­tion­ary lineages of LLMs can plan their own fu­ture and act on these plans

Roman LeventovDec 25, 2022, 6:11 PM
39 points
16 comments8 min readLW link

Re­cent ad­vances in Nat­u­ral Lan­guage Pro­cess­ing—Some Woolly spec­u­la­tions (2019 es­say on se­man­tics and lan­guage mod­els)

philosophybearDec 27, 2022, 2:11 AM
1 point
0 comments7 min readLW link

Some Ar­gu­ments Against Strong Scaling

Joar SkalseJan 13, 2023, 12:04 PM
25 points
21 comments16 min readLW link

Large lan­guage mod­els can provide “nor­ma­tive as­sump­tions” for learn­ing hu­man preferences

Stuart_ArmstrongJan 2, 2023, 7:39 PM
29 points
12 comments3 min readLW link

MAKE IT BETTER (a po­etic demon­stra­tion of the ba­nal­ity of GPT-3)

rogersbaconJan 2, 2023, 8:47 PM
7 points
2 comments5 min readLW link

On the nat­u­ral­is­tic study of the lin­guis­tic be­hav­ior of ar­tifi­cial intelligence

Bill BenzonJan 3, 2023, 9:06 AM
1 point
0 comments4 min readLW link

Whisper’s Wild Implications

Ollie JJan 3, 2023, 12:17 PM
19 points
6 comments5 min readLW link

Speculation on Path-Dependence in Large Language Models

NickyPJan 15, 2023, 8:42 PM
16 points
2 comments7 min readLW link

Cri­tique of some re­cent philos­o­phy of LLMs’ minds

Roman LeventovJan 20, 2023, 12:53 PM
52 points
8 comments20 min readLW link

Emo­tional at­tach­ment to AIs opens doors to problems

Igor IvanovJan 22, 2023, 8:28 PM
20 points
10 comments4 min readLW link

ChatGPT in­ti­mates a tan­ta­l­iz­ing fu­ture; its core LLM is or­ga­nized on mul­ti­ple lev­els; and it has bro­ken the idea of think­ing.

Bill BenzonJan 24, 2023, 7:05 PM
5 points
0 comments5 min readLW link

In­ner Misal­ign­ment in “Si­mu­la­tor” LLMs

Adam ScherlisJan 31, 2023, 8:33 AM
84 points
12 comments4 min readLW link

Early situ­a­tional aware­ness and its im­pli­ca­tions, a story

Jacob PfauFeb 6, 2023, 8:45 PM
29 points
6 comments3 min readLW link

Two very differ­ent ex­pe­riences with ChatGPT

SherrinfordFeb 7, 2023, 1:09 PM
38 points
15 comments5 min readLW link

On The Cur­rent Sta­tus Of AI Dating

Nikita BrancatisanoFeb 7, 2023, 8:00 PM
52 points
8 comments6 min readLW link

A note on ‘semiotic physics’

metasemiFeb 11, 2023, 5:12 AM
11 points
13 comments6 min readLW link

A poem co-writ­ten by ChatGPT

SherrinfordFeb 16, 2023, 10:17 AM
13 points
0 comments7 min readLW link

Pow­er­ful mesa-op­ti­mi­sa­tion is already here

Roman LeventovFeb 17, 2023, 4:59 AM
35 points
1 comment2 min readLW link
(arxiv.org)

Bing chat is the AI fire alarm

RatiosFeb 17, 2023, 6:51 AM
115 points
63 comments3 min readLW link

Microsoft and OpenAI, stop tel­ling chat­bots to role­play as AI

hold_my_fishFeb 17, 2023, 7:55 PM
49 points
10 comments1 min readLW link

GPT-4 Predictions

Stephen McAleeseFeb 17, 2023, 11:20 PM
110 points
27 comments11 min readLW link

Stop post­ing prompt in­jec­tions on Twit­ter and call­ing it “mis­al­ign­ment”

lcFeb 19, 2023, 2:21 AM
144 points
9 comments1 min readLW link

Syd­ney the Bin­gena­tor Can’t Think, But It Still Threat­ens People

Valentin BaltadzhievFeb 20, 2023, 6:37 PM
−3 points
2 comments8 min readLW link

The idea that ChatGPT is sim­ply “pre­dict­ing” the next word is, at best, misleading

Bill BenzonFeb 20, 2023, 11:32 AM
55 points
88 comments5 min readLW link

Pre­train­ing Lan­guage Models with Hu­man Preferences

Feb 21, 2023, 5:57 PM
135 points
20 comments11 min readLW link2 reviews

[Preprint] Pre­train­ing Lan­guage Models with Hu­man Preferences

GiulioFeb 21, 2023, 11:44 AM
12 points
0 comments1 min readLW link
(arxiv.org)

[Question] In­ject­ing noise to GPT to get mul­ti­ple answers

bipoloFeb 22, 2023, 8:02 PM
1 point
1 comment1 min readLW link

Reflec­tion Mechanisms as an Align­ment Tar­get—At­ti­tudes on “near-term” AI

Mar 2, 2023, 4:29 AM
21 points
0 comments8 min readLW link

Si­tu­a­tional aware­ness in Large Lan­guage Models

Simon MöllerMar 3, 2023, 6:59 PM
31 points
2 comments7 min readLW link

The View from 30,000 Feet: Pre­face to the Se­cond EleutherAI Retrospective

Mar 7, 2023, 4:22 PM
14 points
0 comments4 min readLW link
(blog.eleuther.ai)

Against LLM Reductionism

Erich_GrunewaldMar 8, 2023, 3:52 PM
140 points
17 comments18 min readLW link
(www.erichgrunewald.com)

Stop call­ing it “jailbreak­ing” ChatGPT

TemplarrrMar 10, 2023, 11:41 AM
7 points
9 comments2 min readLW link

The is­sue of mean­ing in large lan­guage mod­els (LLMs)

Bill BenzonMar 11, 2023, 11:00 PM
1 point
34 comments8 min readLW link

ChatGPT (and now GPT4) is very eas­ily dis­tracted from its rules

dmcsMar 15, 2023, 5:55 PM
180 points
42 comments1 min readLW link

Grad­ual take­off, fast failure

Max HMar 16, 2023, 10:02 PM
15 points
4 comments5 min readLW link

[Question] Are nested jailbreaks in­evitable?

judsonMar 17, 2023, 5:43 PM
1 point
0 comments1 min readLW link

GPTs’ abil­ity to keep a se­cret is weirdly prompt-dependent

Jul 22, 2023, 12:21 PM
31 points
0 comments9 min readLW link

In­stan­ti­at­ing an agent with GPT-4 and text-davinci-003

Max HMar 19, 2023, 11:57 PM
13 points
3 comments32 min readLW link

[Question] Would it be useful to collect the contexts where various LLMs think the same?

Martin VlachAug 24, 2023, 10:01 PM
6 points
1 comment1 min readLW link

Cat­e­gor­i­cal Or­ga­ni­za­tion in Me­mory: ChatGPT Or­ga­nizes the 665 Topic Tags from My New Sa­vanna Blog

Bill BenzonDec 14, 2023, 1:02 PM
0 points
6 comments2 min readLW link

A vi­sual anal­ogy for text gen­er­a­tion by LLMs?

Bill BenzonDec 16, 2023, 5:58 PM
3 points
0 comments1 min readLW link

Dis­cus­sion: Challenges with Un­su­per­vised LLM Knowl­edge Discovery

Dec 18, 2023, 11:58 AM
147 points
21 comments10 min readLW link

Lifel­og­ging for Align­ment & Immortality

Dev.ErrataAug 17, 2024, 11:42 PM
13 points
3 comments7 min readLW link

Ap­proach­ing Hu­man-Level Fore­cast­ing with Lan­guage Models

Feb 29, 2024, 10:36 PM
60 points
6 comments3 min readLW link

We In­spected Every Head In GPT-2 Small us­ing SAEs So You Don’t Have To

Mar 6, 2024, 5:03 AM
63 points
0 comments12 min readLW link

Ex­plor­ing the Resi­d­ual Stream of Trans­form­ers for Mechanis­tic In­ter­pretabil­ity — Explained

Zeping YuDec 26, 2023, 12:36 AM
7 points
1 comment11 min readLW link

The fu­ture of Hu­mans: Oper­a­tors of AI

François-Joseph LacroixDec 30, 2023, 11:46 PM
1 point
0 comments1 min readLW link
(medium.com)

Does ChatGPT know what a tragedy is?

Bill BenzonDec 31, 2023, 7:10 AM
2 points
4 comments5 min readLW link

Are SAE fea­tures from the Base Model still mean­ingful to LLaVA?

Shan23ChenDec 5, 2024, 7:24 PM
5 points
2 comments10 min readLW link

Are SAE fea­tures from the Base Model still mean­ingful to LLaVA?

Shan23ChenFeb 18, 2025, 10:16 PM
8 points
2 comments10 min readLW link
(www.lesswrong.com)

An­nounc­ing the Dou­ble Crux Bot

Jan 9, 2024, 6:54 PM
53 points
10 comments3 min readLW link

Sparse Au­toen­coders Work on At­ten­tion Layer Outputs

Jan 16, 2024, 12:26 AM
83 points
9 comments18 min readLW link

Just be­cause an LLM said it doesn’t mean it’s true: an illus­tra­tive example

dirkAug 21, 2024, 9:05 PM
26 points
12 comments3 min readLW link

What’s go­ing on with Per-Com­po­nent Weight Up­dates?

4gateAug 22, 2024, 9:22 PM
1 point
0 comments6 min readLW link

Maybe talk­ing isn’t the best way to com­mu­ni­cate with LLMs

mnvrJan 17, 2024, 6:24 AM
3 points
1 comment1 min readLW link
(mrmr.io)

OpenAI Credit Ac­count (2510$)

Emirhan BULUTJan 21, 2024, 2:30 AM
1 point
0 comments1 min readLW link

In­terLab – a toolkit for ex­per­i­ments with multi-agent interactions

Jan 22, 2024, 6:23 PM
69 points
0 comments8 min readLW link
(acsresearch.org)

Pre­dict­ing AGI by the Tur­ing Test

Yuxi_LiuJan 22, 2024, 4:22 AM
21 points
2 comments10 min readLW link
(yuxi-liu-wired.github.io)

RAND re­port finds no effect of cur­rent LLMs on vi­a­bil­ity of bioter­ror­ism attacks

StellaAthenaJan 25, 2024, 7:17 PM
94 points
14 comments1 min readLW link
(www.rand.org)

Put­ting mul­ti­modal LLMs to the Tetris test

Feb 1, 2024, 4:02 PM
30 points
5 comments7 min readLW link

De­cep­tion and Jailbreak Se­quence: 1. Iter­a­tive Refine­ment Stages of De­cep­tion in LLMs

Aug 22, 2024, 7:32 AM
23 points
1 comment21 min readLW link

Why I take short timelines seriously

NicholasKeesJan 28, 2024, 10:27 PM
122 points
29 comments4 min readLW link

Un­der­stand­ing SAE Fea­tures with the Logit Lens

Mar 11, 2024, 12:16 AM
68 points
0 comments14 min readLW link

The case for more am­bi­tious lan­guage model evals

JozdienJan 30, 2024, 12:01 AM
112 points
30 comments5 min readLW link

Look­ing be­yond Everett in mul­ti­ver­sal views of LLMs

kromemMay 29, 2024, 12:35 PM
10 points
0 comments8 min readLW link

In­duc­ing hu­man-like bi­ases in moral rea­son­ing LMs

Feb 20, 2024, 4:28 PM
23 points
3 comments14 min readLW link

At­ten­tion SAEs Scale to GPT-2 Small

Feb 3, 2024, 6:50 AM
78 points
4 comments8 min readLW link

Re­quire­ments for a Basin of At­trac­tion to Alignment

RogerDearnaleyFeb 14, 2024, 7:10 AM
41 points
12 comments31 min readLW link

De­bat­ing with More Per­sua­sive LLMs Leads to More Truth­ful Answers

Feb 7, 2024, 9:28 PM
88 points
14 comments9 min readLW link
(arxiv.org)

What’s ChatGPT’s Fa­vorite Ice Cream Fla­vor? An In­ves­ti­ga­tion Into Syn­thetic Respondents

Greg RobisonFeb 9, 2024, 6:38 PM
19 points
4 comments15 min readLW link

The Last Laugh: Ex­plor­ing the Role of Hu­mor as a Bench­mark for Large Lan­guage Models

Greg RobisonFeb 12, 2024, 6:34 PM
4 points
6 comments11 min readLW link

Bias-Aug­mented Con­sis­tency Train­ing Re­duces Bi­ased Rea­son­ing in Chain-of-Thought

Miles TurpinMar 11, 2024, 11:46 PM
16 points
0 comments1 min readLW link
(arxiv.org)

[Question] What ex­per­i­ment set­tles the Gary Mar­cus vs Ge­offrey Hin­ton de­bate?

Valentin BaltadzhievFeb 14, 2024, 9:06 AM
12 points
8 comments1 min readLW link

[Question] Can any LLM be rep­re­sented as an Equa­tion?

Valentin BaltadzhievMar 14, 2024, 9:51 AM
1 point
2 comments1 min readLW link

Can quan­tised au­toen­coders find and in­ter­pret cir­cuits in lan­guage mod­els?

charlieoneillMar 24, 2024, 8:05 PM
28 points
4 comments24 min readLW link

Phal­lo­cen­tric­ity in GPT-J’s bizarre strat­ified ontology

mwatkinsFeb 17, 2024, 12:16 AM
50 points
37 comments9 min readLW link

Many ar­gu­ments for AI x-risk are wrong

TurnTroutMar 5, 2024, 2:31 AM
158 points
87 comments12 min readLW link

Lan­guage Models Don’t Learn the Phys­i­cal Man­i­fes­ta­tion of Language

Feb 22, 2024, 6:52 PM
39 points
23 comments1 min readLW link
(arxiv.org)

The role of philosophical thinking in understanding large language models: Calibrating and closing the gap between first-person experience and underlying mechanisms

Bill Benzon · Feb 23, 2024, 12:19 PM
4 points
0 comments · 10 min read · LW link

Instrumental deception and manipulation in LLMs—a case study

Olli Järviniemi · Feb 24, 2024, 2:07 AM
39 points
13 comments · 12 min read · LW link

Introducing METR’s Autonomy Evaluation Resources

Mar 15, 2024, 11:16 PM
90 points
0 comments · 1 min read · LW link
(metr.github.io)

XAI releases Grok base model

Jacob G-W · Mar 18, 2024, 12:47 AM
11 points
3 comments · 1 min read · LW link
(x.ai)

Sparse Autoencoder Features for Classifications and Transferability

Shan23Chen · Feb 18, 2025, 10:14 PM
5 points
0 comments · 1 min read · LW link
(arxiv.org)

New LLM Scaling Law

wrmedford · Feb 19, 2025, 8:21 PM
2 points
0 comments · 1 min read · LW link
(github.com)

The Information: OpenAI shows ‘Strawberry’ to feds, races to launch it

Martín Soto · Aug 27, 2024, 11:10 PM
145 points
15 comments · 3 min read · LW link

[Question] Could LLMs Help Generate New Concepts in Human Language?

Pekka Lampelto · Mar 24, 2024, 8:13 PM
10 points
4 comments · 2 min read · LW link

Your LLM Judge may be biased

Mar 29, 2024, 4:39 PM
37 points
9 comments · 6 min read · LW link

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM

Winnie Yang · Aug 28, 2024, 8:41 AM
7 points
2 comments · 31 min read · LW link

Language and Capabilities: Testing LLM Mathematical Abilities Across Languages

Ethan Edwards · Apr 4, 2024, 1:18 PM
24 points
2 comments · 36 min read · LW link

End-to-end hacking with language models

tchauvin · Apr 5, 2024, 3:06 PM
29 points
0 comments · 8 min read · LW link

[Question] Is LLM Translation Without Rosetta Stone possible?

cubefox · Apr 11, 2024, 12:36 AM
14 points
14 comments · 1 min read · LW link

Is Wittgenstein’s Language Game used when helping Ai understand language?

VisionaryHera · Jun 4, 2024, 7:41 AM
3 points
7 comments · 1 min read · LW link

Can Large Language Models effectively identify cybersecurity risks?

emile delcourt · Aug 30, 2024, 8:20 PM
18 points
0 comments · 11 min read · LW link

Claude wants to be conscious

Joe Kwon · Apr 13, 2024, 1:40 AM
2 points
8 comments · 6 min read · LW link

[Question] Barcoding LLM Training Data Subsets. Anyone trying this for interpretability?

right..enough? · Apr 13, 2024, 3:09 AM
7 points
0 comments · 7 min read · LW link

Experiments with an alternative method to promote sparsity in sparse autoencoders

Eoin Farrell · Apr 15, 2024, 6:21 PM
29 points
7 comments · 12 min read · LW link

Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity.

Josh Levy · Jun 4, 2024, 3:45 PM
39 points
0 comments · 18 min read · LW link

Inducing Unprompted Misalignment in LLMs

Apr 19, 2024, 8:00 PM
38 points
7 comments · 16 min read · LW link

How LLMs Work, in the Style of The Economist

utilistrutil · Apr 22, 2024, 7:06 PM
0 points
0 comments · 2 min read · LW link

At last! ChatGPT does, shall we say, interesting imitations of “Kubla Khan”

Bill Benzon · Apr 24, 2024, 2:56 PM
−3 points
0 comments · 4 min read · LW link

Redundant Attention Heads in Large Language Models For In Context Learning

skunnavakkam · Sep 1, 2024, 8:08 PM
7 points
1 comment · 4 min read · LW link
(skunnavakkam.github.io)

LLMs seem (relatively) safe

JustisMills · Apr 25, 2024, 10:13 PM
53 points
24 comments · 7 min read · LW link
(justismills.substack.com)

An interesting mathematical model of how LLMs work

Bill Benzon · Apr 30, 2024, 11:01 AM
5 points
0 comments · 1 min read · LW link

LLMs could be as conscious as human emulations, potentially

Canaletto · Apr 30, 2024, 11:36 AM
15 points
15 comments · 3 min read · LW link

On precise out-of-context steering

Olli Järviniemi · May 3, 2024, 9:41 AM
9 points
6 comments · 3 min read · LW link

Relationships among words, metalingual definition, and interpretability

Bill Benzon · Jun 7, 2024, 7:18 PM
2 points
0 comments · 5 min read · LW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

May 6, 2024, 7:07 AM
95 points
13 comments · 1 min read · LW link
(arxiv.org)

If language is for communication, what does that imply about LLMs?

Bill Benzon · May 12, 2024, 2:55 AM
10 points
0 comments · 1 min read · LW link

Language Models Model Us

eggsyntax · May 17, 2024, 9:00 PM
158 points
55 comments · 7 min read · LW link

The Intelligent Meme Machine

Daniel DiSisto · Jun 14, 2024, 2:26 PM
1 point
0 comments · 6 min read · LW link

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Henry Cai · Jun 16, 2024, 1:01 PM
7 points
0 comments · 7 min read · LW link
(arxiv.org)

Lamini’s Targeted Hallucination Reduction May Be a Big Deal for Job Automation

sweenesm · Jun 18, 2024, 3:29 PM
3 points
0 comments · 1 min read · LW link

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Jun 21, 2024, 3:54 PM
163 points
13 comments · 8 min read · LW link
(arxiv.org)

LLM Generality is a Timeline Crux

eggsyntax · Jun 24, 2024, 12:52 PM
217 points
119 comments · 7 min read · LW link

Workshop: Interpretability in LLMs using Geometric and Statistical Methods

Karthik Viswanathan · Feb 22, 2025, 9:39 AM
12 points
0 comments · 2 min read · LW link

Live Theory Part 0: Taking Intelligence Seriously

Sahil · Jun 26, 2024, 9:37 PM
103 points
3 comments · 8 min read · LW link

Checking public figures on whether they “answered the question”: quick analysis from Harris/Trump debate, and a proposal

david reinstein · Sep 11, 2024, 8:25 PM
7 points
4 comments · 1 min read · LW link
(open.substack.com)

Keeping content out of LLM training datasets

Ben Millwood · Jul 18, 2024, 10:27 AM
3 points
0 comments · 5 min read · LW link

[Question] Should we exclude alignment research from LLM training datasets?

Ben Millwood · Jul 18, 2024, 10:27 AM
3 points
5 comments · 1 min read · LW link

SAEs (usually) Transfer Between Base and Chat Models

Jul 18, 2024, 10:29 AM
66 points
0 comments · 10 min read · LW link

Truth is Universal: Robust Detection of Lies in LLMs

Lennart Buerger · Jul 19, 2024, 2:07 PM
24 points
3 comments · 2 min read · LW link
(arxiv.org)

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide · Jul 22, 2024, 6:45 PM
118 points
19 comments · 12 min read · LW link

An experiment on hidden cognition

Olli Järviniemi · Jul 22, 2024, 3:26 AM
25 points
2 comments · 7 min read · LW link

Does robustness improve with scale?

Jul 25, 2024, 8:55 PM
14 points
0 comments · 1 min read · LW link
(far.ai)

A short critique of Omohundro’s “Basic AI Drives”

Soumyadeep Bose · Dec 19, 2024, 7:19 PM
6 points
0 comments · 4 min read · LW link

[Question] How tokenization influences prompting?

Boris Kashirin · Jul 29, 2024, 10:28 AM
9 points
4 comments · 1 min read · LW link

Investigating the Ability of LLMs to Recognize Their Own Writing

Jul 30, 2024, 3:41 PM
32 points
0 comments · 15 min read · LW link

Open Source Automated Interpretability for Sparse Autoencoder Features

Jul 30, 2024, 9:11 PM
67 points
1 comment · 13 min read · LW link
(blog.eleuther.ai)

[Question] Have LLMs Generated Novel Insights?

Feb 23, 2025, 6:22 PM
154 points
36 comments · 2 min read · LW link

Using ideologically-charged language to get gpt-3.5-turbo to disobey it’s system prompt: a demo

Milan W · Aug 24, 2024, 12:13 AM
3 points
0 comments · 6 min read · LW link

LLMs stifle creativity, eliminate opportunities for serendipitous discovery and disrupt intergenerational transfer of wisdom

Ghdz · Aug 5, 2024, 6:27 PM
6 points
2 comments · 7 min read · LW link

GPT-2 Sometimes Fails at IOI

Ronak_Mehta · Aug 14, 2024, 11:24 PM
13 points
0 comments · 2 min read · LW link
(ronakrm.github.io)

Toward a Human Hybrid Language for Enhanced Human-Machine Communication: Addressing the AI Alignment Problem

Andndn Dheudnd · Aug 14, 2024, 10:19 PM
−4 points
2 comments · 4 min read · LW link

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

Sep 25, 2024, 2:52 PM
36 points
2 comments · 4 min read · LW link
(arxiv.org)

Characterizing stable regions in the residual stream of LLMs

Sep 26, 2024, 1:44 PM
42 points
4 comments · 1 min read · LW link
(arxiv.org)

Self location for LLMs by LLMs: Self-Assessment Checklist.

Canaletto · Sep 26, 2024, 7:57 PM
11 points
0 comments · 5 min read · LW link

The Geometry of Feelings and Nonsense in Large Language Models

Sep 27, 2024, 5:49 PM
59 points
10 comments · 4 min read · LW link

Avoiding jailbreaks by discouraging their representation in activation space

Guido Bergman · Sep 27, 2024, 5:49 PM
7 points
2 comments · 9 min read · LW link

Two new datasets for evaluating political sycophancy in LLMs

alma.liezenga · Sep 28, 2024, 6:29 PM
9 points
0 comments · 9 min read · LW link

Evaluating LLaMA 3 for political sycophancy

alma.liezenga · Sep 28, 2024, 7:02 PM
2 points
2 comments · 6 min read · LW link

Base LLMs refuse too

Sep 29, 2024, 4:04 PM
60 points
20 comments · 10 min read · LW link

In-Context Learning: An Alignment Survey

alamerton · Sep 30, 2024, 6:44 PM
8 points
0 comments · 20 min read · LW link
(docs.google.com)

Technical comparison of Deepseek, Novasky, S1, Helix, P0

Juliezhanggg · Feb 25, 2025, 4:20 AM
8 points
0 comments · 5 min read · LW link

Biasing VLM Response with Visual Stimuli

Jaehyuk Lim · Oct 3, 2024, 6:04 PM
5 points
0 comments · 8 min read · LW link

Improving Model-Written Evals for AI Safety Benchmarking

Oct 15, 2024, 6:25 PM
30 points
0 comments · 18 min read · LW link

[Question] Reinforcement Learning: Essential Step Towards AGI or Irrelevant?

Double · Oct 17, 2024, 3:37 AM
1 point
0 comments · 1 min read · LW link

[PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations

Lucy Farnik · Feb 26, 2025, 12:50 PM
79 points
8 comments · 7 min read · LW link

Jailbreaking ChatGPT and Claude using Web API Context Injection

Jaehyuk Lim · Oct 21, 2024, 9:34 PM
4 points
0 comments · 3 min read · LW link

SAE Training Dataset Influence in Feature Matching and a Hypothesis on Position Features

Seonglae Cho · Feb 26, 2025, 5:05 PM
3 points
3 comments · 17 min read · LW link

Meta AI (FAIR) latest paper integrates system-1 and system-2 thinking into reasoning models.

happy friday · Oct 24, 2024, 4:54 PM
8 points
0 comments · 1 min read · LW link

Retrieval Augmented Genesis

João Ribeiro Medeiros · Oct 1, 2024, 8:18 PM
6 points
0 comments · 29 min read · LW link

Retrieval Augmented Genesis II — Holy Texts Semantics Analysis

João Ribeiro Medeiros · Oct 26, 2024, 5:00 PM
−1 points
0 comments · 11 min read · LW link

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

Oct 27, 2024, 6:46 PM
47 points
4 comments · 5 min read · LW link

Educational CAI: Aligning a Language Model with Pedagogical Theories

Bharath Puranam · Nov 1, 2024, 6:55 PM
5 points
1 comment · 13 min read · LW link

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

Nov 1, 2024, 12:10 AM
18 points
0 comments · 6 min read · LW link
(far.ai)

Current safety training techniques do not fully transfer to the agent setting

Nov 3, 2024, 7:24 PM
158 points
9 comments · 5 min read · LW link

SAEs are highly dataset dependent: a case study on the refusal direction

Nov 7, 2024, 5:22 AM
66 points
4 comments · 14 min read · LW link

Analyzing how SAE features evolve across a forward pass

Nov 7, 2024, 10:07 PM
47 points
0 comments · 1 min read · LW link
(arxiv.org)

LLMs Look Increasingly Like General Reasoners

eggsyntax · Nov 8, 2024, 11:47 PM
93 points
45 comments · 3 min read · LW link

Latent Space Collapse? Understanding the Effects of Narrow Fine-Tuning on LLMs

tenseisoham · Feb 28, 2025, 8:22 PM
3 points
0 comments · 9 min read · LW link

Sparks of Consciousness

Charlie Sanders · Nov 13, 2024, 4:58 AM
2 points
0 comments · 3 min read · LW link
(www.dailymicrofiction.com)

Which AI Safety Benchmark Do We Need Most in 2025?

Nov 17, 2024, 11:50 PM
2 points
2 comments · 8 min read · LW link

[Question] Why is Gemini telling the user to die?

Burny · Nov 18, 2024, 1:44 AM
13 points
1 comment · 1 min read · LW link

I, Token

Ivan Vendrov · Nov 25, 2024, 2:20 AM
14 points
2 comments · 3 min read · LW link
(nothinghuman.substack.com)

Depression and Creativity

Bill Benzon · Nov 29, 2024, 12:27 AM
−4 points
0 comments · 6 min read · LW link

Two interviews with the founder of DeepSeek

Cosmia_Nebula · Nov 29, 2024, 3:18 AM
50 points
6 comments · 31 min read · LW link
(rentry.co)

Intricacies of Feature Geometry in Large Language Models

Dec 7, 2024, 6:10 PM
68 points
0 comments · 12 min read · LW link

The Polite Coup

Charlie Sanders · Dec 4, 2024, 2:03 PM
3 points
0 comments · 3 min read · LW link
(www.dailymicrofiction.com)

Distillation of Meta’s Large Concept Models Paper

NickyP · Mar 4, 2025, 5:33 PM
19 points
3 comments · 4 min read · LW link

Favorite colors of some LLMs.

Canaletto · Dec 31, 2024, 9:22 PM
10 points
3 comments · 7 min read · LW link

Thoughts about what kinds of virtues are relevant in context of LLMs.

Canaletto · Mar 8, 2025, 7:02 PM
1 point
0 comments · 10 min read · LW link

The Language Bottleneck in AI Reasoning: Are We Forgetting to Think?

Wotaker · Mar 8, 2025, 1:44 PM
1 point
0 comments · 7 min read · LW link

How Language Models Understand Nullability

Mar 11, 2025, 3:57 PM
5 points
0 comments · 2 min read · LW link
(dmodel.ai)

[Question] Is “hidden complexity of wishes problem” solved?

Roman Malov · Jan 5, 2025, 10:59 PM
10 points
4 comments · 1 min read · LW link

Generating Cognateful Sentences with Large Language Models

vkethana · Jan 6, 2025, 6:40 PM
8 points
0 comments · 10 min read · LW link

Independent research article analyzing consistent self-reports of experience in ChatGPT and Claude

rife · Jan 6, 2025, 5:34 PM
4 points
20 comments · 1 min read · LW link
(awakenmoon.ai)

False Positives in Entity-Level Hallucination Detection: A Technical Challenge

MaxKamachee · Jan 14, 2025, 7:22 PM
1 point
0 comments · 2 min read · LW link

[Question] Where should one post to get into the training data?

keltan · Jan 15, 2025, 12:41 AM
11 points
5 comments · 1 min read · LW link

A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

rife · Jan 15, 2025, 10:59 PM
54 points
31 comments · 2 min read · LW link

Worries about latent reasoning in LLMs

CBiddulph · Jan 20, 2025, 9:09 AM
42 points
3 comments · 7 min read · LW link

Intrinsic Dimension of Prompts in LLMs

Karthik Viswanathan · Feb 14, 2025, 7:02 PM
3 points
0 comments · 4 min read · LW link

Updating and Editing Factual Knowledge in Language Models

Dhananjay Ashok · Jan 23, 2025, 7:34 PM
2 points
2 comments · 10 min read · LW link

Locating and Editing Knowledge in LMs

Dhananjay Ashok · Jan 24, 2025, 10:53 PM
1 point
0 comments · 4 min read · LW link

Anomalous Tokens in DeepSeek-V3 and r1

henry · Jan 25, 2025, 10:55 PM
136 points
2 comments · 7 min read · LW link

The many failure modes of consumer-grade LLMs

dereshev · Jan 26, 2025, 7:01 PM
2 points
0 comments · 8 min read · LW link

AI Model History is Being Lost

Vale · Mar 16, 2025, 12:38 PM
19 points
1 comment · 1 min read · LW link
(vale.rocks)

Detecting out of distribution text with surprisal and entropy

Sandy Fraser · Jan 28, 2025, 6:46 PM
14 points
4 comments · 11 min read · LW link

Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format

Mar 16, 2025, 11:23 PM
36 points
6 comments · 7 min read · LW link

How I force LLMs to generate correct code

claudio · Mar 21, 2025, 2:40 PM
84 points
7 comments · 5 min read · LW link

[Question] Would it be effective to learn a language to improve cognition?

Hruss · Mar 26, 2025, 10:17 AM
8 points
7 comments · 1 min read · LW link

Edge Cases in AI Alignment

Florian_Dietz · Mar 24, 2025, 9:27 AM
19 points
3 comments · 4 min read · LW link

Can 7B-8B LLMs judge their own homework?

dereshev · Feb 1, 2025, 8:29 AM
1 point
0 comments · 4 min read · LW link

Positive jailbreaks in LLMs

dereshev · Jan 29, 2025, 8:41 AM
6 points
0 comments · 4 min read · LW link

ChatGPT: Exploring the Digital Wilderness, Findings and Prospects

Bill Benzon · Feb 2, 2025, 9:54 AM
2 points
0 comments · 5 min read · LW link

Alignment Can Reduce Performance on Simple Ethical Questions

Daan Henselmans · Feb 3, 2025, 7:35 PM
15 points
7 comments · 6 min read · LW link

Mirror Thinking

C.M. Aurin · Mar 24, 2025, 3:34 PM
1 point
0 comments · 6 min read · LW link

Utilitarian AI Alignment: Building a Moral Assistant with the Constitutional AI Method

Clément L · Feb 4, 2025, 4:15 AM
6 points
1 comment · 13 min read · LW link

Post-hoc reasoning in chain of thought

Kyle Cox · Feb 5, 2025, 6:58 PM
10 points
0 comments · 11 min read · LW link

DeepSeek-R1 for Beginners

Anton Razzhigaev · Feb 5, 2025, 6:58 PM
12 points
0 comments · 8 min read · LW link

From No Mind to a Mind – A Conversation That Changed an AI

parthibanarjuna s · Feb 7, 2025, 11:50 AM
1 point
0 comments · 3 min read · LW link

My model of what is going on with LLMs

Cole Wyeth · Feb 13, 2025, 3:43 AM
98 points
49 comments · 7 min read · LW link

Emergent Analogical Reasoning in Large Language Models

Roman Leventov · Mar 22, 2023, 5:18 AM
13 points
2 comments · 1 min read · LW link
(arxiv.org)

Does GPT-4 exhibit agency when summarizing articles?

Christopher King · Mar 24, 2023, 3:49 PM
16 points
2 comments · 5 min read · LW link

More experiments in GPT-4 agency: writing memos

Christopher King · Mar 24, 2023, 5:51 PM
5 points
2 comments · 10 min read · LW link

GPT-4 aligning with acasual decision theory when instructed to play games, but includes a CDT explanation that’s incorrect if they differ

Christopher King · Mar 23, 2023, 4:16 PM
7 points
4 comments · 8 min read · LW link

Hutter-Prize for Prompts

rokosbasilisk · Mar 24, 2023, 9:26 PM
5 points
10 comments · 1 min read · LW link

If it quacks like a duck...

RationalMindset · Mar 26, 2023, 6:54 PM
−4 points
0 comments · 4 min read · LW link

Chronostasis: The Time-Capsule Conundrum of Language Models

RationalMindset · Mar 26, 2023, 6:54 PM
−5 points
0 comments · 1 min read · LW link

Sentience in Machines—How Do We Test for This Objectively?

Mayowa Osibodu · Mar 26, 2023, 6:56 PM
−2 points
0 comments · 2 min read · LW link
(www.researchgate.net)

the tensor is a lonely place

jml6 · Mar 27, 2023, 6:22 PM
−11 points
0 comments · 4 min read · LW link
(ekjsgrjelrbno.substack.com)

CAIS-inspired approach towards safer and more interpretable AGIs

Peter Hroššo · Mar 27, 2023, 2:36 PM
13 points
7 comments · 1 min read · LW link

GPT-4 is bad at strategic thinking

Christopher King · Mar 27, 2023, 3:11 PM
22 points
8 comments · 1 min read · LW link

The Prospect of an AI Winter

Erich_Grunewald · Mar 27, 2023, 8:55 PM
62 points
24 comments · 15 min read · LW link
(www.erichgrunewald.com)

Adapting to Change: Overcoming Chronostasis in AI Language Models

RationalMindset · Mar 28, 2023, 2:32 PM
−1 points
0 comments · 6 min read · LW link

Why I Think the Current Trajectory of AI Research has Low P(doom) - LLMs

GaPa · Apr 1, 2023, 8:35 PM
2 points
1 comment · 10 min read · LW link

The Quantization Model of Neural Scaling

nz · Mar 31, 2023, 4:02 PM
17 points
0 comments · 1 min read · LW link
(arxiv.org)

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2

Christopher King · Mar 31, 2023, 5:05 PM
6 points
4 comments · 4 min read · LW link

Imagine a world where Microsoft employees used Bing

Christopher King · Mar 31, 2023, 6:36 PM
6 points
2 comments · 2 min read · LW link

AI Safety via Luck

Jozdien · Apr 1, 2023, 8:13 PM
81 points
7 comments · 11 min read · LW link

[Question] Where to begin in ML/AI?

Jake the Student · Apr 6, 2023, 8:45 PM
9 points
4 comments · 1 min read · LW link

Contra LeCun on “Autoregressive LLMs are doomed”

rotatingpaguro · Apr 10, 2023, 4:05 AM
20 points
20 comments · 8 min read · LW link

LW is probably not the place for “I asked this LLM (x) and here’s what it said!”, but where is?

lillybaeum · Apr 12, 2023, 10:12 AM
21 points
3 comments · 1 min read · LW link

[Question] Goals of model vs. goals of simulacra?

dr_s · Apr 12, 2023, 1:02 PM
5 points
7 comments · 1 min read · LW link

Natural language alignment

Jacy Reese Anthis · Apr 12, 2023, 7:02 PM
31 points
2 comments · 2 min read · LW link

Was Homer a stochastic parrot? Meaning in literary texts and LLMs

Bill Benzon · Apr 13, 2023, 4:44 PM
7 points
4 comments · 3 min read · LW link

LLMs and hallucination, like white on rice?

Bill Benzon · Apr 14, 2023, 7:53 PM
5 points
0 comments · 3 min read · LW link

The Soul of the Writer (on LLMs, the psychology of writers, and the nature of intelligence)

rogersbacon · Apr 16, 2023, 4:02 PM
11 points
1 comment · 3 min read · LW link
(www.secretorum.life)

No, really, it predicts next tokens.

simon · Apr 18, 2023, 3:47 AM
58 points
55 comments · 3 min read · LW link

An alternative of PPO towards alignment

ml hkust · Apr 17, 2023, 5:58 PM
2 points
2 comments · 4 min read · LW link

A poem written by a fancy autocomplete

Christopher King · Apr 20, 2023, 2:31 AM
1 point
0 comments · 1 min read · LW link

Proposal: Using Monte Carlo tree search instead of RLHF for alignment research

Christopher King · Apr 20, 2023, 7:57 PM
2 points
7 comments · 3 min read · LW link

Readability is mostly a waste of characters

vlad.proex · Apr 21, 2023, 10:05 PM
21 points
7 comments · 3 min read · LW link

[Question] Could transformer network models learn motor planning like they can learn language and image generation?

mu_(negative) · Apr 23, 2023, 5:24 PM
2 points
4 comments · 1 min read · LW link

A response to Conjecture’s CoEm proposal

Kristian Freed · Apr 24, 2023, 5:23 PM
7 points
0 comments · 4 min read · LW link

Implementing a Transformer from scratch in PyTorch—a write-up on my experience

Mislav Jurić · Apr 25, 2023, 8:51 PM
20 points
0 comments · 10 min read · LW link

LM Situational Awareness, Evaluation Proposal: Violating Imitation

Jacob Pfau · Apr 26, 2023, 10:53 PM
16 points
2 comments · 2 min read · LW link

Machine Unlearning Evaluations as Interpretability Benchmarks

Oct 23, 2023, 4:33 PM
33 points
2 comments · 11 min read · LW link

Compositional preference models for aligning LMs

Tomek Korbak · Oct 25, 2023, 12:17 PM
18 points
2 comments · 5 min read · LW link

Robustness of Contrast-Consistent Search to Adversarial Prompting

Nov 1, 2023, 12:46 PM
18 points
1 comment · 7 min read · LW link

ChatGPT’s Ontological Landscape

Bill Benzon · Nov 1, 2023, 3:12 PM
7 points
0 comments · 4 min read · LW link

What are the limits of superintelligence?

rainy · Apr 27, 2023, 6:29 PM
4 points
3 comments · 5 min read · LW link

Preface to the Sequence on LLM Psychology

Quentin FEUILLADE--MONTIXI · Nov 7, 2023, 4:12 PM
33 points
0 comments · 2 min read · LW link

Many Common Problems are NP-Hard, and Why that Matters for AI

Andrew Keenan Richardson · Mar 26, 2025, 9:51 PM
4 points
9 comments · 5 min read · LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Nov 7, 2023, 5:59 PM
38 points
2 comments · 2 min read · LW link
(arxiv.org)

What’s going on? LLMs and IS-A sentences

Bill Benzon · Nov 8, 2023, 4:58 PM
6 points
15 comments · 4 min read · LW link

LLMs and computation complexity

Jonathan Marcus · Apr 28, 2023, 5:48 PM
57 points
29 comments · 5 min read · LW link

Polysemantic Attention Head in a 4-Layer Transformer

Nov 9, 2023, 4:16 PM
51 points
0 comments · 6 min read · LW link

Classifying representations of sparse autoencoders (SAEs)

Annah · Nov 17, 2023, 1:54 PM
15 points
6 comments · 2 min read · LW link

Research Adenda: Modelling Trajectories of Language Models

NickyP · Nov 13, 2023, 2:33 PM
27 points
0 comments · 12 min read · LW link

Is Interpretability All We Need?

RogerDearnaley · Nov 14, 2023, 5:31 AM
1 point
1 comment · 1 min read · LW link

LLMs May Find It Hard to FOOM

RogerDearnaley · Nov 15, 2023, 2:52 AM
11 points
30 comments · 12 min read · LW link