Chain-of-Thought Alignment

“Chain-of-thought” autonomous agentic wrappers around an LLM, such as AutoGPT wrapped around GPT-4, and similar Language Model Cognitive Architectures (LMCAs) (other commonly used terms are Language Model Autonomous Agents (LMAAs) and Scaffolded LLMs) are a recent candidate approach to building an AGI.

They create, edit, and maintain a natural-language context by recursively feeding parts of it back into the LLM, along with suitable prompts for activities like subtask planning, self-criticism, and memory summarization, generating a textual stream of consciousness, memories, and so on. They thus combine LLM neural nets with natural-language symbolic thinking more along the lines of GOFAI.
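
As a loose illustration of this loop, here is a minimal Python sketch of a scaffold of this kind. The names `run_agent` and `complete` are hypothetical: `complete` stands in for any LLM text-completion call, and the prompts are illustrative rather than taken from any particular system.

```python
# Minimal sketch of a scaffolded-LLM ("LMCA") loop. All agent state -- the plan,
# the running transcript, and the memories -- is kept as natural-language text
# and fed back into the LLM with different prompts.
# `complete` is a hypothetical placeholder for any text-completion call.

from typing import Callable, List

def run_agent(goal: str, complete: Callable[[str], str], steps: int = 5) -> List[str]:
    memory: List[str] = []      # natural-language memory entries
    transcript: List[str] = []  # human-readable "stream of consciousness"

    plan = complete(f"Goal: {goal}\nBreak this goal into a short numbered list of subtasks.")
    transcript.append(f"PLAN:\n{plan}")

    for _ in range(steps):
        recent = "\n".join(memory[-5:])  # only recent memories fit into the prompt
        thought = complete(
            f"Goal: {goal}\nPlan:\n{plan}\nRecent memory:\n{recent}\n"
            "What should be done next? Think step by step."
        )
        critique = complete(f"Criticise this proposed step and revise it if needed:\n{thought}")
        transcript.append(f"THOUGHT:\n{thought}\nCRITIQUE:\n{critique}")

        # Summarization keeps the context window small while preserving a readable record.
        memory.append(complete(f"Summarise in one sentence what was just decided:\n{critique}"))

    return transcript
```

Everything the agent “thinks” ends up in `transcript` and `memory` as plain text, which is what the interpretability argument below relies on.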

Recent open-source examples are quite simple and not particularly capable, but it seems rather plausible that they could progress rapidly. They could also make interpretability much easier than in pure neural net systems, since their ‘chain of thought’/‘stream of consciousness’ and ‘memories’ would be written in human natural language, and so could be read and edited by a monitoring human or an LLM-based monitoring system (modulo concerns about opaque natural language, or about detecting steganographic side-channels hidden in apparently innocent natural language). This topic covers the alignment problem for systems combining such agentic wrappers with LLMs, if they do in fact prove capable of approaching or reaching AGI.
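
A correspondingly minimal sketch of that kind of oversight, under the same assumptions (a hypothetical `complete` call, an illustrative YES/NO prompt, a hypothetical `monitor_transcript` helper): each transcript entry is screened by a second model before the agent acts on it, and flagged entries can be escalated to a human or used to halt the agent.

```python
# Hedged sketch of LLM-based oversight of an agent's natural-language transcript.
# `complete` is a hypothetical placeholder for any text-completion call.

from typing import Callable, List

def monitor_transcript(transcript: List[str], complete: Callable[[str], str]) -> List[str]:
    flagged: List[str] = []
    for entry in transcript:
        verdict = complete(
            "You are a safety monitor. Does the following reasoning contain "
            "deception, harmful plans, or attempts to hide information? "
            "Answer YES or NO, then explain briefly.\n\n" + entry
        )
        if verdict.strip().upper().startswith("YES"):
            flagged.append(entry)  # escalate to a human reviewer, or halt the agent
    return flagged
```

Of course, this only helps to the extent the transcript is faithful and not steganographically encoded, which is exactly the caveat noted above.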

See Also

Capabilities and alignment of LLM cognitive architectures

Seth Herd · Apr 18, 2023, 4:29 PM
88 points
18 comments · 20 min read · LW link

Alignment of AutoGPT agents

Ozyrus · Apr 12, 2023, 12:54 PM
14 points
1 comment · 4 min read · LW link

Externalized reasoning oversight: a research direction for language model alignment

tamera · Aug 3, 2022, 12:03 PM
136 points
23 comments · 6 min read · LW link

Scaffolded LLMs: Less Obvious Concerns

Stephen Fowler · Jun 16, 2023, 10:39 AM
34 points
15 comments · 14 min read · LW link

the case for CoT unfaithfulness is overstated

nostalgebraist · Sep 29, 2024, 10:07 PM
259 points
43 comments · 11 min read · LW link

Language Agents Reduce the Risk of Existential Catastrophe

May 28, 2023, 7:10 PM
39 points
14 comments · 26 min read · LW link

5 ways to improve CoT faithfulness

Caleb Biddulph · Oct 5, 2024, 8:17 PM
42 points
40 comments · 6 min read · LW link

[ASoT] Simulators show us behavioural properties by default

Jozdien · Jan 13, 2023, 6:42 PM
36 points
3 comments · 3 min read · LW link

Language Models are a Potentially Safe Path to Human-Level AGI

Nadav Brandes · Apr 20, 2023, 12:40 AM
28 points
7 comments · 8 min read · LW link · 1 review

Unfaithful Explanations in Chain-of-Thought Prompting

Miles Turpin · Jun 3, 2023, 12:22 AM
42 points
8 comments · 7 min read · LW link

Thinking LLMs: General Instruction Following with Thought Generation

Bogdan Ionut Cirstea · Oct 15, 2024, 9:21 AM
7 points
0 comments · 1 min read · LW link
(arxiv.org)

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Bogdan Ionut Cirstea · Sep 19, 2024, 4:13 PM
21 points
1 comment · 1 min read · LW link
(arxiv.org)

System 2 Alignment

Seth Herd · Feb 13, 2025, 7:17 PM
35 points
0 comments · 22 min read · LW link

AI CoT Reasoning Is Often Unfaithful

Zvi · Apr 4, 2025, 2:50 PM
66 points
4 comments · 7 min read · LW link
(thezvi.wordpress.com)

[Question] Should AutoGPT update us towards researching IDA?

Michaël Trazzi · Apr 12, 2023, 4:41 PM
15 points
5 comments · 1 min read · LW link

A Little Depth Goes a Long Way: the Expressive Power of Log-Depth Transformers

Bogdan Ionut Cirstea · Nov 20, 2024, 11:48 AM
16 points
0 comments · 1 min read · LW link
(openreview.net)

Internal independent review for language model agent alignment

Seth Herd · Jul 7, 2023, 6:54 AM
55 points
30 comments · 11 min read · LW link

Sleep peacefully: no hidden reasoning detected in LLMs. Well, at least in small ones.

Apr 4, 2025, 8:49 PM
16 points
2 comments · 7 min read · LW link

An explanation for every token: using an LLM to sample another LLM

Max H · Oct 11, 2023, 12:53 AM
35 points
5 comments · 11 min read · LW link

LLM AGI will have memory, and memory changes alignment

Seth Herd · Apr 4, 2025, 2:59 PM
70 points
13 comments · 9 min read · LW link

Seven sources of goals in LLM agents

Seth Herd · Feb 8, 2025, 9:54 PM
22 points
3 comments · 2 min read · LW link

Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?

Bogdan Ionut Cirstea · Nov 26, 2024, 9:58 AM
9 points
0 comments · 1 min read · LW link
(arxiv.org)

Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic

Orpheus16 · Dec 20, 2022, 9:39 PM
18 points
2 comments · 11 min read · LW link

Steganography in Chain of Thought Reasoning

A Ray · Aug 8, 2022, 3:47 AM
62 points
13 comments · 6 min read · LW link

LLMs Do Not Think Step-by-step In Implicit Reasoning

Bogdan Ionut Cirstea · Nov 28, 2024, 9:16 AM
11 points
0 comments · 1 min read · LW link
(arxiv.org)

We should start looking for scheming “in the wild”

Marius Hobbhahn · Mar 6, 2025, 1:49 PM
89 points
4 comments · 5 min read · LW link

On AutoGPT

Zvi · Apr 13, 2023, 12:30 PM
248 points
47 comments · 20 min read · LW link
(thezvi.wordpress.com)

Shane Legg interview on alignment

Seth Herd · Oct 28, 2023, 7:28 PM
66 points
20 comments · 2 min read · LW link
(www.youtube.com)

We have promising alignment plans with low taxes

Seth Herd · Nov 10, 2023, 6:51 PM
44 points
9 comments · 5 min read · LW link

An idea for avoiding neuralese architectures

Knight Lee · Apr 3, 2025, 10:23 PM
7 points
2 comments · 4 min read · LW link

Simulators, constraints, and goal agnosticism: porbynotes vol. 1

porby · Nov 23, 2022, 4:22 AM
37 points
2 comments · 35 min read · LW link

CAIS-inspired approach towards safer and more interpretable AGIs

Peter Hroššo · Mar 27, 2023, 2:36 PM
13 points
7 comments · 1 min read · LW link

Shapley Value Attribution in Chain of Thought

leogao · Apr 14, 2023, 5:56 AM
106 points
7 comments · 4 min read · LW link

Automating Consistency

Hoagy · Feb 17, 2023, 1:24 PM
10 points
0 comments · 1 min read · LW link

GPT-4 implicitly values identity preservation: a study of LMCA identity management

Ozyrus · May 17, 2023, 2:13 PM
21 points
4 comments · 13 min read · LW link

Creating a self-referential system prompt for GPT-4

Ozyrus · May 17, 2023, 2:13 PM
3 points
1 comment · 3 min read · LW link

Aligned AI via monitoring objectives in AutoGPT-like systems

Paul Colognese · May 24, 2023, 3:59 PM
27 points
4 comments · 4 min read · LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley · May 25, 2023, 9:26 AM
33 points
3 comments · 15 min read · LW link

Philosophical Cyborg (Part 2)...or, The Good Successor

ukc10014 · Jun 21, 2023, 3:43 PM
21 points
1 comment · 31 min read · LW link

On the Implications of Recent Results on Latent Reasoning in LLMs

Rauno Arike · Mar 31, 2025, 11:06 AM
31 points
6 comments · 13 min read · LW link

Measuring and Improving the Faithfulness of Model-Generated Reasoning

Jul 18, 2023, 4:36 PM
111 points
15 comments · 6 min read · LW link · 1 review

Distilled Representations Research Agenda

Oct 18, 2022, 8:59 PM
15 points
2 comments · 8 min read · LW link

Paper: Large Language Models Can Self-improve [Linkpost]

Evan R. Murphy · Oct 2, 2022, 1:29 AM
52 points
15 comments · 1 min read · LW link
(openreview.net)

The Translucent Thoughts Hypotheses and Their Implications

Fabien Roger · Mar 9, 2023, 4:30 PM
142 points
7 comments · 19 min read · LW link

Understanding Hidden Computations in Chain-of-Thought Reasoning

rokosbasilisk · Aug 24, 2024, 4:35 PM
6 points
1 comment · 1 min read · LW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers · Feb 9, 2024, 6:40 PM
6 points
12 comments · 3 min read · LW link

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Miles Turpin · Mar 11, 2024, 11:46 PM
16 points
0 comments · 1 min read · LW link
(arxiv.org)

Language and Capabilities: Testing LLM Mathematical Abilities Across Languages

Ethan Edwards · Apr 4, 2024, 1:18 PM
24 points
2 comments · 36 min read · LW link

[Question] What faithfulness metrics should general claims about CoT faithfulness be based upon?

Rauno Arike · Apr 8, 2025, 3:27 PM
24 points
0 comments · 4 min read · LW link

Whirlwind Tour of Chain of Thought Literature Relevant to Automating Alignment Research.

sevdeawesome · Jul 1, 2024, 5:50 AM
25 points
0 comments · 17 min read · LW link

Testing which LLM architectures can do hidden serial reasoning

Filip Sondej · Dec 16, 2024, 1:48 PM
81 points
9 comments · 4 min read · LW link

AI Alignment and the Quest for Artificial Wisdom

Myspy · Jul 12, 2024, 9:34 PM
1 point
0 comments · 13 min read · LW link

Simple Steganographic Computation Eval—gpt-4o and gemini-exp-1206 can’t solve it yet

Filip Sondej · Dec 19, 2024, 3:47 PM
14 points
2 comments · 3 min read · LW link

AGI with RL is Bad News for Safety

Nadav Brandes · Dec 21, 2024, 7:36 PM
19 points
22 comments · 2 min read · LW link

Reduce AI Self-Allegiance by saying “he” instead of “I”

Knight Lee · Dec 23, 2024, 9:32 AM
10 points
4 comments · 2 min read · LW link

Meta AI (FAIR) latest paper integrates system-1 and system-2 thinking into reasoning models.

happy friday · Oct 24, 2024, 4:54 PM
8 points
0 comments · 1 min read · LW link

~80 Interesting Questions about Foundation Model Agent Safety

Oct 28, 2024, 4:37 PM
46 points
4 comments · 15 min read · LW link

The Language Bottleneck in AI Reasoning: Are We Forgetting to Think?

Wotaker · Mar 8, 2025, 1:44 PM
1 point
0 comments · 7 min read · LW link

Measuring Beliefs of Language Models During Chain-of-Thought Reasoning

Apr 18, 2025, 10:56 PM
8 points
0 comments · 13 min read · LW link

When the Model Starts Talking Like Me: A User-Induced Structural Adaptation Case Study

Junxi · Apr 19, 2025, 7:40 PM
3 points
1 comment · 4 min read · LW link

Inference-Time-Compute: More Faithful? A Research Note

Jan 15, 2025, 4:43 AM
69 points
10 comments · 11 min read · LW link

Worries about latent reasoning in LLMs

Caleb Biddulph · Jan 20, 2025, 9:09 AM
42 points
6 comments · 7 min read · LW link

Finding an Error-Detection Feature in DeepSeek-R1

keith_wynroe · Apr 24, 2025, 4:03 PM
15 points
0 comments · 7 min read · LW link

Post-hoc reasoning in chain of thought

Kyle Cox · Feb 5, 2025, 6:58 PM
16 points
0 comments · 11 min read · LW link

DeepSeek-R1 for Beginners

Anton Razzhigaev · Feb 5, 2025, 6:58 PM
12 points
0 comments · 8 min read · LW link

Imitation Learning from Language Feedback

Mar 30, 2023, 2:11 PM
71 points
3 comments · 10 min read · LW link