
Johannes Treutlein

Karma: 1,569

All opinions are my own. Homepage: johannestreutlein.com

Building and evaluating alignment auditing agents

Jul 24, 2025, 7:22 PM
46 points
1 comment · 5 min read · LW link

Modifying LLM Beliefs with Synthetic Document Finetuning

Apr 24, 2025, 9:15 PM
70 points
12 comments · 2 min read · LW link
(alignment.anthropic.com)

Auditing language models for hidden objectives

Mar 13, 2025, 7:18 PM
141 points
15 comments · 13 min read · LW link

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
489 points
75 comments · 10 min read · LW link

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Jun 21, 2024, 3:54 PM
163 points
13 comments · 8 min read · LW link
(arxiv.org)

Report on modeling evidential cooperation in large worlds

Johannes Treutlein · Jul 12, 2023, 4:37 PM
45 points
3 comments · 1 min read · LW link
(arxiv.org)