
Vlad Mikulik

Karma: 744

Reasoning models don't always say what they think

Apr 9, 2025, 7:48 PM
25 points
4 comments · 1 min read · LW link
(www.anthropic.com)

Automated Researchers Can Subtly Sandbag

Mar 26, 2025, 7:13 PM
41 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Dec 18, 2023, 11:58 AM
147 points
21 comments · 10 min read · LW link

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Jul 20, 2023, 10:50 AM
44 points
3 comments · 2 min read · LW link
(arxiv.org)

Specification gaming: the flip side of AI ingenuity

May 6, 2020, 11:51 PM
66 points
9 comments · 6 min read · LW link

Utility ≠ Reward

Vlad Mikulik · Sep 5, 2019, 5:28 PM
131 points
24 comments · 1 min read · LW link · 2 reviews

2-D Robustness

Vlad Mikulik · Aug 30, 2019, 8:27 PM
85 points
8 comments · 2 min read · LW link

Risks from Learned Optimization: Conclusion and Related Work

Jun 7, 2019, 7:53 PM
82 points
5 comments · 6 min read · LW link

Deceptive Alignment

Jun 5, 2019, 8:16 PM
118 points
20 comments · 17 min read · LW link

The Inner Alignment Problem

Jun 4, 2019, 1:20 AM
104 points
17 comments · 13 min read · LW link

Conditions for Mesa-Optimization

Jun 1, 2019, 8:52 PM
84 points
48 comments · 12 min read · LW link

Risks from Learned Optimization: Introduction

May 31, 2019, 11:44 PM
187 points
42 comments · 12 min read · LW link · 3 reviews

Clarifying Consequentialists in the Solomonoff Prior

Vlad Mikulik · Jul 11, 2018, 2:35 AM
20 points
16 comments · 6 min read · LW link