Vlad Mikulik
Karma: 744
Reasoning models don’t always say what they think
Joe Benton, Ethan Perez, Vlad Mikulik and Fabien Roger
Apr 9, 2025, 7:48 PM · 25 points · 4 comments · 1 min read · (www.anthropic.com)
Automated Researchers Can Subtly Sandbag
gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez and Fabien Roger
Mar 26, 2025, 7:13 PM · 41 points · 0 comments · 4 min read · (alignment.anthropic.com)
Discussion: Challenges with Unsupervised LLM Knowledge Discovery
Seb Farquhar, Vikrant Varma, zac_kenton, gasteigerjo, Vlad Mikulik and Rohin Shah
Dec 18, 2023, 11:58 AM · 147 points · 21 comments · 10 min read
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Neel Nanda, Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah and Vlad Mikulik
Jul 20, 2023, 10:50 AM · 44 points · 3 comments · 2 min read · (arxiv.org)
Specification gaming: the flip side of AI ingenuity
Vika, Vlad Mikulik, Matthew Rahtz, tom4everitt, Zac Kenton and janleike
May 6, 2020, 11:51 PM · 66 points · 9 comments · 6 min read
Utility ≠ Reward
Vlad Mikulik
Sep 5, 2019, 5:28 PM · 131 points · 24 comments · 1 min read · 2 reviews
2-D Robustness
Vlad Mikulik
Aug 30, 2019, 8:27 PM · 85 points · 8 comments · 2 min read
Risks from Learned Optimization: Conclusion and Related Work
evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant
Jun 7, 2019, 7:53 PM · 82 points · 5 comments · 6 min read
Deceptive Alignment
evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant
Jun 5, 2019, 8:16 PM · 118 points · 20 comments · 17 min read
The Inner Alignment Problem
evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant
Jun 4, 2019, 1:20 AM · 104 points · 17 comments · 13 min read
Conditions for Mesa-Optimization
evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant
Jun 1, 2019, 8:52 PM · 84 points · 48 comments · 12 min read
Risks from Learned Optimization: Introduction
evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant
May 31, 2019, 11:44 PM · 187 points · 42 comments · 12 min read · 3 reviews
Clarifying Consequentialists in the Solomonoff Prior
Vlad Mikulik
Jul 11, 2018, 2:35 AM · 20 points · 16 comments · 6 min read