Hoagy
Karma: 1,068
Auditing language models for hidden objectives
Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub · Mar 13, 2025, 7:18 PM · 138 points · 15 comments · 13 min read
Some additional SAE thoughts
Hoagy · Jan 13, 2024, 7:31 PM · 31 points · 4 comments · 13 min read
Sparse Autoencoders Find Highly Interpretable Directions in Language Models
Logan Riggs, Hoagy, Aidan Ewart and Robert_AIZI · Sep 21, 2023, 3:30 PM · 159 points · 8 comments · 5 min read
AutoInterpretation Finds Sparse Coding Beats Alternatives
Hoagy · Jul 17, 2023, 1:41 AM · 57 points · 1 comment · 7 min read
[Replication] Conjecture’s Sparse Coding in Small Transformers
Hoagy and Logan Riggs · Jun 16, 2023, 6:02 PM · 52 points · 0 comments · 5 min read
[Replication] Conjecture’s Sparse Coding in Toy Models
Hoagy and Logan Riggs · Jun 2, 2023, 5:34 PM · 24 points · 0 comments · 1 min read
Universality and Hidden Information in Concept Bottleneck Models
Hoagy · Apr 5, 2023, 2:00 PM · 23 points · 0 comments · 11 min read
Nokens: A potential method of investigating glitch tokens
Hoagy · Mar 15, 2023, 4:23 PM · 21 points · 0 comments · 4 min read
Automating Consistency
Hoagy · Feb 17, 2023, 1:24 PM · 10 points · 0 comments · 1 min read
Distilled Representations Research Agenda
Hoagy and mishajw · Oct 18, 2022, 8:59 PM · 15 points · 2 comments · 8 min read
Remaking EfficientZero (as best I can)
Hoagy · Jul 4, 2022, 11:03 AM · 36 points · 9 comments · 22 min read
Note-Taking without Hidden Messages
Hoagy · Apr 30, 2022, 11:15 AM · 17 points · 2 comments · 4 min read
ELK Sub—Note-taking in internal rollouts
Hoagy · Mar 9, 2022, 5:23 PM · 6 points · 0 comments · 5 min read
Automated Fact Checking: A Look at the Field
Hoagy · Oct 6, 2021, 11:52 PM · 12 points · 0 comments · 8 min read
Hoagy’s Shortform
Hoagy · Sep 21, 2020, 10:00 PM · 3 points · 12 comments
Safe Scrambling?
Hoagy · Aug 29, 2020, 2:31 PM · 3 points · 1 comment · 2 min read
When do utility functions constrain?
Hoagy · Aug 23, 2019, 5:19 PM · 30 points · 8 comments · 7 min read