Hoagy

Karma: 1,068

Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub

Mar 13, 2025, 7:18 PM

138 points

15 comments · 13 min read · LW link

Some additional SAE thoughts

Hoagy

Jan 13, 2024, 7:31 PM

31 points

4 comments · 13 min read · LW link

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

Logan Riggs, Hoagy, Aidan Ewart and Robert_AIZI

Sep 21, 2023, 3:30 PM

159 points

8 comments · 5 min read · LW link

AutoInterpretation Finds Sparse Coding Beats Alternatives

Hoagy

Jul 17, 2023, 1:41 AM

57 points

1 comment · 7 min read · LW link

[Replication] Conjecture’s Sparse Coding in Small Transformers

Hoagy and Logan Riggs

Jun 16, 2023, 6:02 PM

52 points

0 comments · 5 min read · LW link

[Replication] Conjecture’s Sparse Coding in Toy Models

Hoagy and Logan Riggs

Jun 2, 2023, 5:34 PM

24 points

0 comments · 1 min read · LW link

Universality and Hidden Information in Concept Bottleneck Models

Hoagy

Apr 5, 2023, 2:00 PM

23 points

0 comments · 11 min read · LW link

Nokens: A potential method of investigating glitch tokens

Hoagy

Mar 15, 2023, 4:23 PM

21 points

0 comments · 4 min read · LW link

Automating Consistency

Hoagy

Feb 17, 2023, 1:24 PM

10 points

0 comments · 1 min read · LW link

Distilled Representations Research Agenda

Hoagy and mishajw

Oct 18, 2022, 8:59 PM

15 points

2 comments · 8 min read · LW link

Remaking EfficientZero (as best I can)

Hoagy

Jul 4, 2022, 11:03 AM

36 points

9 comments · 22 min read · LW link

Note-Taking without Hidden Messages

Hoagy

Apr 30, 2022, 11:15 AM

17 points

2 comments · 4 min read · LW link

ELK Sub—Note-taking in internal rollouts

Hoagy

Mar 9, 2022, 5:23 PM

6 points

0 comments · 5 min read · LW link

Automated Fact Checking: A Look at the Field

Hoagy

Oct 6, 2021, 11:52 PM

12 points

0 comments · 8 min read · LW link

Hoagy’s Shortform

Hoagy

Sep 21, 2020, 10:00 PM

3 points

12 comments · LW link

Safe Scrambling?

Hoagy

Aug 29, 2020, 2:31 PM

3 points

1 comment · 2 min read · LW link

When do utility functions constrain?

Hoagy

Aug 23, 2019, 5:19 PM

30 points

8 comments · 7 min read · LW link