
William_S (William Saunders)

Karma: 1,756

I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight as part of the team developing critiques, a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to manage a team of 4 people working on understanding language model features in context, which led to the release of an open source “transformer debugger” tool.
I resigned from OpenAI on February 15, 2024.

Principles for the AGI Race

William_S · 30 Aug 2024 14:29 UTC
242 points
13 comments · 18 min read · LW link

Transformer Circuit Faithfulness Metrics Are Not Robust

12 Jul 2024 3:47 UTC
98 points
5 comments · 7 min read · LW link
(arxiv.org)

William_S’s Shortform

William_S · 22 Mar 2023 18:13 UTC
5 points
39 comments · 1 min read · LW link

Thoughts on refusing harmful requests to large language models

William_S · 19 Jan 2023 19:49 UTC
32 points
4 comments · 2 min read · LW link

Prize for Alignment Research Tasks

29 Apr 2022 8:57 UTC
64 points
38 comments · 10 min read · LW link

[Question] Is there an intuitive way to explain how much better superforecasters are than regular forecasters?

William_S · 19 Feb 2020 1:07 UTC
16 points
5 comments · 1 min read · LW link

Machine Learning Projects on IDA

24 Jun 2019 18:38 UTC
49 points
3 comments · 2 min read · LW link

Reinforcement Learning in the Iterated Amplification Framework

William_S · 9 Feb 2019 0:56 UTC
25 points
12 comments · 4 min read · LW link

HCH is not just Mechanical Turk

William_S · 9 Feb 2019 0:46 UTC
42 points
6 comments · 3 min read · LW link

Amplification Discussion Notes

William_S · 1 Jun 2018 19:03 UTC
17 points
3 comments · 3 min read · LW link

Understanding Iterated Distillation and Amplification: Claims and Oversight

William_S · 17 Apr 2018 22:36 UTC
35 points
30 comments · 9 min read · LW link

Improbable Oversight, An Attempt at Informed Oversight

William_S · 24 May 2017 17:43 UTC
3 points
9 comments · 1 min read · LW link
(william-r-s.github.io)

Informed Oversight through Generalizing Explanations

William_S · 24 May 2017 17:43 UTC
2 points
0 comments · 1 min read · LW link
(william-r-s.github.io)

Proposal for an Implementable Toy Model of Informed Oversight

William_S · 24 May 2017 17:43 UTC
2 points
1 comment · 1 min read · LW link
(william-r-s.github.io)