William_S

Karma: 1,855

I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight as part of the team developing critiques, a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to manage a team of 4 people working to understand language model features in context, which led to the release of an open-source “transformer debugger” tool.
I resigned from OpenAI on February 15, 2024.

Principles for the AGI Race

William_S · Aug 30, 2024, 2:29 PM
246 points
17 comments · 18 min read · LW link

Transformer Circuit Faithfulness Metrics Are Not Robust

Jul 12, 2024, 3:47 AM
104 points
5 comments · 7 min read · LW link
(arxiv.org)

William_S’s Shortform

William_S · Mar 22, 2023, 6:13 PM
5 points
48 comments · 1 min read · LW link

Thoughts on refusing harmful requests to large language models

William_S · Jan 19, 2023, 7:49 PM
32 points
4 comments · 2 min read · LW link

Prize for Alignment Research Tasks

Apr 29, 2022, 8:57 AM
64 points
38 comments · 10 min read · LW link

[Question] Is there an intuitive way to explain how much better superforecasters are than regular forecasters?

William_S · Feb 19, 2020, 1:07 AM
16 points
5 comments · 1 min read · LW link

Machine Learning Projects on IDA

Jun 24, 2019, 6:38 PM
49 points
3 comments · 2 min read · LW link

Reinforcement Learning in the Iterated Amplification Framework

William_S · Feb 9, 2019, 12:56 AM
25 points
12 comments · 4 min read · LW link

HCH is not just Mechanical Turk

William_S · Feb 9, 2019, 12:46 AM
42 points
6 comments · 3 min read · LW link

Amplification Discussion Notes

William_S · Jun 1, 2018, 7:03 PM
17 points
3 comments · 3 min read · LW link

Understanding Iterated Distillation and Amplification: Claims and Oversight

William_S · Apr 17, 2018, 10:36 PM
35 points
30 comments · 9 min read · LW link

Improbable Oversight, An Attempt at Informed Oversight

William_S · May 24, 2017, 5:43 PM
3 points
9 comments · 1 min read · LW link
(william-r-s.github.io)

Informed Oversight through Generalizing Explanations

William_S · May 24, 2017, 5:43 PM
2 points
0 comments · 1 min read · LW link
(william-r-s.github.io)

Proposal for an Implementable Toy Model of Informed Oversight

William_S · May 24, 2017, 5:43 PM
2 points
1 comment · 1 min read · LW link
(william-r-s.github.io)