
William_S (William Saunders)

Karma: 1,756

I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight as part of the team developing critiques, a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to manage a team of 4 people working on understanding language model features in context, which led to the release of an open source “transformer debugger” tool.
I resigned from OpenAI on February 15, 2024.

Principles for the AGI Race

William_S · 30 Aug 2024 14:29 UTC
242 points
13 comments · 18 min read · LW link

Transformer Circuit Faithfulness Metrics Are Not Robust

12 Jul 2024 3:47 UTC
98 points
5 comments · 7 min read · LW link
(arxiv.org)

William_S’s Shortform

William_S · 22 Mar 2023 18:13 UTC
5 points
39 comments · 1 min read · LW link

Thoughts on refusing harmful requests to large language models

William_S · 19 Jan 2023 19:49 UTC
32 points
4 comments · 2 min read · LW link

Prize for Alignment Research Tasks

29 Apr 2022 8:57 UTC
64 points
38 comments · 10 min read · LW link

[Question] Is there an intuitive way to explain how much better superforecasters are than regular forecasters?

William_S · 19 Feb 2020 1:07 UTC
16 points
5 comments · 1 min read · LW link

Machine Learning Projects on IDA

24 Jun 2019 18:38 UTC
49 points
3 comments · 2 min read · LW link

Reinforcement Learning in the Iterated Amplification Framework

William_S · 9 Feb 2019 0:56 UTC
25 points
12 comments · 4 min read · LW link

HCH is not just Mechanical Turk

William_S · 9 Feb 2019 0:46 UTC
42 points
6 comments · 3 min read · LW link

Amplification Discussion Notes

William_S · 1 Jun 2018 19:03 UTC
17 points
3 comments · 3 min read · LW link

Understanding Iterated Distillation and Amplification: Claims and Oversight

William_S · 17 Apr 2018 22:36 UTC
35 points
30 comments · 9 min read · LW link

Improbable Oversight, An Attempt at Informed Oversight

William_S · 24 May 2017 17:43 UTC
3 points
9 comments · 1 min read · LW link
(william-r-s.github.io)

Informed Oversight through Generalizing Explanations

William_S · 24 May 2017 17:43 UTC
2 points
0 comments · 1 min read · LW link
(william-r-s.github.io)

Proposal for an Implementable Toy Model of Informed Oversight

William_S · 24 May 2017 17:43 UTC
2 points
1 comment · 1 min read · LW link
(william-r-s.github.io)