A sequence of my alignment distillations, written up as I worked my way through understanding AI alignment theory.
My rough guiding research algorithm was to focus on the biggest hazard in my current model of alignment, try to understand and explain that hazard and the proposed solutions to it, and then recurse.
This now-finished sequence is representative of my developing alignment model before I became substantially informationally entangled, in person, with the Berkeley alignment community. It’s what I was able to glean from just reading a lot online.