Alignment Hot Take Advent Calendar
by Charlie Steiner · 1 Dec 2022 22:42 UTC

Take 1: We’re not going to reverse-engineer the AI. (1 Dec 2022 22:41 UTC, 38 points, 4 comments, 4 min read)
Take 2: Building tools to help build FAI is a legitimate strategy, but it’s dual-use. (3 Dec 2022 0:54 UTC, 17 points, 1 comment, 2 min read)
Take 3: No indescribable heavenworlds. (4 Dec 2022 2:48 UTC, 23 points, 12 comments, 2 min read)
Take 4: One problem with natural abstractions is there’s too many of them. (5 Dec 2022 10:39 UTC, 37 points, 4 comments, 1 min read)
Take 5: Another problem for natural abstractions is laziness. (6 Dec 2022 7:00 UTC, 31 points, 4 comments, 3 min read)
Take 6: CAIS is actually Orwellian. (7 Dec 2022 13:50 UTC, 14 points, 8 comments, 2 min read)
Take 7: You should talk about “the human’s utility function” less. (8 Dec 2022 8:14 UTC, 50 points, 22 comments, 2 min read)
Take 8: Queer the inner/outer alignment dichotomy. (9 Dec 2022 17:46 UTC, 28 points, 2 comments, 2 min read)
Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment. (12 Dec 2022 11:51 UTC, 33 points, 13 comments, 2 min read)
Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. (13 Dec 2022 7:04 UTC, 37 points, 3 comments, 2 min read)
Take 11: “Aligning language models” should be weirder. (18 Dec 2022 14:14 UTC, 34 points, 0 comments, 2 min read)
Take 12: RLHF’s use is evidence that orgs will jam RL at real-world problems. (20 Dec 2022 5:01 UTC, 25 points, 1 comment, 3 min read)
Take 13: RLHF bad, conditioning good. (22 Dec 2022 10:44 UTC, 54 points, 4 comments, 2 min read)
Take 14: Corrigibility isn’t that great. (25 Dec 2022 13:04 UTC, 15 points, 3 comments, 3 min read)