Shard Theory
Quintin Pope · Jul 14, 2022, 1:36 AM

Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2.

Humans provide an untapped wealth of evidence about alignment
TurnTrout and Quintin Pope · Jul 14, 2022, 2:31 AM · 211 points · 94 comments · 9 min read · 1 review

Human values & biases are inaccessible to the genome
TurnTrout · Jul 7, 2022, 5:29 PM · 94 points · 54 comments · 6 min read · 1 review

General alignment properties
TurnTrout · Aug 8, 2022, 11:40 PM · 50 points · 2 comments · 1 min read

Evolution is a bad analogy for AGI: inner alignment
Quintin Pope · Aug 13, 2022, 10:15 PM · 79 points · 15 comments · 8 min read

Reward is not the optimization target
TurnTrout · Jul 25, 2022, 12:03 AM · 375 points · 123 comments · 10 min read · 3 reviews

The shard theory of human values
Quintin Pope and TurnTrout · Sep 4, 2022, 4:28 AM · 255 points · 67 comments · 24 min read · 2 reviews

Understanding and avoiding value drift
TurnTrout · Sep 9, 2022, 4:16 AM · 48 points · 14 comments · 6 min read

A shot at the diamond-alignment problem
TurnTrout · Oct 6, 2022, 6:29 PM · 95 points · 67 comments · 15 min read

Don’t design agents which exploit adversarial inputs
TurnTrout and Garrett Baker · Nov 18, 2022, 1:48 AM · 72 points · 64 comments · 12 min read

Don’t align agents to evaluations of plans
TurnTrout · Nov 26, 2022, 9:16 PM · 48 points · 49 comments · 18 min read

Alignment allows “nonrobust” decision-influences and doesn’t require robust grading
TurnTrout · Nov 29, 2022, 6:23 AM · 62 points · 41 comments · 15 min read

Inner and outer alignment decompose one hard problem into two extremely hard problems
TurnTrout · Dec 2, 2022, 2:43 AM · 148 points · 22 comments · 47 min read · 3 reviews