TurnTrout comments on AGI Ruin: A List of Lethalities

TurnTrout 8 Jun 2022 19:01 UTC
LW: 3 AF: 2
One of the problems with English is that it doesn’t natively support orders of magnitude for “unreliable.” Do you mean “unreliable” as in “between 1% and 50% of people end up with part of their values not related to objects-in-reality”, or as in “there is no a priori reason why anyone would ever care about anything not directly sensorially observable, except as a fluke of their training process”? Because the latter is what current alignment paradigms mispredict, and the former might be a reasonable claim about what really happens for human beings.
EDIT: My reader-model is flagging this whole comment as pedagogically inadequate, so I’ll point to the second half of section 5 in my shard theory document.