Reality doesn’t ‘hit back’ against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.
Again, this seems less true the better your adversarial training setup is at identifying the test cases on which you're likely to be misaligned?
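To make the disagreement concrete, here is a minimal toy sketch (not anyone's actual training setup; every name and number below is illustrative) of the loop both lines above are gesturing at: a model that fits the loss well on a narrow range of cases, plus an adversarial-search step that tries to surface cases from a wider range where it is still misaligned.

```python
import random

def intended_behaviour(x):
    # Stand-in for the behaviour we actually want (hypothetical, for the toy):
    # linear on small inputs, quadratic on larger ones.
    return 2 * x if x < 1 else x ** 2

def model(params, x):
    # Toy "model": a single scalar weight applied linearly.
    return params * x

def squared_error(params, x):
    return (model(params, x) - intended_behaviour(x)) ** 2

def train(params, cases, lr=1e-3, steps=500):
    # Plain gradient descent on mean squared error over the current case set.
    for _ in range(steps):
        grad = sum(2 * (model(params, x) - intended_behaviour(x)) * x
                   for x in cases) / len(cases)
        params -= lr * grad
    return params

def adversarial_search(params, candidate_pool, k=5):
    # The crux of the argument lives here: how good is this step at surfacing
    # inputs on which the model is misaligned? This toy just scans a fixed
    # candidate pool; a weak search only ever covers a narrow range of cases.
    return sorted(candidate_pool, key=lambda x: squared_error(params, x),
                  reverse=True)[:k]

random.seed(0)
params = random.uniform(-1.0, 1.0)
train_cases = [random.uniform(0.0, 1.0) for _ in range(20)]    # narrow range
wider_pool = [random.uniform(0.0, 10.0) for _ in range(1000)]  # wider range

for _ in range(10):
    params = train(params, train_cases)
    # Fold the worst discovered failures back into the training set.
    train_cases += adversarial_search(params, wider_pool)

narrow_loss = sum(squared_error(params, x) for x in train_cases[:20]) / 20
wide_loss = sum(squared_error(params, x) for x in wider_pool) / len(wider_pool)
print(f"loss on narrow range: {narrow_loss:.3f}, on wider range: {wide_loss:.3f}")
```

The point of the toy is that everything turns on `adversarial_search`: if its candidate pool (or search method) never reaches the wider range, the model keeps looking locally aligned while staying globally off, which is the situation the first line describes; if the search does reliably surface the wider-range failures, the reply above carries more weight.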