Ty for review. I still think it’s better, because it gets closer to concepts that might actually be investigated directly. But happy to agree to disagree here.
Small relevant datapoint: the paper version of this was just accepted to ICLR, making it the first time a high-level “case for misalignment as an x-risk” has been accepted at a major ML conference, to my knowledge. (Though Langosco’s goal misgeneralization paper did this a little bit, and was accepted at ICML.)