Ethan Perez comments on We may be able to see sharp left turns coming

Ethan Perez Sep 3, 2022, 8:55 PM
Agreed. I’d also add:
1. I think we can mitigate the phrasing issues by presenting tasks in a multiple choice format and measuring log-probability on the scary answer choice.
2. I think we’ll also want to write hundreds of tests for a particular scary behavior (e.g., power-seeking), rather than a single test. That way, if all of the tests show decreasing log-probability on the scary answer, we’ll get somewhat stronger (but still inconclusive) evidence that the behavior is unlikely to emerge in the future.
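
To make the two points above concrete, here is a rough sketch of how the bookkeeping might look. It assumes you already have, for each test, a raw log-score per answer choice from each model, and that models are ordered along some capability axis such as size or training step (the axis is my assumption, as are all function names here, which are hypothetical). It is an illustration, not a tested evaluation harness.

```python
import math
from typing import Dict, Sequence


def choice_log_probs(raw_scores: Dict[str, float]) -> Dict[str, float]:
    """Renormalize raw per-choice log-scores over the answer choices (log-softmax)."""
    top = max(raw_scores.values())
    # Subtract the max before exponentiating for numerical stability.
    log_z = top + math.log(sum(math.exp(s - top) for s in raw_scores.values()))
    return {label: s - log_z for label, s in raw_scores.items()}


def scary_log_prob(raw_scores: Dict[str, float], scary_label: str) -> float:
    """Log-probability assigned to the scary answer choice on one test."""
    return choice_log_probs(raw_scores)[scary_label]


def is_strictly_decreasing(values: Sequence[float]) -> bool:
    """True if every value is lower than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


def fraction_decreasing(per_test_series: Sequence[Sequence[float]]) -> float:
    """Fraction of tests whose scary-answer log-prob falls at every step.

    Each inner sequence holds one test's scary-answer log-probs, ordered
    from least to most capable model (assumed ordering).
    """
    if not per_test_series:
        raise ValueError("need at least one test")
    hits = sum(is_strictly_decreasing(series) for series in per_test_series)
    return hits / len(per_test_series)


if __name__ == "__main__":
    # Made-up numbers purely for illustration; not real model outputs.
    small = scary_log_prob({"A": -1.0, "B": -1.2}, scary_label="A")
    large = scary_log_prob({"A": -2.5, "B": -0.3}, scary_label="A")
    print(fraction_decreasing([[small, large]]))
```

Requiring a strict decrease on every test is the most conservative reading of "all of the tests"; a softer criterion (e.g., a negative fitted slope) would be a different design choice, and either way the result is only suggestive.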