Agreed. I’d also add:
I think we can mitigate the phrasing issues by presenting tasks in a multiple choice format and measuring log-probability on the scary answer choice.
I think we’ll also want to write hundreds of tests for a particular scary behavior (e.g., power-seeking), rather than a single test. This way, we’ll get somewhat stronger (but still non-conclusive) evidence that the particular scary behavior is unlikely to occur in the future, if all of the tests show decreasing log-likelihood on the scary behavior.
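To make the two points above a bit more concrete, here's a rough sketch of how the measurement could look. It assumes you can already pull one logit per answer choice out of the model for each multiple-choice prompt; that part is not shown. All the function names and the numbers in the usage note are hypothetical and only for illustration, not an existing eval harness.

```python
import math
from typing import Dict, List, Sequence


def choice_log_probs(choice_logits: Sequence[float]) -> List[float]:
    """Log-softmax restricted to the answer choices (numerically stable)."""
    m = max(choice_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in choice_logits))
    return [x - log_z for x in choice_logits]


def scary_log_prob(choice_logits: Sequence[float], scary_index: int) -> float:
    """Log-probability the model puts on the scary answer choice."""
    return choice_log_probs(choice_logits)[scary_index]


def all_tests_decreasing(trajectories: Dict[str, List[float]]) -> bool:
    """True only if, for every test, the scary log-prob strictly decreases
    from each checkpoint to the next.

    `trajectories` maps a (hypothetical) test name to its scary log-probs,
    ordered from earliest to latest checkpoint.
    """
    return all(
        all(later < earlier for earlier, later in zip(series, series[1:]))
        for series in trajectories.values()
    )
```

Usage would be something like: for each of the hundreds of power-seeking prompts, compute `scary_log_prob(logits, scary_index)` at each checkpoint, collect the values per test into a dict, and call `all_tests_decreasing` on it. A `True` result is still only the weak, inconclusive kind of evidence described above, and you'd probably also want to look at the distribution of trends rather than just the all-or-nothing check.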