Seems great! I’m excited about potential interpretability methods for detecting deception.
I think you’re right about the current trade-offs on the gain of function stuff, but it’s good to think ahead and have precommitments for the conditions under which your strategies there should change.
It may be hard to find evals for deception which are sufficiently convincing when they trigger, yet still give us enough time to react afterwards. A few more similar points here: https://www.lesswrong.com/posts/pckLdSgYWJ38NBFf8/?commentId=8qSAaFJXcmNhtC8am
Building good tools for detecting deceptive alignment seems robustly good though, even after you reach a point where you have to drop the gain of function stuff.