What kind of interpretability work do you consider plausibly useful or at least not counterproductive?
The main criterion, I think, is not broadcasting it to organizations that have no plausibly workable plan for aligning strong superintelligence. This probably means not publishing, and it also means working at an employer that has a workable plan.
There might be some types of interp which don’t have much capabilities potential, and are therefore safe to publish widely. Maybe some of the work focused specifically on detecting deception? But mostly I expect interp to be good only when it is part of a wider plan with a specific, backchained way to end the acute risk period, one which might take advantage of the capabilities boosts interp offers. My steelman of Anthropic is that they are trying to pull something like this off, though they’re pretty careful to avoid leaking details of what their wider plan is, if they have one.