I am a bit confused by your operationalization of “Dangerous”. On one hand
I posit that interpretability work is “dangerous” when it enhances the overall capabilities of an AI system, without making that system more aligned with human goals
is a definition I broadly agree with, especially since you want it to track the alignment-capabilities trade-off (see also this post). However, your examples suggest a more deontological approach:
This suggests a few concrete rules-of-thumb, which a researcher can apply to their interpretability project P: …
If P makes it easier/more efficient to train powerful AI models, then P is dangerous.
Do you buy the alignment-capabilities trade-off model, or are you trying to establish principles for interpretability research? (or if both, please clarify what definition we’re using here)
Good point. My basic idea is something like "most interp work makes it more efficient to train/use increasingly-powerful/dangerous models". So I think both uses of "dangerous" you quote here fit with this idea.