I think the key missing piece you’re pointing at (making sure that our interpretability tools, etc., actually tell us something alignment-relevant) is one of the big things going on in model organisms of misalignment (iirc there’s a step that’s like ‘ok, but if we do interpretability/control/etc on the model organism, does that help?’). Ideally this type of work, or something close to it, could become more common // provide ‘evals for our evals’ // expand in scope and application beyond deep deception.
If that happened, it seems like it would fit the bill here.
Does that seem true to you?
Yeah, that seems right to me.
Oh except: I did not necessarily mean to claim that any of the things I mentioned were missing from the alignment research scene, or that they were present.