Another point worth making here is why I haven't separated out worst-case inspection transparency for deceptive models from worst-case training process transparency for deceptive models there. That's because, while technically the latter is strictly more complicated than the former, I actually think they're likely to be equally difficult. In particular, I suspect that the only way we might actually have a shot at understanding worst-case properties of deceptive models is through understanding how they're trained.
I'd be curious to hear a bit more justification for this. It feels like resting on this intuition as a reason not to include worst-case inspection transparency for deceptive models as a separate node is a bit of a brittle choice (i.e., it makes it more likely that the tech tree would change if we got new information). You write:
That is, if our ability to understand training dynamics is good enough, we might be able to make it impossible for a deceptive model to evade us by always being able to see its planning for how to do so during training.
which to me is a justification that worst-case inspection transparency for deceptive models is solved if we solve worst-case training process transparency for deceptive models, but not a justification that that’s the only way to solve it.