> From my perspective, training stories are focused pretty heavily on the idea that justification is going to come from a style more like heavily precedented black boxes than like cognitive interpretability.
I definitely don’t think this; in fact, I tend to think that cognitive interpretability is probably the only way we can plausibly get high levels of confidence in the safety of a training process. From “How do we become confident in the safety of a machine learning system?”:

> Nevertheless, I think that transparency-and-interpretability-based training rationales are some of the most exciting, since, unlike inductive bias analysis, they actually provide feedback during training, potentially letting us see problems as they arise rather than having to get everything right in advance.
See also: “A transparency and interpretability tech tree”