This seems not false, but it also seems like you’re emphasizing the wrong bits; e.g., I don’t think we quite need the model to be transparent/“see how it’s making decisions” in order to know how it will generalize.
At some point, model M will have knowledge that should enable it to do task X. However, it’s currently unclear how one would get M to do X in a way that doesn’t implicitly trust M to be doing something it might not actually be doing. It’s a core problem of prosaic alignment to figure out how to get M to do X in a way that lets us know M is actually doing X-and-only-X, rather than something else.