My attempt to summarize the alignment concern here. Does this seem a reasonable gloss?
It seems plausible that competitive models will not be transparent or introspectable. If you can’t see how the model is making decisions, you can’t tell how it will generalize, and so you don’t get very good safety guarantees. Or, to put it another way, if you can’t interact with the way the model is thinking, then you can’t give a rich enough reward signal to guide it to the region of model space that you want.
This seems not false, but it also seems like you’re emphasizing the wrong bits; e.g., I don’t think we quite need the model to be transparent, in the sense of “seeing how it’s making decisions,” to know how it will generalize.
At some point, model M will have knowledge that should enable it to do task X. However, it’s currently unclear how one would get M to do X in a way that doesn’t implicitly trust the model to be doing something it might not be doing. A core problem of prosaic alignment is figuring out how to get M to do X in a way that lets us know M is actually doing X-and-only-X rather than something else.