I think that’s also a good thing to think about, but most of the meat is in how you actually reason about that, and how it leads to superior or at least adequate+complementary predictions about the behavior of ML systems. I think that, to the extent this perspective is useful for alignment, it also ought to be useful for reasoning about the behavior of existing systems like large language models.
Sure. To clarify, superior to what? “GPT-3 reliably minimizes prediction error; it is inner-aligned to its training objective”?
I’d describe the alternative perspective as: we try to think of GPT-3 as “knowing” some facts and having certain reasoning abilities. Then, to predict how it behaves on a new input, we ask what the best next-token prediction on the training distribution would be, given that knowledge and reasoning ability.
Of course the view isn’t “this is always what happens,” it’s a way of making a best guess. We could clarify how to set the error bars, or how to think more precisely about what “knowledge” and “reasoning abilities” mean. And our predictions depend on our prior over what knowledge and reasoning abilities models will have, which will be informed by a combination of estimates of the algorithmic complexity of behaviors and the bang-for-your-buck of different kinds of knowledge, but will ultimately depend on a lot of uncertain empirical facts about what kinds of things language models are able to learn. Overall I acknowledge you’d have to say a lot more to make this into something fully precise, and I’d guess the same will be true of a competing perspective.
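To make that heuristic a bit more concrete, here is a minimal toy sketch of how the prediction procedure could be factored. The candidate continuations, the corpus-likelihood scores, and the `within_capabilities` check below are hypothetical placeholders supplied for illustration, not anything measured from a real model or taken from this discussion.

```python
# Toy sketch of the prediction heuristic: guess the continuation that would be
# most likely on the training distribution, restricted to continuations the
# model plausibly has the knowledge and reasoning ability to produce.
# All inputs below are hypothetical placeholders, not real measurements.

def predict_continuation(candidates, corpus_likelihood, within_capabilities):
    """Return our best guess at the model's output for a prompt."""
    feasible = [c for c in candidates if within_capabilities(c)]
    if not feasible:
        # If nothing seems within the model's reach, fall back to whatever is
        # simply most common on the training distribution.
        return max(candidates, key=corpus_likelihood)
    return max(feasible, key=corpus_likelihood)


# Example: prompt = "The capital of Australia is"
candidates = ["Canberra.", "Sydney.", "a large city."]
corpus_likelihood = {"Canberra.": 0.6, "Sydney.": 0.3, "a large city.": 0.1}.get
within_capabilities = lambda c: True  # assume recalling a capital is well within reach

print(predict_continuation(candidates, corpus_likelihood, within_capabilities))  # -> "Canberra."
```

The point isn’t the code itself but the factoring: the guess depends both on what the training distribution looks like and on a separate judgment about what the model can know or compute.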
I think this is roughly how many people make predictions about GPT-3, and in my experience it generally works pretty well and many apparent errors can be explained by more careful consideration of the training distribution. If we had a contest where you tried to give people short advice strings to help them predict GPT-3’s behavior, I think this kind of description would be an extremely strong entry.
This procedure is far from perfect. So you could imagine something else doing a lot better (or providing significant additional value as a complement).