I think it’s often helpful to talk about whether a policy which behaves badly (i.e. which selects actions whose consequences we dislike, or like less than the consequences of some alternative actions it “could have” selected) will receive a high loss. One reason this is helpful is that we expect SGD to correct behaviors that lead to a high loss, given enough time, and so if bad behavior gets a high loss then we only need to think about optimization difficulties or cases where the failures can be catastrophic before SGD can correct them. Another reason this is helpful is that if we use SGD to select policies that get a low loss across a broad range of training environments, we have prima facie reason to expect the resulting policies to get a low loss in other environments.
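As a toy illustration of that second point, here is a minimal numpy sketch (the “environments,” the linear task, and all the numbers are made up purely for illustration, not meant to resemble a real training setup): a policy selected by SGD for low loss across several training environments also ends up with a low loss on a new environment from the same family.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])   # the "good behavior" the loss rewards

def sample_env(shift, n=256):
    """One 'environment': inputs with a shifted mean, same underlying task."""
    x = rng.normal(loc=shift, scale=1.0, size=(n, 3))
    y = x @ true_w + rng.normal(scale=0.1, size=n)
    return x, y

train_envs = [sample_env(s) for s in (-1.0, 0.0, 1.0)]
held_out_env = sample_env(2.0)        # an environment we never train on

def loss(w, env):
    x, y = env
    return float(np.mean((x @ w - y) ** 2))

w = np.array([5.0, 5.0, 5.0])         # a "badly behaving" initial policy: high loss
print("initial held-out loss:", round(loss(w, held_out_env), 3))

# Plain SGD over the mixture of training environments gradually corrects
# the high-loss behavior...
for step in range(2000):
    x, y = train_envs[step % len(train_envs)]
    i = rng.integers(len(y))
    grad = 2 * (x[i] @ w - y[i]) * x[i]
    w -= 0.01 * grad

# ...and the resulting low-loss policy also gets a low loss on the new environment.
print("train losses:", [round(loss(w, e), 3) for e in train_envs])
print("held-out loss:", round(loss(w, held_out_env), 3))
```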
I interpret you as making the claim (across this and your other recent posts): don’t expect policies to get a low loss just because they were selected for getting a low loss; instead, think about how SGD steps will shape what they are “trying” to do, and use that to reason directly about their generalization behavior. I think that’s also a good thing to think about, but most of the meat is in how you actually reason about that and how it leads to superior or at least adequate+complementary predictions about the behavior of ML systems. I think to the extent this perspective is useful for alignment it also ought to be useful for reasoning about the behavior of existing systems like large language models, or for designing competitive alternatives to those systems (or else the real meat is talking about what distinguishes future AI systems from current AI systems such that this analysis approach will become more appropriate later even if it’s less appropriate now). I’d be significantly more interested to see posts and experiments engaging with that.
As an aside: I think the physical implementation of the loss function is most relevant if your loss function is based on the reward that results from executing a given action (or on predicting actions based on their measured consequences). If your AI is instead trained via imitation learning or process-based feedback, the physical implementation of the loss does not seem to have special significance, e.g. SGD will not select policies that intervene in the physical world in order to change the physical implementation of their loss. (Though of course there can be policies that intervene in the physical world in order to change the dynamics of the learning process itself, and in some sense no physically implemented process could select for loss—a cognitive pattern that simply grabs resources by means unrelated to the loss function will ultimately be favored by the physical world no matter what training code you wrote down.)
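As a sketch of that contrast (the `env`, `policy`, and demonstration interfaces below are assumptions made up for illustration, not anything from a particular codebase): in the reward-based case the loss routes through whatever the environment physically computes after the action is executed, while in the imitation case it is a pure function of the policy’s output and a fixed logged target.

```python
# Illustrative sketch only: `env`, `policy`, and the demonstration format are
# assumed interfaces, made up for this example.

def reward_based_loss(policy, env):
    # The loss depends on the reward the environment actually emits after the
    # action is executed, so a policy that intervenes in the world to change
    # how that reward gets computed can thereby change its own loss.
    obs = env.reset()
    action = policy(obs)
    _, reward, *_ = env.step(action)      # physically implemented reward channel
    return -reward

def imitation_loss(policy, demonstration):
    # The loss is a pure function of the policy's output and a fixed logged
    # (observation, action) pair; nothing the policy does in the world enters
    # this computation, so SGD is not selecting for reward-channel tampering.
    obs, demonstrated_action = demonstration
    predicted_action = policy(obs)
    return (predicted_action - demonstrated_action) ** 2
```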
Note that this is all less relevant to ARC because our goal is to find strategies for which we can’t tell a plausible story about why the resulting AI would tend to take creative and coherent actions to kill you. From our perspective, it’s evidently plausible for SGD to find a system that reasons explicitly about how to achieve a low loss (and so if such systems would kill you then that’s a problem), as well as plausible for SGD to find a system that behaves in a completely different way and can’t even be described as pursuing goals (so if you are relying on goal-directed behavior of any kind then that’s a problem). So we can’t really rely on any approaches like the ones you are either advocating or critiquing. Of course most people consider our goal completely hopeless, and I’m only saying this to clarify how ARC thinks about these things.
I interpret you as making the claim (across this and your other recent posts): don’t expect policies to get a low loss just because they were selected for getting a low loss; instead, think about how SGD steps will shape what they are “trying” to do, and use that to reason directly about their generalization behavior.
Yeah… I interpret TurnTrout as saying “look I know it seems straightforward to say that we are optimizing over policies rather than building policies that optimize for reward, but actually this difference is incredibly subtle”. And I think he’s right that this exact point has the kind of subtlety that just keeps biting again and again. I have the sense that this distinction held up evolutionary biology for decades.
Nevertheless, yes, as you say, the question is how to in fact reason from “policies selected according to such-and-such loss” to “any guarantees whatsoever about general behavior of policy”. I wish we could say more about why this part of the problem is so ferociously difficult.
I think that’s also a good thing to think about, but most of the meat is in how you actually reason about that and how it leads to superior or at least adequate+complementary predictions about the behavior of ML systems. I think to the extent this perspective is useful for alignment it also ought to be useful for reasoning about the behavior of existing systems like large language models
Sure. To clarify, superior to what? “GPT-3 reliably minimizes prediction error; it is inner-aligned to its training objective”?
I’d describe the alternative perspective as: we try to think of GPT-3 as “knowing” some facts and having certain reasoning abilities. Then to predict how it behaves on a new input, we ask what the best prediction would be for the token that follows that input on the training distribution, given that knowledge and reasoning ability.
Of course the view isn’t “this is always what happens”; it’s a way of making a best guess. We could clarify how to set the error bars, or how to think more precisely about what “knowledge” and “reasoning abilities” mean. And our predictions depend on our prior over what knowledge and reasoning abilities models will have, which will be informed by a combination of estimates of the algorithmic complexity of behaviors and the bang-for-your-buck of different kinds of knowledge, but will ultimately depend on a lot of uncertain empirical facts about what kinds of things language models are able to learn. Overall I acknowledge you’d have to say a lot more to make this into something fully precise, and I’d guess the same will be true of a competing perspective.
I think this is roughly how many people make predictions about GPT-3, and in my experience it generally works pretty well and many apparent errors can be explained by more careful consideration of the training distribution. If we had a contest where you tried to give people short advice strings to help them predict GPT-3’s behavior, I think this kind of description would be an extremely strong entry.
This procedure is far from perfect. So you could imagine something else doing a lot better (or providing significant additional value as a complement).
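For concreteness, here is one hedged sketch of how entries in that kind of contest could be scored (everything here, including the `entrant_probs` / `model_probs` interfaces, is hypothetical and introduced only for illustration, not a real benchmark or API):

```python
import math

def score_entry(entrant_probs, model_probs, prompts):
    """Score a contest entry: `entrant_probs(prompt)` and `model_probs(prompt)`
    each return a {next_token: probability} dict (hypothetical interfaces).
    The entry is scored by the average log probability it assigns to the token
    the model itself rates most likely; higher (closer to 0) is better."""
    total = 0.0
    for prompt in prompts:
        model_dist = model_probs(prompt)
        model_top = max(model_dist, key=model_dist.get)
        total += math.log(entrant_probs(prompt).get(model_top, 1e-9))
    return total / len(prompts)
```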