Deep learning is strongly biased toward networks that generalize the way humans want; otherwise, it wouldn’t be economically useful.
This is NOT what the evidence supports, and it’s super misleadingly phrased. (Either that, or it’s straight-up magical thinking, which is worse.)
The inductive biases / simplicity biases of deep learning are poorly understood, but they almost certainly don’t have anything to do with what humans want, per se. (That would be basically magic.) Rather, humans have gotten decent at intuiting them, such that humans can often predict how the neural network will generalize in response to such-and-such training data. I.e., the human intuitive sense of simplicity is different, but not totally different, at least not always, from the actual simplicity biases at play.
Stylized abstract example: Our current AI is not generalizing in the way we wanted it to. Looking at its behavior and our dataset, we intuit that the dataset D is narrow/nondiverse in ways Y and Z, and that this could be causing the problem. We go collect more data so that our dataset is diverse in those ways and try again, and this time it works (i.e. the AI generalizes to unseen data X). Why did this happen? Why didn’t it just overfit to the new dataset Dnew and fail at X? Because the simplicity biases were as we suspected they were: the model was indeed learning [nondiverse, overfitting-to-D policy] and not [desired policy] for Y- and Z-related reasons. In the new training run Y and Z were fixed, and the simplest policy was [desired policy] instead of [nondiverse, overfitting-to-Dnew policy], as we predicted and hoped it would be.
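To make the stylized example concrete, here is a toy sketch. It is only an illustration under loud assumptions: a two-feature linear model fit by minimum-norm least squares stands in for "the simplest policy", which is NOT a claim about the actual (poorly understood) simplicity biases of deep learning. The names `fit_min_norm`, `mse`, `core`, and the specific way D is "narrow" (the second feature always copies the first) are all hypothetical choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def fit_min_norm(features, labels):
    """Least-squares fit; for rank-deficient data lstsq returns the
    minimum-norm solution, our stand-in for a 'simplicity bias'."""
    weights, *_ = np.linalg.lstsq(features, labels, rcond=None)
    return weights


def mse(weights, features, labels):
    """Mean squared error of a linear policy on a dataset."""
    return float(np.mean((features @ weights - labels) ** 2))


# [desired policy]: the label equals the first feature.
core = rng.normal(size=200)
labels = core

# D is narrow: the second feature always copies the first, so many
# policies fit D equally well and the "simplest" one mixes both features.
D = np.column_stack([core, core])

# Dnew fixes the narrowness: the second feature now varies independently.
Dnew = np.column_stack([core, rng.normal(size=200)])

# Unseen data X: the second feature is the opposite of the first.
core_x = rng.normal(size=200)
X = np.column_stack([core_x, -core_x])

w_D = fit_min_norm(D, labels)        # roughly [0.5, 0.5]
w_Dnew = fit_min_norm(Dnew, labels)  # roughly [1.0, 0.0]

print("trained on D:    weights", w_D, "error on X", mse(w_D, X, core_x))
print("trained on Dnew: weights", w_Dnew, "error on X", mse(w_Dnew, X, core_x))
```

Both fits have essentially zero training error, so the difference on X comes entirely from which of the data-consistent policies the fitting procedure prefers. In this toy we can predict that preference exactly; the point of the example in the text is that for real networks we only intuit it.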
they almost certainly don’t have anything to do with what humans want, per se. (that would be basically magic)
We are obviously not appealing to literal telepathy or magic. Deep learning generalizes the way we want in part because we designed the architectures to be good, in part because human brains are built on similar principles to deep learning, and in part because we share a world with our deep learning models and are exposed to similar data.
Saying we design the architectures to be good is assuming away the problem. We design the architectures to be good according to a specific set of metrics (test loss, performance on certain downstream tasks, etc.). Problems like scheming are compatible with good performance on these metrics.
I think the argument that the similarity between human brains and deep learning leads to good/nice/moral generalization is wrong. Human brains are way more similar to other natural brains which we would not say have nice generalization (e.g. the brains of bears or human psychopaths). One would need to make the argument that deep learning has certain similarities to human brains that these malign cases lack.
I’m not actually sure the scheming problems are “compatible” with good performance on these metrics, and even if they are, that doesn’t mean they’re likely or plausible given good performance on our metrics.
Human brains are way more similar to other natural brains
So I disagree with this, but likely because we are using different conceptions of similarity. In order to continue this conversation we’re going to need to figure out what “similar” means, because the term is almost meaningless in controversial cases: you can fill in whatever similarity metric you want. I used the term earlier as shorthand for a more detailed story about randomly initialized singular statistical models learned with iterative, local update rules. I think both artificial and biological NNs fit that description, and this is an important notion of similarity.
This is NOT what the evidence supports, and it’s super misleadingly phrased. (Either that, or it’s straight-up magical thinking, which is worse.)
The inductive biases / simplicity biases of deep learning are poorly understood but they almost certainly don’t have anything to do with what humans want, per se.
Seems like a misunderstanding. It seems to me that you are alleging that Nora/Quintin believe there is a causal arrow from “Humans want X generalization” to “NNs have X generalization”? If so, I think that’s an uncharitable reading of the quoted text.
I said “Either that, or it’s straight-up magical thinking,” which was referring to the causal arrow hypothesis. I agree it’s unlikely that they would endorse the causal arrow / magical thinking hypothesis, especially once it’s spelled out like that.
What do you think they meant by “Deep learning is strongly biased toward networks that generalize the way humans want; otherwise, it wouldn’t be economically useful”?
I think they meant that there is an evidential update from “it’s economically useful” upwards on “this way of doing things tends to produce human-desired generalization in general and not just in the specific tasks examined so far.”
Perhaps it’s easier to consider the same style of reasoning via: “The routes I take home from work are strongly biased towards being short, otherwise I wouldn’t have taken them home from work.”
Thanks. The routes-home example checks out IMO. Here’s another one that also seems to check out, which perhaps illustrates why I feel like the original claim is misleading/unhelpful/etc.: “The laws of ballistics strongly bias aerial projectiles towards landing on targets humans wanted to hit; otherwise, ranged weaponry wouldn’t be militarily useful.”
There’s a non-misleading version of this which I’d recommend saying instead, which is something like “Look, we understand the laws of physics well enough, and have played around with projectiles enough in practice, that we can predict reasonably well where they’ll land in a variety of situations, and design+aim weapons accordingly; if this wasn’t true then ranged weaponry wouldn’t be militarily useful.”
And I would endorse the corresponding claim for deep learning: “We understand how deep learning networks generalize well enough, and have played around with them enough in practice, that we can predict reasonably well how they’ll behave in a variety of situations, and design training environments accordingly; if this wasn’t true then deep learning wouldn’t be economically useful.”
(To which I’d reply “Yep, and my current understanding of how they’ll behave in certain future scenarios is that they’ll power-seek, for reasons which others have explained… I have some ideas for other, different training environments that probably wouldn’t result in undesired behavior, but all of this is still pretty up in the air, tbh. I don’t think anyone really understands what they are doing here nearly as well as e.g. cannoneers in 1850 understood what they were doing.”)
To put it in terms of the analogy you chose: I agree (in a sense) that the routes you take home from work are strongly biased towards being short, otherwise you wouldn’t have taken them home from work. But suppose you tell me that today you are going to try out a new route, you describe it to me, and it seems to me that it’s probably going to be super long. If I object and say it seems like it’ll be super long for reasons XYZ, it’s not a valid reply for you to say “don’t worry, the routes I take home from work are strongly biased towards being short, otherwise I wouldn’t take them.” At least, it seems like a pretty confusing and maybe misleading thing to say. I would accept “Trust me on this, I know what I’m doing, I’ve got lots of experience finding short routes,” I guess, though only for half credit, since it still wouldn’t be an object-level reply to the reasons XYZ. In the absence of such a substantive reply I’d start to doubt your expertise and/or doubt that you were applying it correctly here (especially if I had an error theory for why you might be motivated to think that this route would be short even if it wasn’t).