Thanks for the response!
I think I understand these points, and I don’t see how this contradicts what I’m saying. I’ll try rewording.
Consider the following Gaussian process:
Each blue line represents a possible fit of the training data (the red points), so which one of these a learning process selects is a question of inductive bias. I don’t have a formalization, but I claim that if your data distribution is sufficiently complicated, OOD generalization will, by default, be poor.
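A minimal numpy sketch of the same picture, assuming a zero-mean GP with a squared-exponential kernel and a toy sine-shaped training set (these choices are mine, not taken from the plot): posterior samples nearly coincide on the training inputs and spread out far away from them, so the data alone doesn't decide which OOD behavior you get.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 * length_scale^2))."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior_samples(x_train, y_train, x_query, n_samples=5, noise=1e-4, seed=0):
    """Draw candidate functions from a zero-mean GP posterior fit to (x_train, y_train)."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    k_qq = rbf_kernel(x_query, x_query)
    # Standard GP conditioning: posterior mean and covariance at the query points
    mean = k_qt @ np.linalg.solve(k_tt, y_train)
    cov = k_qq - k_qt @ np.linalg.solve(k_tt, k_qt.T)
    # Symmetrize and add a little jitter so the covariance is numerically PSD
    cov = 0.5 * (cov + cov.T) + 1e-8 * np.eye(len(x_query))
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_samples)


# Assumed toy setup (not from the original plot): five points on a sine curve
x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.sin(x_train)
x_ood = np.array([6.0, 8.0])  # far outside the training range
x_query = np.concatenate([x_train, x_ood])

samples = gp_posterior_samples(x_train, y_train, x_query, n_samples=200)
spread = samples.std(axis=0)  # disagreement between sampled fits at each input
print("spread on training inputs:", spread[:5].round(3))
print("spread far OOD:", spread[5:].round(3))
```

With a different kernel or length scale the OOD samples would look different, which is the point: the training data doesn't choose among them, the prior does.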
Now, you might ask: how is this consistent with capabilities generalizing? I’d note that capabilities haven’t generalized all that well so far, but once they do, it will be because the learned algorithm has found exploitable patterns in the world and methods of reasoning that generalize far OOD.
You’ve argued that evolution and NNs have different parameter-function maps and so will generalize differently. This is of course true, but I think it’s beside the point. My claim is that selecting over a dataset with sufficiently many proxies that fail OOD, without a particularly benign inductive bias, leads (with high probability) to the selection of a function that fails OOD. Since most generalizations are bad, we should expect bad behavior from NNs as well as from evolution. I continue to think evolution is valid evidence for this claim, and the specific inductive bias isn’t load-bearing on this point; the related load-bearing assumption is the lack of an inductive bias that is benign.
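To make “most generalizations are bad” concrete, here’s a toy count under assumptions that are mine and deliberately extreme: hypotheses are all boolean functions on 3-bit inputs, the training data covers only inputs whose first bit is 0, the “intended” function (hypothetically) copies the last bit, and the inductive bias is uniform over every hypothesis that fits the training data, as a stand-in for “no particularly benign inductive bias.”

```python
from itertools import product


def intended(x):
    # Hypothetical "intended" behavior: output the last input bit
    return x[-1]


def consistent_hypotheses(train, all_inputs, target):
    """Yield every boolean function (as a dict) that agrees with target on train."""
    free = [x for x in all_inputs if x not in train]
    for labels in product([0, 1], repeat=len(free)):
        h = {x: target(x) for x in train}
        h.update(zip(free, labels))
        yield h


def ood_stats(n_bits=3):
    """Count how many data-consistent hypotheses are also correct off-distribution."""
    all_inputs = list(product([0, 1], repeat=n_bits))
    # Hypothetical split: training covers only inputs whose first bit is 0
    train = [x for x in all_inputs if x[0] == 0]
    test = [x for x in all_inputs if x[0] == 1]
    hyps = list(consistent_hypotheses(train, all_inputs, intended))
    # Under a uniform prior over hyps, each is equally likely to be selected
    n_perfect = sum(all(h[x] == intended(x) for x in test) for h in hyps)
    mean_acc = sum(
        sum(h[x] == intended(x) for x in test) / len(test) for h in hyps
    ) / len(hyps)
    return len(hyps), n_perfect, mean_acc


print(ood_stats(3))  # (16, 1, 0.5)
```

Only one of the 16 hypotheses that fit the training data is correct OOD, and the average OOD accuracy is chance. The toy doesn’t model any real architecture; it only shows that fitting the training data leaves OOD behavior unconstrained, so the inductive bias has to do the work.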
If we had reason to think that NNs were particularly benign, and that once NNs became sufficiently capable their alignment would also generalize correctly, then you could argue that we don’t have to worry about this. But as yet, I don’t see a reason to think that an NN parameter-function map is more likely than any other set of inductive biases to pick a good generalization by default.
It feels to me as if your argument is that we understand neither evolution’s nor NNs’ inductive biases, so we can’t make strong predictions about OOD generalization, and we are left with our high-uncertainty prior over all of the possible proxies that we could find. It seems to me that we are far from being able to argue things like “because of inductive bias from the NN architecture, we’ll get non-deceptive AIs, even if there is a deceptive basin in the loss landscape that could get higher reward.”
I suspect you think bad misgeneralization happens only when you have a two-layer selection process (and that this is especially sharp when there’s a large time disparity between the two processes), as with evolution setting up human within-lifetime learning. I don’t see why you think that these types of functions would be more likely to misgeneralize.
(only responding to the first part of your comment now, may add on additional content later)