I’m not sure what exactly you mean by “retreat to malign generalisation”.
When you don’t have a deep understanding of a phenomenon, it’s common to use an empirical description of what you’re talking about, rather than interpreting the phenomenon through your current (and likely incorrect) model. The problem with using your current model is that it leads you to make incorrect inferences about why things happen, because you’re relying too heavily on the model being internally correct.
Therefore, until we gain a deeper understanding, it’s better to use the pre-theoretical description of what we’re talking about. I’m assuming that’s what Rohin meant by “retreat to malign generalization.”
This is important because the definition we adopt could affect which approaches we use to address inner alignment. For instance, if we used the definition given in the paper, we might try using some interpretability technique to discover the “objective” that a neural network is maximizing. But if our model of the neural network as an optimizer is ultimately incorrect, then the neural network won’t have an explicit objective, making this technique very difficult.
I understand that, and I agree with that general principle. My comment was intended to be about where to draw the line between incorrect theory, acceptable theory, and pre-theory.
In particular, I think that while optimisation is too much theory, goal-directedness talk is not, despite being more in theory-land than empirical malign generalisation talk. We should keep thinking of worries on the level of goals, even as we’re still figuring out how to characterise goals precisely. We should also be thinking of worries on the level of what we could observe empirically.