The intent of the scenario is to find which model dominates, so the loss should probably be non-negative. If you use squared error in that scenario, then the loss of the mixture is always greater than or equal to the loss of any particular model in the mixture.
I don’t see why that would necessarily be true. Say you have 3 data points from my Y=X+1 example from above:
(0,1)
(1,2)
(2,3)
And say the composite model is a weighted average of Y=X and Y=X+2 with equal weights (so just the regular average).
This means that the composite model outputs will be:

$$Y = \frac{\text{FirstComponentOutput} + \text{SecondComponentOutput}}{2} = \frac{X + (X+2)}{2} = \frac{2X+2}{2} = X+1$$

Thus the composite model would be right on the line, and would get each data point’s Y-value exactly right (and have 0 loss).
The squared error loss would be:

$$\text{TotalLoss} = (\text{ModelOutput}(0)-1)^2 + (\text{ModelOutput}(1)-2)^2 + (\text{ModelOutput}(2)-3)^2$$
$$= ((0+1)-1)^2 + ((1+1)-2)^2 + ((2+1)-3)^2 = 0$$

By contrast, each of the two component models would have a total squared error of 3 on these 3 data points.
The Y=X component model would have a total squared error loss of:

$$\text{TotalLoss} = (0-1)^2 + (1-2)^2 + (2-3)^2 = 3$$

The Y=X+2 component model would have a total squared error loss of:

$$\text{TotalLoss} = ((0+2)-1)^2 + ((1+2)-2)^2 + ((2+2)-3)^2 = 3$$

For a 2-component weighted average model with a scalar output, the output should always be between the outputs of each component model. Furthermore, if you have such a model, and one component is getting the answers exactly correct while the other isn’t, you can always get a lower loss by giving more weight to the component that is exactly correct. So I would expect a gradient descent process to do that.
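For what it’s worth, the arithmetic above is easy to check numerically. Here is a minimal sketch (the names m_minus, m_plus, exact, and wrong are just mine for this example):

```python
import numpy as np

# The 3 data points from the Y = X + 1 example.
X = np.array([0.0, 1.0, 2.0])
Y = np.array([1.0, 2.0, 3.0])

def sq_loss(pred, target):
    """Total squared error over the dataset."""
    return float(np.sum((pred - target) ** 2))

m_minus = X        # component model Y = X
m_plus = X + 2.0   # component model Y = X + 2

print(sq_loss(m_minus, Y))                       # 3.0 (Y = X alone)
print(sq_loss(m_plus, Y))                        # 3.0 (Y = X + 2 alone)
print(sq_loss(0.5 * m_minus + 0.5 * m_plus, Y))  # 0.0 (equal-weight mixture)

# And the closing claim: when one component is exactly right,
# shifting weight toward it strictly lowers the loss.
exact = X + 1.0    # a component that is exactly correct
wrong = X          # a component that is off by 1 everywhere
for w in (0.5, 0.8, 1.0):
    print(w, sq_loss(w * exact + (1 - w) * wrong, Y))
# losses: 0.75, 0.12, 0.0 -- falling monotonically as w -> 1
```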
I don’t think ML engineers will pass the models’ weights in as inputs to the models themselves (except maybe for certain tasks, like game-theoretic simulations). The worry is that data spills easily, and that SGD might find absurd, unpredictable ways to sneak the weights (or some other correlated variable) into the model.
From the description, it sounded to me like this instance of gradient descent is treating the outputs of the component models M− and M+ as features in a linear-regression-type problem.
In such a case, I would not expect data about the weights of each model to “spill” or in any way affect the output of either component model (unless the machine learning engineers are deliberately altering the data inputs depending on what the weights are, or something like that, and I see no reason why they would do that).
If it is a different situation—like if a neural net, or some part or layers of a neural net, is a “gradient hacker”—I would expect under normal circumstances that gradient descent would also be optimizing the parameters within that part or those layers.
So barring some outside interference with the gradient descent process, I don’t see any concrete scenario in which gradient hacking could occur (unless the gradient hacking concept includes more mundane phenomena like “getting stuck in a local optimum”).
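To make that reading concrete, here’s a minimal sketch of the setup as I’m imagining it (the component outputs are hard-coded as fixed feature columns; the naming is mine). Gradient descent only ever touches the mixture weights, so nothing about those weights feeds back into what M− or M+ compute:

```python
import numpy as np

# Outputs of the two component models, treated as fixed features
# in a linear-regression-type problem (same data as above).
X = np.array([0.0, 1.0, 2.0])
Y = np.array([1.0, 2.0, 3.0])
features = np.stack([X, X + 2.0], axis=1)  # columns: M- output, M+ output

w = np.array([0.9, 0.1])  # initial mixture weights
lr = 0.02
for _ in range(1000):
    residual = features @ w - Y
    grad = 2.0 * features.T @ residual  # gradient of total squared error w.r.t. w
    w = w - lr * grad

# The feature matrix never changed: updating w cannot "spill" into the
# components' outputs. w converges to ~[0.5, 0.5], the zero-loss mixture.
print(w)
```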
For a 2-component weighted average model with a scalar output, the output should always be between the outputs of each component model.
Hm, I see your point. I retract my earlier claim. This model wouldn’t apply to that task. I’m struggling to generate a concrete example where the loss would actually be a linear combination of the sub-models’ losses. However, I (tentatively) conjecture that in large networks trained on complex tasks, the loss can be roughly approximated as a linear combination of the losses of subnetworks (with the caveats of weird correlations, and of tasks where partial combinations work well, like the function approximation above).
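One thing that might help here: for squared error the relationship is exact rather than linear. With weights w1 + w2 = 1, expanding the square gives (this is, I believe, a two-component case of what’s sometimes called the ambiguity decomposition):

$$(w_1 f_1 + w_2 f_2 - y)^2 = w_1 (f_1 - y)^2 + w_2 (f_2 - y)^2 - w_1 w_2 (f_1 - f_2)^2$$

So the mixture’s loss is the weighted sum of the component losses minus a non-negative diversity term, and it is a linear combination of the sub-models’ losses exactly when the components agree (the diversity term vanishes). In the Y=X+1 example the diversity term is 0.25 · 4 = 1 per point, which is what cancels the component losses down to zero.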
I would expect under normal circumstances that gradient descent would also be optimizing the parameters within that part or those layers.
I agree, but the question of which direction SGD changes the model in (i.e. how it changes f) seems to have some recursive element, analogous to the situation above. If the model is really close to the f above, then I would imagine there’s some optimization pressure to update it towards f. That’s just a hunch, though. I don’t know how close it would have to be.