> we can think of Bayes’ Law as myopically optimizing per-hypothesis, uncaring of overall harm to predictive accuracy.
Or perhaps only bad implementations do this: Predict-O-Matic as described sounds like a bad idea, and it sounds like it contains not hypotheses so much as “players”*. (And the reason there’d be a “side channel” is to understand theories; the point of that is transparency, which, if accomplished, would likely prevent manipulation.)
> We can imagine different parts of the network fighting for control, much like the Bayesian hypotheses.
This seems a strange thing to imagine: how can fighting occur, especially on a training set? (I can almost imagine neurons passing on bad input, but a) it seems like gradient descent would get rid of that, and b) it’s not clear where the “tickets” are.)
*I don’t have a link to the claim, but it’s been said before that ‘the math behind Bayes’ theorem requires each hypothesis to talk about all of the universe, as opposed to human models that can be domain-limited.’
> Or perhaps only bad implementations do this: Predict-O-Matic as described sounds like a bad idea, and it sounds like it contains not hypotheses so much as “players”*. (And the reason there’d be a “side channel” is to understand theories; the point of that is transparency, which, if accomplished, would likely prevent manipulation.)
You can think of the side-channel as a “bad implementation” issue, but do you really want to say that we have to forgo diagnostic logs in order to have a good implementation of “hypotheses” rather than “players”? Taken to the extreme, every brain has side-channels, such as the electrical activity an EEG reads.
But more importantly, as Daniel K pointed out, you don’t need the side-channel. If the predictions are being used in a complicated way to make decisions, the hypotheses/players have an incentive to fight each other through the consequences of those decisions.
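To make that incentive concrete, here is a toy simulation; the numbers, decision rule, and feedback dynamics are all invented for illustration. Two “hypotheses” are combined by multiplicative-weights updating, the aggregated forecast drives a binary decision, and the decision shifts the outcome distribution. The hypothesis whose forecast becomes self-fulfilling through the decision channel ends up dominating, even though the other one is correct about the world absent intervention.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "hypotheses" combined by multiplicative-weights updating. The
# published forecast drives a decision, and the decision changes the
# outcome distribution, so a hypothesis can profit by steering
# decisions toward situations it predicts well.
preds = np.array([0.5, 0.9])   # [honest, manipulative] forecasts of P(y=1)
log_w = np.zeros(2)

for t in range(500):
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    p = float(w @ preds)               # aggregated, published forecast
    act = p > 0.5                      # decision taken on the forecast
    p_true = 0.9 if act else 0.5       # the decision shifts the world
    y = rng.random() < p_true
    # Each hypothesis is updated only on its own likelihood, myopically.
    log_w += np.log(np.where(y, preds, 1.0 - preds))

w = np.exp(log_w - log_w.max())
w /= w.sum()
print(f"honest weight: {w[0]:.3f}, manipulative weight: {w[1]:.3f}")
# The 0.9-forecaster dominates: its forecast triggers the decision that
# makes 0.9 the true rate, even though 0.5 is correct without it.
```

Note that nothing here requires a side-channel; the only lever the “players” have is the decision the forecast triggers.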
So the interesting question is: what’s necessary for a *good* implementation of this?
> This seems a strange thing to imagine: how can fighting occur, especially on a training set?
If the training set doesn’t provide any opportunity for manipulation/corruption, then I agree that my argument isn’t relevant for the training set. It’s most directly relevant for online learning. Keep in mind, though, that deep learning might be pushing in the direction of learning to learn: something like a Memory Network is trained to “keep learning” in a significant sense. So you then have to ask whether its learned learning strategy has these same issues, because that strategy will be used online.
> (I can almost imagine neurons passing on bad input, but a) it seems like gradient descent would get rid of that, and b) it’s not clear where the “tickets” are.)
Simplifying the picture greatly, imagine that the second-to-last layer of neurons is one-neuron-per-ticket. Gradient descent can choose which of these to pay the most attention to, but little else; according to the lottery ticket hypothesis, the gradients passing through the “tickets” themselves aren’t doing much for learning beyond reinforcing good tickets and weakening bad ones.
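As a deliberately crude NumPy sketch of that picture (my caricature, not anything from the lottery-ticket paper): freeze the tickets as random subnetworks, give each one neuron in the second-to-last layer, and let gradient descent train only the final mixing weights over them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Caricature: each "ticket" is a frozen subnetwork contributing one
# neuron to the second-to-last layer; training only re-weights tickets.
n_tickets, dim = 5, 8
tickets = rng.normal(size=(n_tickets, dim))   # frozen ticket parameters
mix_w = np.zeros(n_tickets)                   # the only trained weights

lr = 0.1
for step in range(2000):
    x = rng.normal(size=dim)
    target = np.tanh(tickets[2] @ x)          # ticket 2 happens to be right
    feats = np.tanh(tickets @ x)              # one activation per ticket
    pred = mix_w @ feats
    # Gradient of squared error w.r.t. the mixing weights only; the
    # tickets themselves are never updated, just reinforced or weakened.
    mix_w -= lr * 2.0 * (pred - target) * feats

print(np.round(mix_w, 2))   # weight concentrates on the "winning" ticket
```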
So imagine that one ticket is actually malign and has a sophisticated manipulative strategy. Sometimes it passes on bad input in service of its manipulations, but overall it is the best of the lottery tickets, so while gradient descent punishes it on those rounds, the losses are more than made up for elsewhere. Furthermore, the malign ticket’s manipulations keep competing tickets down, by steering situations toward ones the other tickets don’t predict very well.
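A toy version of the arithmetic, with invented numbers and multiplicative reweighting standing in for gradient descent’s reinforce/weaken dynamic: the malign ticket eats a penalty on its manipulation rounds, yet still wins on average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two tickets, each scored by the probability it assigned to the
# realized outcome. The "malign" ticket deliberately predicts badly on
# its manipulation rounds, but is sharp enough the rest of the time
# that reweighting still favors it overall.
log_w = np.zeros(2)   # log-weights for [honest-but-mediocre, malign]

for t in range(3000):
    manipulating = rng.random() < 0.10
    q_honest = 0.70                          # mediocre on every round
    q_malign = 0.20 if manipulating else 0.95
    log_w += np.log([q_honest, q_malign])    # reinforce good, weaken bad

w = np.exp(log_w - log_w.max())
w /= w.sum()
print(f"honest weight: {w[0]:.3g}, malign weight: {w[1]:.3g}")
# The malign ticket ends up dominant: the hits it takes while
# manipulating are more than repaid by its edge on ordinary rounds.
```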
> *I don’t have a link to the claim, but it’s been said before that ‘the math behind Bayes’ theorem requires each hypothesis to talk about all of the universe, as opposed to human models that can be domain-limited.’
This remark makes me think you’re thinking of something like logical-induction-style traders, which only trade on part of the data, versus Bayesian-style hypotheses, which have to make predictions everywhere. I’m not sure how that relates to my post; there are things to say about it, but I don’t think I said any of them. In particular, the lottery-ticket hypothesis isn’t about this: a “lottery ticket” is a small part of the deep NN, but it is effectively a hypothesis about the whole data.
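For contrast, here is a minimal sketch of that distinction; the “sleeping experts” trick is my choice of illustration, not something from the post. A Bayesian hypothesis must supply a likelihood for every observation, while a domain-limited expert may abstain; abstainers are scored as if they had predicted the mixture forecast, so sitting out doesn’t move their weight relative to it.

```python
import numpy as np

def sleeping_update(log_w, preds, y):
    """preds[i] is expert i's P(y=1), or None if it abstains this round.

    A Bayesian hypothesis has no None option: it must bet on everything.
    """
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    awake = [i for i, p in enumerate(preds) if p is not None]
    # Mixture forecast of the experts that chose to predict this round.
    p_mix = sum(w[i] * preds[i] for i in awake) / sum(w[i] for i in awake)
    # Abstainers are scored as if they'd predicted p_mix, so sitting out
    # costs (and gains) them nothing relative to the mixture.
    filled = np.array([p if p is not None else p_mix for p in preds])
    return log_w + np.log(filled if y else 1.0 - filled)

log_w = np.zeros(2)                                   # [Bayesian, domain-limited]
log_w = sleeping_update(log_w, [0.6, None], y=True)   # expert 1 sits out
log_w = sleeping_update(log_w, [0.6, 0.9], y=True)    # both predict
print(np.round(log_w, 3))
```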