I think you’re perhaps reading me as being more bullish on Bayesian methods than I in fact am—I am not necessarily saying that Bayesian methods in fact can solve OOD generalisation in practice, nor am I saying that other methods could not also do this. In fact, I was until recently very skeptical of Bayesian methods, before talking about it with Yoshua Bengio. Rather, my original reply was meant to explain why the Bayesian aspect of Bengio’s research agenda is a core part of its motivation, in response to your remark that “from my understanding, the bayesian aspect of [Bengio’s] agenda doesn’t add much value”.
I agree that if a Bayesian learner uses the NN prior, then its behaviour should—in the limit—be very similar to training a large ensemble of NNs. However, there could still be advantages to an explicitly Bayesian method. For example, off the top of my head:
1. It may be that you need an extremely large ensemble to approximate the posterior well, and that the Bayesian learner can approximate it much better with far fewer resources.
2. It may be that you can more easily prove learning-theoretic guarantees for the Bayesian learner.
3. It may be that a Bayesian learner makes it easier to condition on events that have a very small probability in your posterior (such as, for example, the event that a particular complex plan is executed).
4. It may be that the Bayesian learner has a more interpretable prior, or that you can reprogram it more easily.
And so on; these are just some examples. Of course, whether you get these benefits in practice is a matter of speculation until we have a concrete algorithm to analyse. All I’m saying is that there are valid and well-motivated reasons to explore this particular direction.
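As a toy illustration of the first and third examples above, here is a minimal sketch. Everything in it is an assumption made for illustration: a coin-bias problem with a uniform prior over a grid of hypotheses stands in for the learner, "the ensemble" is idealised as a finite set of independent draws from the exact posterior, and all function names are hypothetical. It says nothing about how GFlowNets or any real algorithm would behave.

```python
import random


def exact_posterior(thetas, heads, tails):
    """Exact posterior over a grid of coin biases under a uniform prior."""
    weights = [t ** heads * (1 - t) ** tails for t in thetas]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_predictive(thetas, post):
    """Probability that the next toss is heads, averaged over the posterior."""
    return sum(t * p for t, p in zip(thetas, post))


def sample_ensemble(thetas, post, size, rng):
    """Idealised ensemble: `size` hypotheses drawn from the exact posterior."""
    return rng.choices(thetas, weights=post, k=size)


def condition_exact(thetas, post, event):
    """Condition the posterior on `event` by renormalising its mass."""
    mass = sum(p for t, p in zip(thetas, post) if event(t))
    cond = [p / mass if event(t) else 0.0 for t, p in zip(thetas, post)]
    return cond, mass


def condition_ensemble(ensemble, event):
    """Condition an ensemble on `event` by discarding members outside it."""
    return [t for t in ensemble if event(t)]


if __name__ == "__main__":
    thetas = [i / 100 for i in range(1, 100)]
    post = exact_posterior(thetas, heads=7, tails=3)

    def rare(t):
        # An event with very small posterior mass
        return t < 0.2

    cond, mass = condition_exact(thetas, post, rare)
    ensemble = sample_ensemble(thetas, post, size=20, rng=random.Random(0))
    survivors = condition_ensemble(ensemble, rare)

    print("exact predictive:", posterior_predictive(thetas, post))
    print("ensemble predictive:", sum(ensemble) / len(ensemble))
    print("posterior mass of rare event:", mass)
    print("ensemble members inside rare event:", len(survivors))
```

With an explicit posterior, conditioning on the rare event is just a renormalisation. With a small ensemble, the filtered list will typically be empty, so you would need a very large ensemble before you could say anything about that event. Whether a practical Bayesian learner retains this advantage at scale is exactly the open question.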
Insofar as the hope is:
1. Figure out how to approximate sampling from the Bayesian posterior (using e.g. GFlowNets or something).
2. Do something else that makes this actually useful for “improving” OOD generalization in some way.
It would be nice to know what (2) actually is and why we needed step (1) for it. As far as I can tell, Bengio hasn’t stated any particular hope for (2) which depends on (1).
Rather, my original reply was meant to explain why the Bayesian aspect of Bengio’s research agenda is a core part of its motivation, in response to your remark that “from my understanding, the bayesian aspect of [Bengio’s] agenda doesn’t add much value”.
I agree that if the Bayesian aspect of the agenda did a specific useful thing like ‘“improve” OOD generalization’ or ‘allow us to control/understand OOD generalization’, then this aspect of the agenda would be useful.
However, I think the Bayesian aspect of the agenda won’t do this and thus it won’t add much value. I agree that Bengio (and others) think that the Bayesian aspect of the agenda will do things like this, but I disagree and don’t see the story for this.
I agree that “actually use Bayesian methods” sounds like the sort of thing that could help you solve dangerous OOD generalization issues, but I don’t think it clearly does.
(Unless of course someone has a specific proposal for (2) from my above decomposition which actually depends on (1).)
However, there could still be advantages to an explicitly Bayesian method. For example, off the top of my head
(1)-(3) don’t seem promising or important to me. (4) would be useful, but I don’t see why we would need the Bayesian aspect for it. If we have some sort of parametric model class which we can make smart enough to reason effectively about the world, just making an ensemble of these surely gets you most of the way there.
To be clear, if the hope is “figure out how to make an ensemble of interpretable predictors which are able to model the world as well as our smartest model”, then this would be very useful (e.g. it would allow us to avoid ELK issues). But all the action would be in making interpretable predictors; no Bayesian aspect would be required.