Thanks, I was wondering what people referred to when mentioning PAC-Bayes bounds. I am still a bit confused. Could you explain how L(π) and ^L(π) depend on π0 (if they do), and how to interpret the final inequality in this light? I am particularly wondering because the bound seems to be tightest when π = π0. Minor comment: I think n = m?
The distribution π is meant to be a posterior, chosen after seeing the data. If you have a good prior you could indeed take π = π0, but then the loss L(π0) could be high. You want to trade off the cost of updating the prior (the KL(π‖π0) term) against the reduction in loss.
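To make the trade-off concrete, a λ-parameterized PAC-Bayes bound typically has the following shape for a loss in [0,1] (I am not assuming these are exactly the constants in the inequality above, just the generic form): with probability at least 1 − δ over the sample, simultaneously for all posteriors π,

$$
L(\pi) \;\le\; \hat L(\pi) \;+\; \frac{\mathrm{KL}(\pi \,\|\, \pi_0) + \log\frac{1}{\delta}}{\lambda} \;+\; \frac{\lambda}{8n}.
$$

Taking π = π0 makes the KL term vanish but leaves you with ^L(π0), which is usually large; moving π toward low-loss regions shrinks ^L(π) at the price of a larger KL(π‖π0).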
For example, say we have a neural network. Then the prior π0 would be (a distribution over) the initialization, and the posterior π would be the distribution of the weights that SGD outputs.
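Here is a rough numerical sketch of that picture (the Gaussian prior around the init, the Gaussian posterior around the SGD solution, the chosen σ values, and the λ-form of the bound are all illustrative assumptions, not the construction from this thread):

```python
# A toy numerical sketch of the trade-off above. The Gaussian prior/posterior,
# the sigma values, and the lambda-form of the bound are illustrative
# assumptions, not the exact construction from this thread.
import numpy as np

def kl_diag_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I) ) for isotropic Gaussians."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return np.sum(np.log(sigma_p / sigma_q)
                  + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
                  - 0.5)

def pac_bayes_lambda_bound(emp_loss, kl, n, lam, delta=0.05):
    """L(pi) <= L_hat(pi) + (KL + log(1/delta))/lambda + lambda/(8n), losses in [0,1]."""
    return emp_loss + (kl + np.log(1.0 / delta)) / lam + lam / (8.0 * n)

rng = np.random.default_rng(0)
d, n = 1_000, 50_000                         # number of weights, number of data points
w_init = rng.normal(size=d)                  # prior pi_0: Gaussian around the initialization
w_sgd = w_init + 0.1 * rng.normal(size=d)    # posterior pi: Gaussian around the SGD output

kl = kl_diag_gauss(w_sgd, 0.05, w_init, 0.10)
emp_loss = 0.08                              # pretend pi achieves 8% empirical error

# lambda trades the KL/confidence term against lambda/(8n); picking it from a
# grid would require a union bound over the grid in a rigorous statement.
for lam in (100.0, 1_000.0, 10_000.0):
    print(f"lambda={lam:>8.0f}  bound={pac_bayes_lambda_bound(emp_loss, kl, n, lam):.3f}")
```

Making π more concentrated on low-loss weights lowers emp_loss but drives the KL term up, which is the trade-off in practice.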
(Btw thanks for the correction)
Thanks, I finally got it. What I just now fully understood is that the final inequality holds with high π0^n probability (i.e., as you say, π0 is the data), while the learning bound or loss reduction is given for π.
I’m still confused about the part where you use the Hoeffding inequality: how are the lambda in that step and the lambda in the loss function “the same lambda”?
Because f = λ⋅ΔL, where ΔL = L − ^L is exactly the gap that the Hoeffding step controls; the λ in the exponent there and the λ in the final bound are the same. Does that help?
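To spell that out, here is the standard chain for a loss in [0,1] (I am not claiming these are exactly the constants from above, just the structure). For a fixed hypothesis h, Hoeffding's lemma applied to f = λ(L(h) − ^L(h)) gives

$$
\mathbb{E}_{S}\!\left[e^{\lambda\left(L(h)-\hat L(h)\right)}\right] \;\le\; e^{\lambda^{2}/(8n)} .
$$

Taking the expectation over h ∼ π0, applying Markov's inequality, and then the Donsker-Varadhan change of measure gives, with probability at least 1 − δ over the sample, simultaneously for all π,

$$
\lambda\left(L(\pi)-\hat L(\pi)\right) \;\le\; \mathrm{KL}(\pi\,\|\,\pi_0) \;+\; \log\tfrac{1}{\delta} \;+\; \tfrac{\lambda^{2}}{8n},
$$

and dividing by λ > 0 gives the final inequality. The λ in the exponent of the Hoeffding step is never replaced along the way, so it is literally the same λ that appears in the bound.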