The term π is meant to be a posterior distribution, i.e. one chosen after seeing the data. If you have a good prior you could take π = π0; however, note that L(π0) could then be high. You want a trade-off between the cost of updating the prior and the reduction in loss.
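For concreteness, a generic McAllester-style PAC-Bayes bound makes this trade-off explicit. The exact inequality under discussion isn't quoted in the thread, so take the following (with empirical loss L̂_n, sample size n, and confidence level δ, all notation introduced here) as an illustrative sketch rather than the actual result:

```latex
% Illustrative McAllester-style PAC-Bayes bound (not necessarily the exact
% inequality under discussion; constants and log factors vary by version).
% With probability at least 1 - \delta over the sample, for all posteriors \pi:
\[
  L(\pi) \;\le\; \widehat{L}_n(\pi)
  \;+\; \sqrt{\frac{\mathrm{KL}(\pi \,\|\, \pi_0) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}.
\]
% Taking \pi = \pi_0 makes the KL term vanish but may leave \widehat{L}_n(\pi_0)
% large; moving \pi toward low empirical loss shrinks \widehat{L}_n(\pi) at the
% price of a larger KL(\pi \| \pi_0) -- the trade-off described above.
```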
For example, say we have a neural network. Then the prior π0 would be the distribution over random initializations, and the posterior π would be the distribution of the weights that SGD outputs.
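To make that concrete, here is a minimal, self-contained sketch (a toy logistic-regression stand-in with synthetic data; the model, data, and all names are made up for illustration, not taken from the thread): draws from the prior π0 are random initializations, and draws from the posterior π are the weights SGD returns when started from those initializations.

```python
# Toy illustration: prior = distribution over random initializations,
# posterior = distribution of the weights returned by SGD.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data (hypothetical stand-in for a real task).
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)

def sgd_from_init(w0, steps=500, lr=0.1):
    """Run plain SGD on the logistic loss starting from w0; return the final weights."""
    w = w0.copy()
    for _ in range(steps):
        i = rng.integers(n)
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w -= lr * (p - y[i]) * X[i]
    return w

# Draws from the "prior" pi_0: random initializations.
prior_samples = [rng.normal(scale=1.0, size=d) for _ in range(20)]

# Draws from the "posterior" pi: the weights SGD outputs from each initialization.
posterior_samples = [sgd_from_init(w0) for w0 in prior_samples]

def avg_loss(ws):
    """Average logistic loss over a set of weight samples (Gibbs-style loss)."""
    losses = []
    for w in ws:
        p = 1.0 / (1.0 + np.exp(-X @ w))
        losses.append(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))
    return float(np.mean(losses))

print("loss under pi_0 (prior):    ", avg_loss(prior_samples))
print("loss under pi  (posterior): ", avg_loss(posterior_samples))
```

Running SGD from many prior draws is just one way to obtain samples from π here; the point is only that π0 is fixed before training while π depends on the data.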
(Btw thanks for the correction)
Thanks, I finally got it. What I only now fully understood is that the final inequality holds with high π0^n probability (i.e., as you say, π0 is the data distribution), while the learning bound, or loss reduction, is given for π.