Ah, good point. It’s like the prior, considered as a regularizer, is too “soft” to encode the constraint we want.
A Bayesian could respond that we rarely actually want sparse solutions (in what situation is a physical parameter identically zero?) but rather solutions that have many near-zeroes with high probability. The posterior would satisfy this, I think. In this sense a Bayesian could justify the Laplace prior as approximating a so-called "spike-and-slab" prior (which I believe leads to combinatorial intractability similar to the full L0 problem).
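For reference, the usual spike-and-slab form (notation mine, not from the original comment) puts, independently on each coefficient,

\pi(\beta_j) = \pi_0 \, \delta_0(\beta_j) + (1 - \pi_0) \, \mathcal{N}(\beta_j \mid 0, \tau^2),

where \pi_0 is the prior probability that \beta_j is exactly zero: a point mass ("spike") at zero mixed with a diffuse "slab". Exact posterior computation has to weigh all 2^p spike/slab configurations, which is the combinatorial blow-up alluded to above.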
Also, without an L0 penalty the frequentist doesn't get fully sparse solutions either. The shrinkage is gradual; sometimes there are many tiny coefficients along the regularization path.
[FWIW I like the logical view of probability, but don’t hold a strong Bayesian position. What seems most important to me is getting the semantics of both Bayesian (= conditional on the data) and frequentist (= unconditional, and dealing with the unknowns in some potentially nonprobabilistic way) statements right. Maybe there’d be less confusion—and more use of Bayes in science—if “inference” were reserved for the former and “estimation” for the latter.]
See this comment. You actually do get sparse solutions in the scenario I proposed.
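To make that concrete, here is a minimal sketch (my own synthetic data and alpha value, purely illustrative) showing that an L1 (lasso) fit sets many coefficients exactly to zero, not just near zero:

# Minimal check that an L1 (lasso) fit yields exactly-zero coefficients.
# The synthetic data and regularization strength are illustrative choices.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 100, 50, 5                  # samples, features, true nonzeros
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:k] = rng.normal(size=k)    # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=n)

fit = Lasso(alpha=0.1).fit(X, y)
print("coefficients exactly zero:", int(np.sum(fit.coef_ == 0.0)), "of", p)

The soft-thresholding step in the coordinate-descent solver is what produces exact zeros, which is the point being made here.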
Cool; I take that back. Sorry for not reading closely enough.