I assume you mean y_n = u^T x_n in the “infer u” problem? Or am I missing something?
Also, is there a good real-world problem which this reflects?
Yes, I mixed up x and y, good catch. It’s not trivial for me to fix this while maintaining WordPress compatibility, but I’ll try to do so in the next few days.
This problem is called the “compressed sensing” problem and is most famously used to speed up MRI scans. However, it has also had a multitude of other applications; see here: http://en.wikipedia.org/wiki/Compressed_sensing#Applications.
Many L1 constraint-based algorithms (for example the LASSO) can be interpreted as producing maximum a posteriori Bayesian point estimates with Laplace (= double exponential) priors on the coefficients.
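(For concreteness, here is the standard correspondence, under the assumption of Gaussian observation noise with variance sigma^2 and independent Laplace(0, b) priors on the coefficients. The negative log-posterior is, up to an additive constant,

(1/(2*sigma^2)) * sum_i (y_i - u^T x_i)^2 + (1/b) * sum_j |u_j|,

so maximizing the posterior is the same as minimizing sum_i (y_i - u^T x_i)^2 + lambda * sum_j |u_j| with lambda = 2*sigma^2 / b, which is the LASSO objective.)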
Yes, but in this setting maximum a posteriori (MAP) doesn’t make any sense from a Bayesian perspective. MAP is supposed to be a point estimate of the posterior, but in this case the MAP solution will be sparse, whereas the posterior under a Laplace prior places zero mass on sparse solutions. So the MAP estimate doesn’t even qualitatively approximate the posterior.
Ah, good point. It’s like the prior, considered as a regularizer, is too “soft” to encode the constraint we want.
A Bayesian could respond that we rarely actually want sparse solutions (in what situation is a physical parameter identically zero?), but rather solutions which have many near-zeroes with high probability. The posterior would satisfy this, I think. In this sense a Bayesian could justify the Laplace prior as approximating a so-called “spike-and-slab” prior (which I believe leads to combinatorial intractability similar to the full L0 solution).
Also, without L0 the frequentist doesn’t get fully sparse solutions either. The shrinkage is gradual; sometimes there are many tiny coefficients along the regularization path.
[FWIW I like the logical view of probability, but don’t hold a strong Bayesian position. What seems most important to me is getting the semantics of both Bayesian (= conditional on the data) and frequentist (= unconditional, and dealing with the unknowns in some potentially nonprobabilistic way) statements right. Maybe there’d be less confusion—and more use of Bayes in science—if “inference” were reserved for the former and “estimation” for the latter.]
See this comment. You actually do get sparse solutions in the scenario I proposed.
Cool; I take that back. Sorry for not reading closely enough.
Okay, I’m somewhat leaving my expertise here and going on intuition, but I would be somewhat surprised if the problem exactly as you stated it turned out to be solvable by a compressed-sensing algorithm as roughly described on Wikipedia. I was trying to figure out how I’d approach the problem you stated, using techniques I already knew about, but it seemed to me more like a logical constraint problem than a stats problem, because we had to end up with exactly 100 nonzero coefficients and those 100 coefficients had to exactly fit the observations y, in what I assume to be an underdetermined problem when treated as a linear problem. (In fact, my intuitions were telling me that this ought to correspond to some kind of SAT problem and maybe be NP-hard.) Am I wrong? The Wikipedia description talks about using L1-norm-style techniques to implement an “almost all coefficients are 0” norm, a.k.a. the “L0 norm”, but it doesn’t actually say that the exact number of coefficients is known, nor that the observations are presumed to be noiseless.
You minimize the L1-norm subject to correct prediction on all the training examples. Because of the way the training examples were generated, this will yield at most 100 non-zero coefficients.
It can be proved that the problem is solvable in polynomial time, via a reduction to linear programming. Let m = 10,000 be the number of training examples; the problem is to

minimize sum_j |u_j|
subject to u^T x_i = y_i, for i = 1, ..., m.

You can further manipulate it to get rid of the absolute value. For each coefficient introduce two non-negative variables u_j^+ and u_j^-, set u_j = u_j^+ - u_j^-, and

minimize sum_j (u_j^+ + u_j^-)
subject to (u^+ - u^-)^T x_i = y_i for i = 1, ..., m, with u_j^+, u_j^- >= 0.

At an optimum at most one of u_j^+, u_j^- is non-zero for each j, so the objective equals the L1-norm of u, and this is a standard linear program.
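Here is a small numerical sketch of that reduction. The sizes are shrunk from the ones in the post so the LP solves in a second or two, and the scipy-based formulation is just one convenient way to set it up, not the only one:

```python
import numpy as np
from scipy.optimize import linprog

# Shrunk-down instance: n coefficients, k of them non-zero,
# m noiseless linear measurements with Gaussian random x_i.
rng = np.random.default_rng(0)
n, k, m = 200, 10, 100
u_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
u_true[support] = rng.normal(size=k)

X = rng.normal(size=(m, n))   # rows are the x_i
y = X @ u_true                # exact observations, no noise

# L1 minimization as a linear program:
#   minimize   sum_j (u_plus_j + u_minus_j)
#   subject to X (u_plus - u_minus) = y,   u_plus, u_minus >= 0
c = np.ones(2 * n)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))

u_hat = res.x[:n] - res.x[n:]
print("max |u_hat - u_true|:", np.max(np.abs(u_hat - u_true)))
print("recovered non-zeros :", int(np.sum(np.abs(u_hat) > 1e-6)))
```

When the recovery conditions discussed elsewhere in the thread hold, u_hat matches u_true to solver precision and has exactly k non-zero entries.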
Further reading shows that sparse approximation (http://en.wikipedia.org/wiki/Sparse_approximation) is supposed to be NP-hard, so it can’t be that the L1-norm minimum produces the “L0-norm” minimum every time.
The paper at http://statweb.stanford.edu/~donoho/Reports/2004/l1l0approx.pdf, which is given as the Wikipedia reference for L1 producing L0 under certain conditions, only talks about near-solutions, not exact solutions.
Also Jacob originally specified that the coefficients were drawn from a Gaussian and nobody seems to be using that fact.
Yes, but if I understand correctly it occurs with probability 1 for many classes of probability distributions (including this one, I think).
It says that if the L0-pseudonorm solution has an error epsilon, then the L1-norm solution has error up to C*epsilon, for some positive C. In the exact case, epsilon is zero, hence the two solutions are equivalent.
You don’t really need the fact for the exact case. In the inexact case, you can use it in the form of an additional L2-norm regularization.
Note that in the inexact case (i.e. observation error) this model (the Lasso) fits comfortably in a Bayesian framework (double exponential prior on u). Leon already made this point below and jsteinhardt replied.
As long as the matrix formed by the x_i satisfies the “restricted isometry property”, the optimization problem given by V_V above will recover the optimal solution. For m >> k*log(n) (where in this case m = 10,000, k = 100, and n = 10^6), a random Gaussian matrix will satisfy this property with overwhelmingly large probability. I should probably have checked that the particular numbers I gave will work out, although I’m pretty sure they will, and if not we can replace 10,000 with 20,000 and the point still stands.
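As a rough sanity check on those numbers (ignoring the constant in front of k*log(n), so this only shows the margin, not a guarantee):

```python
import math

m, k, n = 10_000, 100, 10**6
print(k * math.log(n))   # ~1381.6, well below m = 10,000
```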
My impression of compressed sensing is that you have a problem where you don’t know it’s exactly 100 coefficients: you’re looking for the result that uses the minimum number of coefficients. Looking for the minimum takes more work than looking for one that’s exactly 100, and you’re right that this is binary optimization, which is a giant pain.

The beautiful compressed sensing result is that for a broad class of problems, the solution to the L1-norm minimum problem (which is much easier to solve) is also the solution to the L0-norm minimum problem. (For anyone not familiar, the L1 norm means “add the absolute values of the coefficients,” and the L0 norm means “count the number of non-zero coefficients,” which is not actually a norm.) In this particular context, I would look for the L1 minimum norm, and either it’s less than or equal to 100 coefficients, in which case I’m okay, or it’s more than 100 coefficients, in which case (so long as the conditions of the compressed sensing theorem apply) I know the problem as stated is infeasible, and I’ve found the least infeasible solution.
Compressed sensing in signal processing may help give an intuitive overview of what’s going on. Consider a stream of data samples S, collected periodically at a rate of R samples per second. According to the sampling theorem, we can perfectly reconstruct a continuous signal from those samples up to a frequency of R/2. So for samples taken every 1/100th of a second, we can perfectly reconstruct signals from 0 to 50 Hz.
Now, take two different subsets of S and compare them:
if we take a subset of S at fixed intervals, we can only reconstruct part of the original signal perfectly. The part that we can reconstruct is at a lower frequency: for example, if we take every other sample from S, we can only reconstruct perfectly up to a limit of 25 Hz. Frequencies above that cannot be uniquely determined.
if we take a completely random subset of S, we can reconstruct the entire frequency range, but we cannot reconstruct it perfectly. This is similar to holography, where cutting the hologram in half still reproduces the whole image, but at lower resolution.
We use fixed subsampling when we know our signal is band limited, and that we can safely throw away some band of frequencies.
We use random sampling (compressed sensing) when we have a signal that is not band limited, but is sparse enough that we can still describe it accurately using a smaller number of data points. The sparser the signal is in frequency, the more accurate our estimate will be. For most compressed sensing applications, the signal can be very sparse indeed, and the number of needed samples can be quite low.
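Here is a small sketch of the fixed-subsampling half of that picture, with numbers of my own choosing rather than anything from the thread: a 40 Hz tone sampled at 100 Hz sits below the 50 Hz limit, but after keeping every other sample the effective rate is 50 Hz, so the tone aliases down to 10 Hz and becomes indistinguishable from a genuine 10 Hz tone.

```python
import numpy as np

fs = 100                              # original sampling rate (samples/second)
t = np.arange(0, 1, 1 / fs)           # one second of samples
signal = np.sin(2 * np.pi * 40 * t)   # a 40 Hz tone, below the 50 Hz limit

def dominant_freq(x, rate):
    """Frequency bin with the largest magnitude, DC excluded."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / rate)
    return freqs[1:][np.argmax(spectrum[1:])]

print(dominant_freq(signal, fs))           # 40.0 -> recoverable at 100 Hz
print(dominant_freq(signal[::2], fs / 2))  # 10.0 -> aliased after decimation
```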
Why binary optimization?
I think I was thinking of it as including m binary variables, which I’ll call z_j, which indicate whether or not the u_j are nonzero; then an L0 minimization makes the cost function the sum of the z_j, or an L0 constraint requires that their sum equal 100. (Standard caveat that L0 is not actually a norm, which I should edit in to the previous post.)
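(Spelled out, and assuming we know some bound M with |u_j| <= M for every j, the usual way to tie the z_j to the u_j is:

minimize sum_j z_j   (or constrain sum_j z_j = 100)
subject to u^T x_i = y_i for every training example i,
-M*z_j <= u_j <= M*z_j for every coefficient j,
z_j in {0, 1},

so u_j is forced to zero whenever z_j = 0. The binary z_j are what make this a mixed-integer program rather than a plain LP.)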
The SAT problem Eliezer mentioned is binary (well, boolean), and it felt awkward to claim that it’s a SAT problem when that could be mistaken as a question on the other SAT, so I decided to switch names without moving too far in concept-space. Your explanation seems much neater than mine, though.
[edit]I should be clear that I mean that looking for the L0 minimization directly is at least as hard as doing it the binary optimization way, but that the L1 minimization problem, which indirectly solves the L0 minimization problem only sometimes (read: almost always), is polynomial time, i.e. much easier to solve, because it’s linear.
Ok.