You minimize the L1-norm subject to correctly predicting all the training examples. Because of the way the training examples were generated, this will yield at most 100 non-zero coefficients.
It can be proved that this problem is solvable in polynomial time via a reduction to linear programming: with m = 10,000 training pairs (x_i, y_i), minimize sum_j |u_j| subject to u · x_i = y_i for i = 1, ..., m.
You can further manipulate it to get rid of the absolute values: for each coefficient u_j introduce two non-negative variables u_j^+ and u_j^-, with u_j = u_j^+ - u_j^- and |u_j| replaced by u_j^+ + u_j^- in the objective. The result is a standard linear program.
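For concreteness, here is a minimal sketch of that LP reduction using scipy.optimize.linprog. The problem sizes are scaled down from the numbers in the thread so it runs in a few seconds, and all variable names are my own:

```python
# Toy-scale sketch of the LP reduction described above
# (sizes shrunk from n = 10^6, m = 10^4, k = 100).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 200, 60, 5                      # coefficients, examples, non-zeros

u_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
u_true[support] = rng.normal(size=k)      # Gaussian coefficients, as in the setup

X = rng.normal(size=(m, n))               # random Gaussian design matrix
y = X @ u_true                            # exact (noise-free) observations

# minimize sum_j (u_j^+ + u_j^-)  subject to  X (u^+ - u^-) = y,  u^+, u^- >= 0
c = np.ones(2 * n)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")

u_hat = res.x[:n] - res.x[n:]             # recover u from the split variables
print("non-zeros recovered:", np.sum(np.abs(u_hat) > 1e-6))
print("max deviation from true u:", np.max(np.abs(u_hat - u_true)))
```

With noise-free observations and this many examples relative to the sparsity, u_hat should match u_true to solver precision.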
Further reading shows that sparse approximation (http://en.wikipedia.org/wiki/Sparse_approximation) is supposed to be NP-hard, so it can’t be that L1-norm minimization produces the “L0-norm” minimum every time.
The paper http://statweb.stanford.edu/~donoho/Reports/2004/l1l0approx.pdf, which is given as the Wikipedia reference for L1 producing L0 under certain conditions, only talks about near-solutions, not exact solutions.
Also, Jacob originally specified that the coefficients were drawn from a Gaussian, and nobody seems to be using that fact.
Yes, but if I understand correctly it occurs with probability 1 for many classes of probability distributions (including this one, I think).
The paper says that if the L0-pseudonorm solution has error epsilon, then the L1-norm solution has error at most C*epsilon, for some positive constant C. In the exact case epsilon is zero, hence the two solutions are equivalent.
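Writing that claim out symbolically (my paraphrase, reading the error as distance to the sparsest solution):

$$\|\hat{u}_{\ell_1} - \hat{u}_{\ell_0}\|_2 \;\le\; C\,\epsilon \quad\text{for some fixed } C > 0,$$

so in the exact case epsilon = 0 and the L1 minimizer coincides with the L0 one.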
You don’t really need the fact for the exact case. In the inexact case, you can use it in the form of an additional L2-norm regularization.
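Concretely, one way to write that down (my notation; the penalty weights lambda_1, lambda_2 are free parameters, not from the thread):

$$\min_u \;\; \|Xu - y\|_2^2 \;+\; \lambda_1 \|u\|_1 \;+\; \lambda_2 \|u\|_2^2,$$

i.e. an elastic-net-style objective, with the squared-L2 term standing in for the Gaussian the non-zero coefficients were drawn from.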
Note that in the inexact case (i.e. with observation error) this model (the Lasso) fits comfortably in a Bayesian framework, with a double-exponential prior on u. Leon already made this point below and jsteinhardt replied.
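Spelling out that correspondence (the standard Lasso/MAP calculation, in my notation): with Gaussian observation noise of variance sigma^2 and an i.i.d. double-exponential (Laplace) prior of scale b on each coefficient, the negative log-posterior is, up to an additive constant,

$$\frac{1}{2\sigma^2}\,\|Xu - y\|_2^2 \;+\; \frac{1}{b}\,\|u\|_1,$$

so the MAP estimate is exactly the Lasso with penalty weight lambda = 2*sigma^2/b.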
As long as the matrix formed by the x_i satisfies the “restricted isometry property”, the optimization problem given by V_V above will recover the optimal solution. For m >> k*log(n) (where in this case m = 10,000, k = 100, and n = 10^6), a random Gaussian matrix will satisfy this property with overwhelmingly high probability. I should probably have checked that the particular numbers I gave will work out, although I’m pretty sure they will, and if not we can replace 10,000 with 20,000 and the point still stands.
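For what it’s worth, a quick check of the arithmetic (natural logs; the base only changes the constant):

$$k \log n = 100 \cdot \ln(10^6) \approx 1.4 \times 10^3, \qquad k \log(n/k) = 100 \cdot \ln(10^4) \approx 9.2 \times 10^2,$$

both about an order of magnitude below m = 10^4, so the stated numbers should work out unless the constant hidden in the “>>” is larger than roughly 10.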