Choosing better sparsity penalties than L1 (Upcoming post - Ben Wright & Lee Sharkey): [...] We propose a simple fix: Use Lp with 0<p<1 instead of L1, which seems to be a Pareto improvement over L1
Is there any particular justification for using Lp rather than, e.g., tanh (cf. Anthropic’s Feb update), log1psum (acts.log1p().sum()), or prod1p (acts.log1p().sum().exp())? The agenda I’m pursuing (write-up in progress) gives theoretical justification for a sparsity penalty that explodes combinatorially in the number of active features, in any case where the downstream computation performed over the features does not distribute linearly over them. The product-based sparsity penalty seems to perform a bit better than both L0.5 and tanh on a toy example (sample size 1); see this colab.
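For concreteness, here is a minimal sketch of the penalties being compared, assuming `acts` is a tensor of non-negative (post-ReLU) SAE feature activations for a single example; the function names, the `p` and `scale` parameters, and the batch handling are my own choices for illustration, while the `log1psum` and `prod1p` expressions are taken directly from the comment above.

```python
import torch

def lp_penalty(acts: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    # Lp penalty with 0 < p < 1: concave, so many small activations cost
    # more than one large activation of the same total magnitude.
    return acts.abs().pow(p).sum()

def tanh_penalty(acts: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    # Saturating penalty (cf. Anthropic's Feb update): each feature's
    # contribution approaches 1 once its activation exceeds ~1/scale,
    # so this roughly counts active features.
    return torch.tanh(scale * acts.abs()).sum()

def log1psum_penalty(acts: torch.Tensor) -> torch.Tensor:
    # log1psum: sum_i log(1 + a_i); grows sublinearly per feature.
    return acts.log1p().sum()

def prod1p_penalty(acts: torch.Tensor) -> torch.Tensor:
    # prod1p: exp(sum_i log(1 + a_i)) = prod_i (1 + a_i).
    # Each additional active feature multiplies the penalty, so the cost
    # grows combinatorially in the number of active features.
    return acts.log1p().sum().exp()
```

The point of `prod1p` is visible in the last line: because the contributions multiply rather than add, activating a second feature scales the whole penalty, which is the combinatorial behaviour the theoretical argument asks for when downstream computation does not decompose linearly over features.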