In Magna Alta Doctrina, Jacob Cannell talks about exponentiated gradient descent as a way of approximating Solomonoff induction using ANNs:
While that approach is potentially interesting by itself, it’s probably better to stay within the real algebra. The Solomonoff-style partial continuous update for real-valued weights would then correspond to a multiplicative weight update rather than an additive weight update as in standard SGD.
Has this been tried/evaluated? Why actually yes—it’s called exponentiated gradient descent, as exponentiating the result of additive updates is equivalent to multiplicative updates. And intriguingly, for certain ‘sparse’ input distributions the convergence or total error of EGD/MGD is logarithmic rather than the typical inverse polynomial of AGD (additive gradient descent): O(log N) vs O(1/N) or O(1/N²), and fits naturally with ‘throw away half the theories per observation’.
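To make the contrast concrete, here is a minimal sketch (mine, not from the post) of the two update rules on a weight vector; the function names and learning rate are illustrative assumptions:

```python
import numpy as np

def additive_update(w, grad, lr=0.1):
    # Standard SGD step: w <- w - lr * grad
    return w - lr * grad

def multiplicative_update(w, grad, lr=0.1):
    # Exponentiated gradient step: w <- w * exp(-lr * grad).
    # This is the same as taking the additive step in log-weight space
    # and then exponentiating, which is why exponentiating additive
    # updates is equivalent to multiplicative updates.
    return w * np.exp(-lr * grad)
```

Note that the multiplicative rule keeps weights positive, so EGD is typically run over non-negative weights (or paired positive/negative copies, as in the EG± variant).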
The situations where EGD outperforms AGD, or vice versa, depend on the input distribution: if it’s more normal, then AGD wins; if it’s more sparse log-normal, then EGD wins. The moral of the story is that there isn’t one single simple update rule that always maximizes convergence/performance; it all depends on the data distribution (a key insight from Bayesian analysis).
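As a hedged illustration of that distribution-dependence, this toy experiment (my construction, not from the post) fits a sparse non-negative target with both rules; sparse targets like this are where EGD’s logarithmic dependence on dimension is expected to show:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                    # many features...
w_true = np.zeros(N)
w_true[:3] = 1.0 / 3                        # ...but only a few matter (sparse target)

w_add = np.full(N, 1.0 / N)                 # both learners start uniform
w_mul = np.full(N, 1.0 / N)
lr = 1e-3

for _ in range(5000):
    x = rng.normal(size=N)
    y = x @ w_true                          # noiseless linear observations
    w_add -= lr * (x @ w_add - y) * x       # additive (SGD) step on squared loss
    w_mul *= np.exp(-lr * (x @ w_mul - y) * x)  # exponentiated-gradient step
    w_mul /= w_mul.sum()                    # EG renormalizes onto the simplex

print("AGD error:", np.linalg.norm(w_add - w_true))
print("EGD error:", np.linalg.norm(w_mul - w_true))
```

Swapping in a dense, roughly normal w_true should tilt the comparison back toward the additive rule.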
The exponential/multiplicative update is correct in Solomonoff’s use case because the different sub-models are strictly competing rather than cooperating: we assume a single correct theory can explain the data, and predict through an ensemble of sub-models. But we should expect that learned cooperation is also important—and more specifically, if you look more deeply down the layers of a deeply factored net, at where nodes representing sub-computations are more heavily shared, that perhaps argues for cooperative components.
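The competing-sub-models picture is just Bayes’ rule, which is itself a multiplicative weight update: each observation multiplies every hypothesis’s weight by its likelihood, so wrong theories decay exponentially. A small sketch with made-up coin-bias hypotheses:

```python
import numpy as np

biases = np.linspace(0.05, 0.95, 19)               # competing coin-bias theories
weights = np.full(len(biases), 1.0 / len(biases))  # uniform prior over theories

rng = np.random.default_rng(1)
for heads in rng.random(100) < 0.8:                # flips from a coin with bias 0.8
    weights *= biases if heads else (1.0 - biases) # multiplicative (Bayes) update
    weights /= weights.sum()                       # renormalize the posterior

print("most probable theory:", biases[weights.argmax()])
```

Because the weights multiply rather than add, a theory that badly mispredicts even a few observations is effectively discarded, matching the ‘throw away half the theories per observation’ intuition.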
My read of this is that we get a criterion for when one should be a hedgehog versus a fox in forecasting: one should be a fox when the distributions you need to operate in are normal, or rather when they do not have long tails, and a hedgehog when your input distribution is more log-normal, or rather when there may be long tails.
This makes some sense. If you don’t have many outliers, most theories should agree with each other, and it’s hard to test & distinguish between them; if one of your theories does make striking predictions far different from your other theories, it’s probably wrong, simply because striking things don’t really happen.
In contrast, if you need to regularly deal with extreme scenarios, you need theories capable of generalizing to those extreme scenarios, which means not throwing out theories for making striking or weird predictions. Striking events end up being common, so making them is less of an indictment.
But there are also reasons to think this is wrong. Hits-based entrepreneurship approaches, for example, seem to be more foxy than standard quantitative or investment finance, and hits-based entrepreneurship works precisely because the distribution of outcomes for companies is long-tailed.
In some sense the difference between the two is a “sin of omission” versus “sin of commission” disagreement: the hits-based approach needs to see how something could go right, while the standard finance approach needs to see how something could go wrong. So it’s not so much a predictive disagreement between the two approaches as a decision-theoretic or comparative-advantage difference.