FWIW, my take is that the evolution-ML analogy is generally an excellent analogy, with a fair amount of predictive power, but worth using carefully and sparingly. Agreed that sufficient detail on e.g. DL specifics can screen off the usefulness of the analogy, but it’s very unclear whether we have sufficient detail yet. The evolution analogy was originally supposed to point out that selecting heavily for success on thing-X doesn’t necessarily produce thing-X-wanters (which is obviously true, but apparently not obvious enough to always be accepted without an example).
I think you’d do better to defer to an analogy with brains than with evolution, because brains are more like DL than evolution is.
Not sure where to land on that. It seems like both are good analogies? Brains might not be using gradients at all[1], whereas evolution basically is. But brains are definitely doing something like temporal-difference learning, and the overall ‘serial depth’ thing is also weakly in favour of brains ~= DL vs genomes+selection ~= DL.
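(For concreteness, by ‘temporal-difference learning’ I mean the textbook TD(0) rule, where a value estimate is nudged toward the bootstrapped one-step target after every transition. The sketch below is purely illustrative, with made-up states and numbers.)

```python
def td0_value_estimates(episodes, alpha=0.1, gamma=0.99):
    """Tabular TD(0): after each transition, move V(s) toward the
    bootstrapped target r + gamma * V(s').
    `episodes` is a list of trajectories of (state, reward, next_state)."""
    V = {}  # value estimates, defaulting to 0.0
    for trajectory in episodes:
        for state, reward, next_state in trajectory:
            v_s = V.get(state, 0.0)
            v_next = V.get(next_state, 0.0)
            td_error = reward + gamma * v_next - v_s  # the 'surprise' signal
            V[state] = v_s + alpha * td_error         # nudge estimate toward target
    return V

# toy usage: a chain A -> B -> C with reward only at the end
episodes = [[("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, "terminal")]] * 50
print(td0_value_estimates(episodes))
```

(The dopamine-as-TD-error story is the usual reason for saying brains do ‘something like’ this.)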
I’d love to know what you’re referring to by this:
evolution… is fine with a mutation that leads to 10^7 serial ops if its metabolic costs are low.
Also,
Is this a prediction that a cyclic learning rate—that goes up and down—will work out better than a decreasing one? If so, that seems false, as far as I know.
I think the jury is still out on this, but there’s literature on it (probably much more I haven’t fished out). [EDIT: also see this comment which has some other examples]
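(For anyone who hasn’t met the term: a cyclic schedule just ramps the learning rate up and down repeatedly instead of monotonically decaying it. The sketch below is a generic triangular version with made-up numbers, not the schedule from any particular paper in that literature.)

```python
def triangular_cyclic_lr(step, base_lr=1e-4, max_lr=1e-2, cycle_len=2000):
    """Triangular cyclic schedule: ramp linearly from base_lr to max_lr over
    the first half of each cycle, then back down over the second half."""
    half = cycle_len / 2
    pos = step % cycle_len
    frac = pos / half if pos < half else (cycle_len - pos) / half  # 0 -> 1 -> 0
    return base_lr + (max_lr - base_lr) * frac

# e.g. inside a training loop, assuming a PyTorch-style optimizer:
# for step in range(total_steps):
#     for group in optimizer.param_groups:
#         group["lr"] = triangular_cyclic_lr(step)
#     ... do the usual forward/backward/step ...
```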
AFAIK there’s no evidence of this and it would be somewhat surprising to find it playing a major role. Then again, I also wouldn’t be surprised if it turned out that brains are doing something which is secretly sort of equivalent to gradient descent.
I’m genuinely surprised at the “brains might not be doing gradients at all” take; my understanding is they are probably doing something equivalent.
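(To gesture at what ‘something equivalent’ could look like: schemes like feedback alignment send errors backwards through fixed random weights instead of the transposed forward weights, and in practice still track the gradient well enough to learn. The numpy sketch below is a toy version for illustration only; the architecture, data, and constants are all made up.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-layer regression net trained with feedback alignment: the error is
# sent backwards through a fixed random matrix B rather than W2.T, so no
# neuron needs access to the exact forward weights, yet the loss still falls.
n_in, n_hid, n_out, lr = 8, 32, 1, 0.05
W1 = rng.normal(0, 0.5, (n_in, n_hid))
W2 = rng.normal(0, 0.5, (n_hid, n_out))
B = rng.normal(0, 0.5, (n_out, n_hid))            # fixed random feedback weights

X = rng.normal(size=(256, n_in))
y = np.tanh(X @ rng.normal(size=(n_in, n_out)))   # arbitrary target function

for step in range(2001):
    h = np.tanh(X @ W1)           # forward pass
    pred = h @ W2
    err = pred - y                # output error
    dW2 = h.T @ err / len(X)      # usual delta rule at the output
    dh = (err @ B) * (1 - h**2)   # error routed back through B, not W2.T
    dW1 = X.T @ dh / len(X)
    W2 -= lr * dW2
    W1 -= lr * dW1
    if step % 500 == 0:
        print(step, float((err ** 2).mean()))
```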
Similarly, this kind of paper points in the direction of LLMs doing something like what brains do. My active expectation is that there will be a lot more papers like this in the future.
But to be clear: my overall view of the similarity of brains to DL is admittedly fueled less by these specific papers, which are nice gravy for my view but not its actual foundation, and much more by what I see as the predictive power of hypotheses like this, which are much more impressive inasmuch as they were made before Transformers had been invented. Given Transformers, the comparison seems overdetermined; I wish I had seen that back in 2015.
Re. serial ops and priors: I need to pin down the comparison more, but it’s mostly about the serial-depth thing, and I think you already get it. The basic idea is that what is “simple” to mutations and what is “simple” to DL are extremely different. Fuzzily: a mutation alters protein-folding instructions and is indifferent to the “computational cost” of working out their consequences in reality; if you tried to compute the analytic gradient for the mutation (the gradient through mutation → protein folding → different brain → different reward → competitors’ children look yummy → eat them), your computer would explode. DL, by contrast, only looks for solutions that can be computed by a big ensemble of extremely short circuits, and those circuits are learned almost entirely from the data you trained on. Ergo DL has very different biases: “complexity” for mutations is probably something like instruction length, whereas “complexity” for DL is more like distance from whatever biases are ingrained in the data (this part is fuzzy), and the shortcut solutions DL learns are always ones implied by the data.
So when you try to transfer intuitions about the “kind of solution” you get from evolution (which ignores this serial-depth cost) to DL (for which this serial-depth cost is central), the intuition breaks. As far as I can tell, that’s why we have this immense search for mesa-optimizers and the like, which seems to me to be mostly barking up the wrong tree. I dunno; I’d refine this more, but I need to actually work.
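(If it helps, here’s the cartoon in my head, written out as code. Everything in it is an illustrative toy of my own, not anything from the literature: the point is just that mutate-and-select treats evaluation as a black box no matter how much serial computation it hides, while gradient descent only ever searches over a fixed, shallow differentiable circuit shaped directly by the data.)

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Evolution-style search: mutate a genome, score it with a black box.
# The black box can unroll as much serial computation as it likes (here, a
# long iterated map); the search never "pays" for that depth, it only sees
# the fitness number that comes out the other end.
def black_box_fitness(genome):
    x = 0.1
    for g in genome:              # deep serial computation, invisible to the search
        x = np.tanh(3.0 * x + g)
    return -abs(x - 0.5)          # arbitrary made-up objective

genome = rng.normal(size=64)
for _ in range(500):
    child = genome + rng.normal(0, 0.05, size=64)        # mutation
    if black_box_fitness(child) > black_box_fitness(genome):
        genome = child                                   # selection

# (2) DL-style search: the solution has to *be* a fixed, shallow differentiable
# circuit, credit assignment flows through every op in it, and the learned
# shortcut is entirely shaped by the training data.
X = rng.normal(size=(256, 8))
y = (X[:, :1] > 0).astype(float)          # toy labels
W = rng.normal(0, 0.1, size=(8, 1))
for _ in range(500):
    pred = 1 / (1 + np.exp(-X @ W))       # one short circuit: matmul + sigmoid
    grad = X.T @ (pred - y) / len(X)      # gradient through exactly that circuit
    W -= 0.5 * grad
```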
Re. cyclic learning rates: both of us are too nervous about the theory --> practice junction to make a call on how all this transfers to useful algorithms (although my bet is that it won’t). But if we’re reluctant to make that inference here, how much more reluctant should we be to make it from evolution?
Mm, thanks for those resource links! OK, I think we’re mostly on the same page about what particulars can and can’t be said about these analogies at this point. I conclude that both ‘mutation+selection’ and ‘brain’ remain useful analogies, that having both is better than having only one, and that care needs to be taken in either case!
As I said,
I also wouldn’t be surprised if it turned out that brains are doing something which is secretly sort of equivalent to gradient descent
so I’m looking forward to reading those links.
Runtime optimisation/search and whatnot remain (broadly construed) a sensible concern from my POV, though I wouldn’t necessarily look literally inside NN weights to find them, at least at first. I think it’s more likely that some scaffolding is needed, if that makes sense (I think I’m somewhat idiosyncratic in this). I get fuzzy at this point and am still actively (slowly) building my picture of this; perhaps your resource links will provide me fuel here.
Not sure where to land on that. It seems like both are good analogies? Brains might not be using gradients at all[1], whereas evolution basically is.
I mean, does it matter? What if it turns out that gradient descent itself doesn’t affect inductive biases as much as the parameter->function mapping? If implicit regularization (e.g. SGD) isn’t an important part of the generalization story in deep learning, will you down-update on the appropriateness of the evolution/AI analogy?
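(FWIW, the parameter->function point is easy to poke at in a toy setting: sample random parameters for a tiny net on boolean inputs and tally which truth tables come out. The sketch below is my own illustrative toy, not the setup of any particular paper.)

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# All 2^3 boolean inputs for a tiny net.
X = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)], dtype=float)

def random_function(hidden=16):
    """Sample random parameters for a small MLP and read off the boolean
    function (8-bit truth table) it computes on the inputs above."""
    W1 = rng.normal(size=(3, hidden))
    b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=(hidden, 1))
    b2 = rng.normal(size=1)
    out = np.tanh(X @ W1 + b1) @ W2 + b2
    return tuple((out[:, 0] > 0).astype(int))

counts = Counter(random_function() for _ in range(100_000))
# If the parameter->function map were uniform over the 2^8 = 256 truth tables,
# each function would appear ~390 times; in practice a few simple functions
# (the constants especially) dominate the samples.
print(counts.most_common(5))
```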