Brains and backprop: a key timeline crux
[Crossposted from my blog]
The Secret Sauce Question
Human brains still outperform deep learning algorithms in a wide variety of tasks, such as playing soccer or knowing that it’s a bad idea to drive off a cliff without having to try first (for more formal examples, see Lake et al., 2017; Hinton, 2017; LeCun, 2018; Irpan, 2018). This fact can be taken as evidence for two different hypotheses:
In order to develop human-level AI, we have to develop entirely new learning algorithms. At the moment, AI is a deep conceptual problem.
In order to develop human-level AI, we basically just have to improve current deep learning algorithms (and their hardware) a lot. At the moment, AI is an engineering problem.
The question of which of these views is right I call “the secret sauce question”.
The secret sauce question seems like one of the most important considerations in estimating how long there is left until the development of human-level artificial intelligence (“timelines”). If something like 2) is true, timelines are arguably substantially shorter than if something like 1) is true [1].
However, it seems initially difficult to arbitrate between these two vague, high-level views. It appears as though an answer requires complicated inside views stemming from deep and wide knowledge of current technical AI research. This is partly true. Yet this post proposes that there might also be a single, concrete discovery capable of settling the secret sauce question: does the human brain learn using gradient descent, by implementing backpropagation?
The importance of backpropagation
Underlying the success of modern deep learning is a single algorithm: gradient descent with backpropagation of error (LeCun et al., 2015). In fact, the majority of research is not focused on finding better algorithms, but rather on finding better cost functions to descend using this algorithm (Marblestone et al., 2016). Yet, in stark contrast to this success, since the 1980’s the key objection of neuroscientists to deep learning has been that backpropagation is not biologically plausible (Crick, 1989; Stork, 1989).
As a result, the question of whether the brain implements backpropagation provides critical evidence on the secret sauce problem. If the brain does not use it, and still outperforms deep learning while running on the energy of a laptop and training on several orders of magnitude fewer training examples than parameters, this suggests that a deep conceptual advance is necessary to build human-level artificial intelligence. There’s some other remarkable algorithm out there, and evolution found it. But if the brain does use backprop, then the reason deep learning works so well is because it’s somehow on the right track. Human researchers and evolution converged on a common solution to the problem of optimising large networks of neuron-like units. (These arguments assume that if a solution is biologically plausible and the best solution available, then it would have evolved).
Actually, the situation is a bit more nuanced than this, and I think it can be clarified by distinguishing between algorithms that are:
Biologically actual: What the brain actually does.
Biologically plausible: What the brain might have done, while still being restricted by evolutionary selection pressure towards energy efficiency etc.
For example, humans walk with legs, but it seems possible that evolution might have given us wings or fins instead, as those solutions work for other animals. However, evolution could not have given us wheels, as that requires a separable axle and wheel, and it’s unclear what an evolutionary path to an organism with two separable parts looks like (excluding symbiotic relationships).
Biologically possible: What is technically possible to do with collections of cells, regardless of its relative evolutionary advantage.
For example, even though evolving wheels is implausible, there might be no inherent problem with an organism having wheels (created by “God”, say), in the way in which there’s an inherent problem with an organism’s axons sending action potentials faster than the speed of light.
I think this leads to the following conclusions:
| Nature of backprop | Implication for timelines |
| --- | --- |
| Biologically impossible | Unclear; there might be multiple “secret sauces” |
| Biologically possible, but not plausible | Same as above |
| Biologically plausible, but not actual | Timelines are long; there’s likely a “secret sauce” |
| Biologically actual | Timelines are short; there’s likely no “secret sauce” |
In cases where evolution could not have invented backprop anyway, the comparison tells us little: that outcome is consistent both with backprop not being the right way to go and with it being better than whatever evolution did.
It might be objected that this question doesn’t really matter, since if neuroscientists found out that the brain does backprop, they have not thereby created any new algorithm—but merely given stronger evidence for the workability of previous algorithms. Deep learning researchers wouldn’t find this any more useful than Usain Bolt would find it useful to know that his starting stance during the sprint countdown is optimal: he’s been using it for years anyway, and is mostly just eager to go back to the gym.
However, this argument seems mistaken.
On the one hand, just because it’s not useful to deep learning practitioners does not mean it’s not useful to others trying to estimate the timelines of technological development (such as policy-makers or charitable foundations).
On the other hand, I think this knowledge is very practically useful for deep learning practitioners. According to my current models, the field seems unique in combining the following features:
Long iteration loops (on the order of GPU-weeks to GPU-years) for testing new ideas.
High dependence of performance on hyperparameters, such that the right algorithm with slightly off hyperparameters will not work at all.
High dependence of performance on the amount of compute accessible, such that the differences between enough and almost enough are step-like, or qualitative rather than quantitative. Too little compute and the algorithm just doesn’t work at all.
Lack of a unified set of first principles for understanding the problems; instead, a collection of effective heuristics.
This is an environment where it is critically important to develop strong priors on what should work, and to stick with those in the face of countless fruitless tests. Indeed, LeCun, Hinton and Bengio seem to have persevered for decades before the AI community stopped thinking they were crazy. (This is similar in some interesting ways to the state of astronomy and physics before Newton. I’ve blogged about this before here.) There’s an asymmetry such that even though training a very powerful architecture can be quick (on the order of a GPU-day), iterating over architectures to figure out which ones to train fully in the first place can be incredibly costly. As such, knowing whether gradient descent with backprop is or is not the way to go would enable more efficient allocation of research time (though mostly so in case backprop is not the way to go, as the majority of current researchers assume it anyway).
Appendix: Brief theoretical background
This section describes what backpropagation is, why neuroscientists have claimed it is implausible, and why some deep learning researchers think those neuroscientists are wrong. The latter arguments are basically summarised from this talk by Hinton.
Multi-layer networks with access to an error signal face the so-called “credit assignment problem”. The error of the computation will only be available at the output: a child pronouncing a word erroneously, a rodent tasting an unexpectedly nauseating liquid, a monkey mistaking a stick for a snake. However, in order for the network to improve its representations and avoid making the same mistake in the future, it has to know which representations to “blame” for the mistake. Is the monkey too prone to think long things are snakes? Or is it bad at discriminating the textures of wood and skin? Or is it bad at telling eyes from eye-sized bumps? And so forth. This problem is exacerbated by the fact that neural network models often have tens or hundreds of thousands of parameters, not to mention the human brain, which is estimated to have on the order of 10^14 synapses. Backpropagation proposes to solve this problem by observing that the maths of gradient descent work out such that one can essentially send the error signal from the output, back through the network towards the input, modulating it by the strength of the connections along the way. (A complementary perspective on backprop is that it is just an efficient way of computing derivatives in large computational graphs, see e.g. Olah, 2015).
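To make the mechanics concrete, here is a minimal sketch in Python/NumPy (my own illustration, not taken from any of the cited papers) of gradient descent with backprop on a tiny one-hidden-layer network; the data, layer sizes and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: 100 examples, 3 input features, 1 real-valued target.
X = rng.normal(size=(100, 3))
y = np.sin(X.sum(axis=1, keepdims=True))

# One hidden layer of 16 tanh units; every weight is a "representation"
# that may deserve some of the blame for the output error.
W1 = rng.normal(scale=0.1, size=(3, 16))
W2 = rng.normal(scale=0.1, size=(16, 1))
lr = 0.1

for step in range(1000):
    # Forward pass: compute activations layer by layer.
    h = np.tanh(X @ W1)
    pred = h @ W2

    # The error is only available at the output.
    grad_pred = 2 * (pred - y) / len(X)      # d(mean squared error)/d(pred)

    # Backward pass: send the error back through the network,
    # modulating it by the connection strengths along the way.
    grad_W2 = h.T @ grad_pred
    grad_h = grad_pred @ W2.T                # error signal passed back through W2
    grad_W1 = X.T @ (grad_h * (1 - h ** 2))  # and through the tanh nonlinearity

    # Gradient descent: each weight moves against its share of the blame.
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2
```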
Now why do some neuroscientists have a problem with this?
Objection 1:
Most learning in the brain is unsupervised, without any error signal similar to those used in supervised learning.
Hinton’s reply:
There are at least three ways of doing backpropagation without an external supervision signal:
1. Try to reconstruct the original input (using e.g. auto-encoders), and thereby develop representations sensitive to the statistics of the input domain (see the code sketch below)
2. Use the broader context of the input to train local features
For example, in the sentence “She scromed him with the frying pan”, we can infer that the sentence as a whole doesn’t sound very pleasant, and use that to update our representation of the novel word “scrom”
3. Learn a generative model that assigns high probability to the input (e.g. using variational auto-encoders or the wake-sleep algorithm from the 1990’s)
Bengio and colleagues (2017) have also done interesting work on this, partly reviving energy-minimising Hopfield networks from the 1980’s
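Below is a minimal sketch (my own illustration in Python/NumPy, with arbitrary sizes and learning rate) of option 1 above: a tiny autoencoder, where the “error signal” is simply the reconstruction error of the input, so backprop can run without any external labels.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))                 # unlabeled data: no error signal provided

W_enc = rng.normal(scale=0.1, size=(8, 3))    # compress 8 features into a 3-dimensional code
W_dec = rng.normal(scale=0.1, size=(3, 8))    # reconstruct the 8 features from the code
lr = 0.05

for step in range(2000):
    code = np.tanh(X @ W_enc)                 # learned representation of the input
    recon = code @ W_dec

    # The "supervision" is the input itself: the error is the reconstruction error.
    grad_recon = 2 * (recon - X) / len(X)

    # Ordinary backprop from here on.
    grad_W_dec = code.T @ grad_recon
    grad_code = grad_recon @ W_dec.T
    grad_W_enc = X.T @ (grad_code * (1 - code ** 2))

    W_enc -= lr * grad_W_enc
    W_dec -= lr * grad_W_dec
```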
Objection 2:
Neurons communicate using binary spikes, rather than real values (this was among the earliest objections to backprop).
Hinton’s reply:
First, one can just send spikes stochastically and use the expected spike rate (e.g. with a Poisson rate, which is somewhat close to what real neurons do, although there are important differences; see e.g. Ma et al., 2006; Pouget et al., 2003).
Second, this might make evolutionary sense, as the stochasticity acts as a regularising mechanism making the network less prone to overfitting. This behaviour is in fact where Hinton got the idea for the dropout algorithm (which has been very popular, though it recently seems to have been largely replaced by batch normalisation).
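A small sketch of this point (my own illustration, using Bernoulli spikes rather than a true Poisson process for simplicity): the expected spike rate recovers the real-valued activation, and silencing units at random in this way is essentially what dropout does.

```python
import numpy as np

rng = np.random.default_rng(2)

# A unit with real-valued activation 0.7 that only ever emits binary spikes:
activation = 0.7
spikes = rng.random(10_000) < activation   # spike with probability = activation
print(spikes.mean())                       # close to 0.7: the expected spike rate recovers the value

# Applied to a whole layer, the same stochastic silencing is essentially dropout:
h = rng.random(16)                         # a layer of real-valued activations
keep = rng.random(16) < 0.5                # each unit randomly kept or silenced
h_dropped = h * keep / 0.5                 # rescale so the expected activation is unchanged
```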
Objection 3:
Single neurons cannot represent two distinct kinds of quantities, as would be required to do backprop (the presence of features and gradients for training).
Hinton’s reply:
This is in fact possible. One can use the temporal derivative of the neuronal activity to represent gradients.
(There is interesting neuropsychological evidence supporting the idea that the temporal derivative of a neuron can not be used to represent changes in that feature, and that different populations of neurons are required to represent the presence and the change of a feature. Patients with certain brain damage seem able to recognise that a moving car occupies different locations at two points in time, without being able to ever detect the car changing position.)
Objection 4:
Cortical connections only transmit information in one direction (from soma to synapse), and the kinds of backprojections that exist are far from the perfectly symmetric ones used for backprop.
Hinton’s reply:
This led him to abandon the idea that the brain could do backpropagation for a decade, until “a miracle appeared”. Lillicrap and colleagues at DeepMind (2016) found that a network propagating gradients back through random and fixed feedback weights in the hidden layer can match the performance of one using ordinary backprop, given a mechanism for normalization and under the assumption that the weights preserve the sign of the gradients. This is a remarkable and surprising result, and indicates that backprop is still poorly understood. (See also follow-up work by Liao et al., 2016).
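For intuition, here is a toy sketch (mine, not the paper’s actual setup) of the feedback-alignment idea from Lillicrap et al.: the backward pass uses a fixed random matrix B in place of the transposed forward weights, and the network can still learn.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy regression task.
X = rng.normal(size=(200, 5))
y = np.sin(X.sum(axis=1, keepdims=True))

W1 = rng.normal(scale=0.1, size=(5, 32))
W2 = rng.normal(scale=0.1, size=(32, 1))
B = rng.normal(scale=0.1, size=(1, 32))   # fixed, random feedback weights (never trained)
lr = 0.1

for step in range(2000):
    h = np.tanh(X @ W1)
    pred = h @ W2
    err = 2 * (pred - y) / len(X)

    # Ordinary backprop would send the error back through W2.T;
    # feedback alignment sends it back through the fixed random matrix B instead.
    grad_h = err @ B
    W2 -= lr * (h.T @ err)
    W1 -= lr * (X.T @ (grad_h * (1 - h ** 2)))

# The hidden-layer updates are not true gradients, yet the loss tends to fall:
# in practice the forward weights come to align with the random feedback weights.
```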
[1] One possible argument for this is that, in a large share of plausible worlds, if 1) is true and conceptual advances are necessary, building superintelligence will still turn into an engineering problem once those advances have been made. Hence 1) requires strictly more resources than 2).
Discussion questions
I’d encourage discussion on:
Whether the brain does backprop (object-level discussion of the work of Lillicrap, Hinton, Bengio, Liao and others).
Whether it’s actually important for the secret sauce question to know whether the brain does backprop.
To keep things focused and manageable, it seems reasonable to discourage discussion of what other secret sauces there might be.
Interesting fact about backprop: a supply chain of profit-maximizing, competitive companies can be viewed as implementing backprop. Obviously there’s some setup here, but it’s reasonably general; I’ll have a long post on it at some point. This should not be very surprising: backprop is just an efficient algorithm for calculating gradients, and prices in competitive markets are basically just gradients of production functions.
Anyway, my broader point is this: backprop is just an efficient way to calculate gradients. In a distributed system (e.g. a market), it’s not necessarily the most efficient gradient-calculation algorithm. What’s relevant is not whether the brain uses backpropagation per se, but whether it uses gradient descent. If the brain mainly operates off of gradient descent, then we have that theoretical tool already, regardless of the details of how the brain computes the gradient.
Many of the objections listed to brain-as-backprop only apply to single-threaded, vanilla backprop, rather than gradient descent more generally.
I’m looking forward to reading that post.
Yes, it seems right that gradient descent is the key crux. But I’m not familiar with any efficient way of doing it that the brain might implement, apart from backprop. Do you have any examples?
Here’s my preferred formulation of the general derivative problem (skip to the last paragraph if you just want the summary): you have some function f(x). We’ll assume that it’s been “flattened out”, i.e. all the loops and recursive calls have been expanded; it’s just a straight-line numerical function. Adopting hilariously bad variable names, suppose the i-th line of f computes y_i. We’ll also assume that the first lines of f just load in x, so e.g. y_0 = x_0. If f has n lines, then the output of f is y_n.
Now, we create a vector-valued function F(y), which runs each line of f in parallel: F_i(y) = (line i of f evaluated at y). f(x) computes a fixed point y = F(y) (it may take a moment of thought or an example for that part to make sense). It’s that fixed point formula which we differentiate. The result: we get ∂F/∂x = A·(dy/dx), where A is a very sparse triangular matrix. In fact, we don’t even need to solve the whole thing; we only need dy_n/dx. Backprop just uses the usual method for solving triangular matrices: start at the end and work back.
Main point: derivative calculation, in general, can be done by solving a (sparse, triangular) system of linear equations. There’s a whole field devoted to solving sparse matrices, especially in parallel. Different methods work better depending on the matrix structure (which will follow the structure of the computation DAG of f), so different methods will work better for different functions. Pick your favorite sparse matrix solver, ideally one which will leverage triangularity, and boom, you have a derivative calculator.
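As a concrete illustration (mine, not part of the original comment), here is the flattened function f(x) = sin(x)·x + exp(x) written as a straight-line program, with its derivative obtained by solving the sparse triangular system A·(dy/dx) = ∂F/∂x:

```python
import numpy as np

def derivative_via_triangular_solve(x):
    # Straight-line ("flattened") version of f(x) = sin(x) * x + exp(x):
    #   y0 = x,  y1 = sin(y0),  y2 = exp(y0),  y3 = y1 * y0,  y4 = y3 + y2
    y0 = x
    y1 = np.sin(y0)

    # dF[i, j] = d(line i of f)/d(y_j); each line only reads earlier lines,
    # so dF is strictly lower triangular and A = I - dF is triangular too.
    dF = np.zeros((5, 5))
    dF[1, 0] = np.cos(y0)
    dF[2, 0] = np.exp(y0)
    dF[3, 0], dF[3, 1] = y1, y0
    dF[4, 2], dF[4, 3] = 1.0, 1.0
    A = np.eye(5) - dF

    b = np.zeros(5)
    b[0] = 1.0                     # dF/dx: only the input-loading line depends directly on x

    dydx = np.linalg.solve(A, b)   # any sparse/triangular solver would do here
    return dydx[-1]                # we only need dy_n/dx

x = 1.3
print(derivative_via_triangular_solve(x))
print(np.cos(x) * x + np.sin(x) + np.exp(x))   # analytic derivative, for comparison
```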
Side note: do these comments support LaTeX? Is there a page explaining what comments do support? It doesn’t seem to be markdown, no idea what we’re using here.
It is a WYSIWYG markdown editor, and the dollar sign is the symbol that opens the LaTeX editor (I’ve LaTeXed your comment for you, hope that’s okay).
Added: @habryka oops, double-comment!
Ooooh, that makes much more sense now, I was confused by the auto-formatting as I typed. Thank you for taking the time to clean up my comment. Also, thank you @habryka.
Also, how do images work in posts? I was writing up a post the other day, but when I tried to paste in an image it just created a camera symbol. Alternatively, is this stuff documented somewhere?
My transatlantic flight permitting, I’ll reply with a post tomorrow with full descriptions of how to use the editor.
Thank you very much! I really appreciate the time you guys are putting in to this.
You’re welcome :-) Here’s a mini-guide to the editor.
The thing is now in LaTeX! Beautiful!
Yep, we support LaTeX and do a WYSIWYG translation of markdown as soon as you type it (I.e. words between asterisks get bolded, etc.). You can start typing LaTeX by typing $ and then a small equation editor shows up. You can also insert block-level equations by pressing CTRL+M.
Typing $ does nothing on my iPhone.
Because the mobile editing experience was pretty buggy, we replaced the mobile editor with a markdown-only editor two days ago. We will activate LaTeX for that editor pretty soon (which will probably mean replacing equations between “$$” with the LaTeX rendered version), but that means LaTeX is temporarily unavailable on phones (though the previous LaTeX editor didn’t really work with phones anyways, so it’s mostly just a strict improvement on what we have).
Ok, no problem; I don’t really know LaTeX anyway.
Hello from the future! I’m interested to hear how your views have updated since this comment and post were written. 1. What is your credence that the brain learns via gradient descent? 2. What is your credence that it in fact does so in a way relevantly similar to backprop? 3. Do you still think that insofar as your credence in 1 is high, timelines are short?
I appreciate you following up on this!
The sad and honest truth, though, is that since I wrote this post, I haven’t thought about it. :( I haven’t picked up on any key new piece of evidence—though I also haven’t been looking.
I could give you credences, but that would mostly just involve rereading this and loading up all the thoughts.
Ok! Well, FWIW, it seems very likely to me that the brain learns via gradient descent, and indeed probable that it does something relevantly similar (though of course not identical to) backprop. (See the link above). But I feel very much an imposter discussing all this stuff since I lack technical expertise. I’d be interested to hear your take on this stuff sometime if you have one or want to make one! See also:
https://arxiv.org/abs/2006.04182 (Brains = predictive processing = backprop = artificial neural nets)
https://www.biorxiv.org/content/10.1101/764258v2.full (IIRC this provides support for Kaplan’s view that human ability to extrapolate is really just interpolation done by a bigger brain on more and better data.)
I’m currently on vacation, but I’d be interested in setting up a call once I’m back in 2 weeks! :) I’ll send you my calendly in PM
Thanks for the excellent post, Jacob. I think you might be placing too much emphasis on learning algorithms as opposed to knowledge representations, though. It seems very likely to me that at least one theoretical breakthrough in knowledge representation will be required to make significant progress (for one argument along these lines, see Pearl 2018). Even if it turns out that the brain implements backpropagation, that breakthrough will still be a bottleneck. In biological terms, I’m thinking of the knowledge representations as analogous to innate aspects of cognition impressed upon us by evolution, and learning algorithms as what an individual human uses to learn from their experiences.
Two examples which suggest that the former are more important than the latter. The first is the “poverty of stimulus” argument in linguistics: that children simply don’t hear enough words to infer language from first principles. This suggests that ingrained grammatical instincts are doing most of the work in narrowing down what the sentences they hear mean. Even if we knew that the kids were doing backpropagation whenever they heard new sentences, that doesn’t tell us much about how that grammatical knowledge works, because you can do backpropagation on lots of different things. (You know more psycholinguistics than I do, though, so let me know if I’m misrepresenting anything).
Second example: Hinton argues in this talk that CNNs don’t create representations of three-dimensional objects from two-dimensional pictures in the same way as the human brain does; that’s why he invented capsule networks, which (he claims) do use such representations. Both capsules and CNNs use backpropagation, but the architecture of capsules is meant to be an extra “secret sauce”. Seeing whether they end up working well on vision tasks will be quite interesting, because vision is better-understood and easier than abstract thought (for example, it’s very easy to theoretically specify how to translate between any two visual perspectives, it’s just a matrix multiplication).
Lastly, as a previous commenter pointed out, it’s not backpropagation but rather gradient descent which seems like the important factor. More specifically, recent research suggests that stochastic gradient descent leads to particularly good outcomes, for interesting theoretical reasons (see Zhang 2017 and this blog post by Huszár). Since the brain does online learning, if it’s doing gradient descent then it’s doing a variant of SGD. I discuss why SGD works well in more detail in the first section of this blog post.
I had a conversation with Paul where I asked him a roughly similar question, namely “how many nontrivial theoretical insights are we away from superintelligent AI, and how quickly will they get produced?” His answer was “plausibly zero or one, but also I think we haven’t had a nontrivial theoretical insight since the 1980s” (this is my approximate recollection, Paul should correct me). We talked about it a bit more and he managed to lengthen my timeline, which was nice.
I had previously had a somewhat lower threshold for what constituted a nontrivial theoretical insight, a sense that there weren’t very many left, and a sense that they were going to happen pretty quickly, based mostly on the progress made by AlphaGo and AlphaGo Zero. Paul gave me a stronger sense that most of the recent progress has been due to improved compute and tricks.
In order for me to update on this, it would be great to have concrete examples of what does and does not constitute “nontrivial theoretical insights” according to you and Paul.
E.g. what was the insight from the 1980s? And what part of the AG(Z) architecture did you initially consider nontrivial?
A more precise version of my claim: if you gave smart grad students from 1990 access to all of the non-AI technology of 2017 (esp. software tools + hardware + data) and a big budget, it would not take them long to reach nearly state-of-the-art performance on supervised learning and RL. For example, I think it’s pretty plausible that 20 good grad students could do it in 3 years if they were motivated and reasonably well managed.
If they are allowed to query for 1 bit of advice per month (e.g. “should we explore approach X?”) then I think it’s more likely than not that they would succeed. The advice is obviously a huge advantage, but I don’t think that it can plausibly substitute for “nontrivial theoretical insight.”
There is lots of uncertainty about that operationalization, but the main question is just whether there are way too many small things to figure out and iterate on rather than whether there are big insights.
(Generative modeling involves a little bit more machinery. I don’t have a strong view on whether they would figure out GANs or VAEs, though I’d guess so. Autoregressive models aren’t terrible anyway.)
They certainly wouldn’t come up with every trick or clever idea, but I expect they’d come up with the most important ones. With only 60 person-years they wouldn’t be able to put in very much domain-specific effort for any domain, so probably wouldn’t actually set SOTA, but I think they would likely get within a few years.
(I independently came up with the AGZ and GAN algorithms while writing safety posts, which I consider reasonable evidence that the ideas are natural and aren’t that hard. I expect there are a large number of cases of independent invention, with credit reasonably going to whoever actually gets it working.)
I don’t have as strong a view about whether this was also true in the 70s. By the late 80s, neural nets trained with backprop were a relatively prominent/popular hypothesis about how to build AGI, so you would have spent less time on alternatives. You have some simple algorithms each of which might turn out to not be obvious (like Q learning, which I think are roughly as tricky as the AGZ algorithm). You have the basic ideas for CNNs (though I haven’t looked into this extensively and don’t know how much of the idea was actually developed by 1990 vs. in 1998). I feel less comfortable betting on the grad students if you take all those things away. But realistically it’s more like a continuous increase in probability of success rather than some insight that happened in the 80s.
If you tried to improve the grad students’ performance by shipping back some critical insights, what would they be?
Do you think that solving Starcraft (by self-play) will require some major insight or will it be just a matter of incremental improvement of existing methods?
I don’t think it will require any new insight. It might require using slightly different algorithms—better techniques for scaling, different architectures to handle incomplete information, maybe a different training strategy to handle the very long time horizons; if they don’t tie their hands it’s probably also worth adding on a bunch of domain-specific junk.
Thanks for taking the time to write that up.
I updated towards a “fox” rather than “hedgehog” view of what intelligence is: you need to get many small things right, rather than one big thing. I’ll reply later if I feel like I have a useful reply.
I put most weight on the hypothesis that there are multiple secret sauces, so even if it turned out that brains use some kind of backprop, I would not expect the rest to be “just engineering”. For example, there is an open problem with long-term memory, which may require architectural changes, such as freezing weights, adding neurons along the way, and so on.
Btw, you likely have the wrong labels on the scenarios.
Generally a good way of looking at things! Thanks.
Thanks, I’m glad you found the framing useful.
Significantly changed some of the formatting to make it more legible.
I’m very late to the party on this post, but wanted to say that I found it useful for getting a better sense of how recent AI advances were made, and for finding out that they seem pretty unlikely to get us to AGI soon (I’m drawing that conclusion based on additional info).
My expectation is that we won’t be getting close until we can figure out how to make recurrent networks succeed at unsupervised learning since that is the nearest analogue in ML to how brains work (to the best of my knowledge).
I don’t understand this: why can’t you just have some neurons that represent the former, and some neurons that represent the latter?
Do you have any particular source for dropout being replaced by batch normalisation, or is it an impression from the papers you’ve been reading?
Because people thought you needed the same weights to 1) transport the gradients back, 2) send the activations forward. Having two distinct networks with the same topology and getting the weights to match was known as the “weight transport problem”. See Grossberg, S. 1987. Competitive learning: From interactive activation to adaptive resonance. Cognitive science 11(1):23–63.
The latter.
There is a wiki page at https://en.wikipedia.org/wiki/Neural_backpropagation which claims that this effect exists. Also, maybe what we experience as night dreams is backpropagation through our visual cortex, which at that moment becomes something like a generative neural net.
Yes, there is a poorly understood phenomenon whereby action potentials sometimes travel back through the dendrites preceding them. This is insufficient for ML-style backprop because it rarely happens across more than one layer.
The premise that “human-level AI” must be built around some form of learning (and the implication that learning is what needs to be improved) is highly dubious (not evidenced enough, at all, and completely at odds with my own intuitions besides).
As it is, deep learning can be seen “simply” as a way to approximate a mathematical function. In the case of computer vision, one could see it as a function that twiddles with the images’ pixels and outputs a result. The genius of the approach is how relatively fast we can find a function that approximates the process of interest (compared to, say, classical search algorithms). A big caveat: human intuition is still required in finding the right parameters to tweak the network, but it’s very conceivable that this could be improved.
Nevertheless, we don’t have human-level AI here. At the very best, we have its pattern-matching component. That is an important component to be sure, but we still don’t have an understanding of “concepts”; there is no “reflection” as understood in computer science (a form of meta-programming where programming language concepts are reified and available to the programmer using the language). We need the ability to form new concepts (some of which will be patterns), but also to reason about the concepts themselves, to pattern-match on them. In short, to think about thinking. It seems like in that regard, we’re still a long way off.
I think part of the assumption is that reflection can be bolted on trivially if the pattern matching is good enough. For example, consider guiding an SMT solver / automatic theorem prover with deep-learned heuristics, e.g. https://arxiv.org/abs/1701.06972. We know how to express reflection in formal languages; we know how to train intuition for fuzzy stuff; we might learn how to train intuition for formal languages.
This is still borderline useless, but there is no reason, a priori, that such approaches are doomed to fail. Especially since labels for training data are trivial (check the proof for correctness) and machine-discovered theorems / proofs can be added to the corpus.
There has been some work lately on derivative-free optimization of ANNs (ES mostly, but I’ve seen some other genetic-flavored work as well). They tend to be off-policy, and I’m not sure how biologically plausible that is, but something to think about w/r/t whether current DL progress is taking the same route as biological intelligence (-> getting us closer to [super]intelligence)
It seems very implausible to me that the brain would use evolutionary strategies, as it’s not clear how humans could try a sufficiently large number of parameter settings without any option for parallelisation, or store and then choose among previous configurations.
There is an algorithm called “Evolution Strategies”, popularized by OpenAI (although I believe it already existed in some form), that can train neural networks without backpropagation and without storing multiple sets of parameters. You can view it as a genetic algorithm with a population of one, but it really is a stochastic finite-differences gradient estimator.
On supervised learning tasks it is not competitive with backpropagation, but on reinforcement learning tasks (where you can’t analytically differentiate the reward signal so you have to estimate the gradient one way or the other) it is competitive. Some follow-up works combined it with backpropagation.
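For illustration, here is a minimal sketch (my own, with a made-up reward function and arbitrary hyperparameters) of the stochastic finite-differences view: perturb the parameters with Gaussian noise, weight each perturbation by the reward it obtained, and average to estimate the gradient, with no backprop involved.

```python
import numpy as np

rng = np.random.default_rng(4)

def reward(theta):
    # Stand-in for a reward signal we cannot differentiate analytically.
    return -np.sum((theta - 3.0) ** 2)

theta = np.zeros(10)
sigma, lr, n_samples = 0.1, 0.02, 50

for step in range(300):
    # Evaluate the reward at randomly perturbed copies of the parameters.
    noise = rng.normal(size=(n_samples, theta.size))
    rewards = np.array([reward(theta + sigma * eps) for eps in noise])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # normalise

    # Stochastic finite-differences estimate of the reward gradient:
    # perturbations that earned more reward pull the parameters their way.
    grad_estimate = noise.T @ rewards / (n_samples * sigma)
    theta += lr * grad_estimate   # gradient *ascent* on the reward

print(theta[:3])   # each entry should end up close to 3.0
```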
I wouldn’t be surprised if the brain does something similar, since the brain never really does supervised learning; it’s either unsupervised or reinforcement learning. The brain could combine local reconstruction and auto-regression learning rules (similar to the layerwise-trained autoencoders, but also trying to predict future inputs rather than just reconstructing the current ones) with finite-differences gradient estimation on reward signals propagated by the dopaminergic pathways.
The OpenAI ES algorithm isn’t very plausible (for exactly why you said), but the general idea of: “existing parameters + random noise → revert if performance got worse, repeat” does seem like a reasonable way to end up with an approximation of the gradient. I had in mind something more like Uber AI’s Neuroevolution, which wouldn’t necessarily require parallelization or storage if the brain did some sort of fast local updating, parameter-wise.