Interesting fact about backprop: a supply chain of profit-maximizing, competitive companies can be viewed as implementing backprop. Obviously there’s some setup here, but it’s reasonably general; I’ll have a long post on it at some point. This should not be very surprising: backprop is just an efficient algorithm for calculating gradients, and prices in competitive markets are basically just gradients of production functions.
Anyway, my broader point is this: backprop is just an efficient way to calculate gradients. In a distributed system (e.g. a market), it’s not necessarily the most efficient gradient-calculation algorithm. What’s relevant is not whether the brain uses backpropagation per se, but whether it uses gradient descent. If the brain mainly operates off of gradient descent, then we have that theoretical tool already, regardless of the details of how the brain computes the gradient.
Many of the objections listed to brain-as-backprop only apply to single-threaded, vanilla backprop, rather than gradient descent more generally.
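To make the “prices are gradients” claim concrete, here is a minimal toy sketch (a two-firm chain of my own construction, not the setup from the promised post; the production functions and valuation below are made up for illustration). Each good’s competitive price is its marginal value to the downstream user, and those marginal values compose by the chain rule, which is exactly backprop’s backward pass:

```python
# Toy sketch (illustrative assumption, not the post's actual setup): a two-stage
# supply chain where competitive prices equal marginal downstream values,
# computed back-to-front exactly like backprop.
import numpy as np

f1 = lambda x: np.sqrt(x)        # firm 1: raw input -> intermediate good
f2 = lambda g: 3.0 * np.log(g)   # firm 2: intermediate -> final good
v  = lambda y: 10.0 * y          # consumers' valuation of the final good

x = 4.0                          # "forward pass": quantities flow downstream
g1 = f1(x)
g2 = f2(g1)

# "Backward pass": prices flow upstream as marginal values (the chain rule).
p2 = 10.0                        # dv/dg2
p1 = p2 * 3.0 / g1               # dv/dg1 = p2 * f2'(g1)
p0 = p1 * 0.5 / np.sqrt(x)       # dv/dx  = p1 * f1'(x)

# Sanity check against a finite-difference gradient of end-to-end value.
eps = 1e-6
numeric = (v(f2(f1(x + eps))) - v(f2(f1(x)))) / eps
print(p0, numeric)               # the raw input's "price" matches dv/dx
```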
I’m looking forward to reading that post.
Yes, it seems right that gradient descent is the key crux. But I’m not familiar with any efficient way of doing it that the brain might implement, apart from backprop. Do you have any examples?
Here’s my preferred formulation of the general derivative problem (skip to the last paragraph if you just want the summary): you have some function f(x). We’ll assume that it’s been “flattened out”, i.e. all the loops and recursive calls have been expanded, it’s just a straight-line numerical function. Adopting hilariously bad variable names, suppose the i-th line of f computes $y_i$. We’ll also assume that the first lines of f just load in x, so e.g. $y_0 = x_0$. If f has n lines, then the output of f is $y_n$.
Now, we create a vector-valued function F(y), which runs each line of f in parallel: $F_i(y) =$ (line $i$ of $f$ evaluated at $y$). f(x) computes a fixed point $y = F(y)$ (it may take a moment of thought or an example for that part to make sense). It’s that fixed point formula which we differentiate. The result: we get $\frac{\partial F}{\partial x} = A\frac{dy}{dx}$, where A is a very sparse triangular matrix. In fact, we don’t even need to solve the whole thing—we only need $\frac{dy_n}{dx}$. Backprop just uses the usual method for solving triangular matrices: start at the end and work back.
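For concreteness, here is a small worked instance (my own example, not from the comment above, and it assumes the reconstruction of the formula just given): take $f(x) = x\sin(x)$, flattened into three lines $y_0 = x$, $y_1 = \sin(y_0)$, $y_2 = y_0 y_1$. Then $F(y) = (x, \sin(y_0), y_0 y_1)$, the values $f$ computes satisfy $y = F(y)$, and differentiating that fixed point gives

$$A = I - \frac{\partial F}{\partial y} = \begin{pmatrix} 1 & 0 & 0 \\ -\cos(y_0) & 1 & 0 \\ -y_1 & -y_0 & 1 \end{pmatrix}, \qquad \frac{\partial F}{\partial x} = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}.$$

Solving the triangular system $A\frac{dy}{dx} = \frac{\partial F}{\partial x}$ row by row gives $\frac{dy_2}{dx} = \sin(x) + x\cos(x)$, as expected.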
Main point: derivative calculation, in general, can be done by solving a (sparse, triangular) system of linear equations. There’s a whole field devoted to solving sparse linear systems, especially in parallel. Different methods work better depending on the matrix structure (which will follow the structure of the computation DAG of f), so different methods will work better for different functions. Pick your favorite sparse solver, ideally one which exploits the triangular structure, and boom, you have a derivative calculator.
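Here is a minimal sketch of that recipe (my own code, not from the comment; it reuses the hypothetical $f(x)=x\sin(x)$ example above, with a dense matrix for simplicity). For a real computation DAG you would store $A$ sparsely and could swap in any parallel sparse triangular solver:

```python
# Minimal sketch: derivatives as a (here dense, in general sparse) triangular solve.
# Illustrative code, not from the comment above.
import numpy as np
from scipy.linalg import solve_triangular

x = 1.3

# Forward pass: the straight-line program for f(x) = x*sin(x).
y0 = x
y1 = np.sin(y0)
y2 = y0 * y1                         # f(x)

# A = I - dF/dy: row i has 1 on the diagonal and minus the partials of
# line i with respect to the earlier lines it reads.
A = np.array([
    [1.0,          0.0, 0.0],
    [-np.cos(y0),  1.0, 0.0],
    [-y1,         -y0,  1.0],
])
dF_dx = np.array([1.0, 0.0, 0.0])    # only the load-in line depends on x directly

# Any triangular solver works here; backprop corresponds to solving the
# transposed (upper-triangular) system for just the last component.
dy_dx = solve_triangular(A, dF_dx, lower=True)
print(dy_dx[-1], np.sin(x) + x * np.cos(x))   # these two numbers should match
```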
Side note: do these comments support LaTeX? Is there a page explaining what comments do support? It doesn’t seem to be markdown, no idea what we’re using here.
It is a WYSIWYG markdown editor, and the dollar sign is the symbol that opens the LaTeX editor (I’ve LaTeXed your comment for you, hope that’s okay).
Added: @habryka oops, double-comment!
Ooooh, that makes much more sense now; I was confused by the auto-formatting as I typed. Thank you for taking the time to clean up my comment. Also thank you, @habryka.
Also, how do images work in posts? I was writing up a post the other day, but when I tried to paste in an image it just created a camera symbol. Alternatively, is this stuff documented somewhere?
My transatlantic flight permitting, I’ll reply with a post tomorrow with full descriptions of how to use the editor.
Thank you very much! I really appreciate the time you guys are putting into this.
You’re welcome :-) Here’s a mini-guide to the editor.
The thing is now in LaTeX! Beautiful!
Yep, we support LaTeX and do a WYSIWYG translation of markdown as soon as you type it (i.e. words between asterisks get bolded, etc.). You can start typing LaTeX by typing $, and then a small equation editor shows up. You can also insert block-level equations by pressing CTRL+M.
Typing $ does nothing on my iPhone.
Because the mobile editing experience was pretty buggy, we replaced the mobile editor with a markdown-only editor two days ago. We will activate LaTeX for that editor pretty soon (which will probably mean replacing equations between “$$” with their rendered versions), but in the meantime LaTeX is temporarily unavailable on phones. The previous LaTeX editor didn’t really work on phones anyway, so this is mostly a strict improvement on what we had.
Ok, no problem; I don’t really know LaTeX anyway.
Hello from the future! I’m interested to hear how your views have updated since this comment and post were written.
1. What is your credence that the brain learns via gradient descent?
2. What is your credence that it in fact does so in a way relevantly similar to backprop?
3. Do you still think that insofar as your credence in 1 is high, timelines are short?
I appreciate you following up on this!
The sad and honest truth, though, is that since I wrote this post, I haven’t thought about it. :( I haven’t picked up on any key new piece of evidence—though I also haven’t been looking.
I could give you credences, but that would mostly just involve rereading this and loading up all the thoughts again.
Ok! Well, FWIW, it seems very likely to me that the brain learns via gradient descent, and indeed probable that it does something relevantly similar (though of course not identical) to backprop. (See the link above.) But I feel very much like an imposter discussing all this stuff, since I lack technical expertise. I’d be interested to hear your take on this stuff sometime, if you have one or want to make one! See also:
https://arxiv.org/abs/2006.04182 (Brains = predictive processing = backprop = artificial neural nets)
https://www.biorxiv.org/content/10.1101/764258v2.full (IIRC this provides support for Kaplan’s view that human ability to extrapolate is really just interpolation done by a bigger brain on more and better data.)
I’m currently on vacation, but I’d be interested in setting up a call once I’m back in 2 weeks! :) I’ll send you my calendly in PM