But this would apply to the visual cortex as well, right? So it doesn’t explain the discrepancy.
It does of course apply to the visual cortex, so I don’t understand your comment. Essentially the answer is #2 in the OP’s list. CNNs are like the visual cortex but highly compressed through weight sharing, which is easy for a von Neumann machine but isn’t really feasible for a slow neuromorphic computer like the brain.
The OP is mistaken about vision transformers; they can also exploit parameter sharing, just in a different way.
Can you expand on this? How do vision transformers exploit parameter sharing in a way that is not available to standard LLMs?
Consider a vision transformer (or more generally an RNN) which predicts the entire image at once, and thus has hidden states that are larger than the image due to depth, bottleneck layers, etc. That obviously wouldn’t exploit weight sharing at all, but it is really the only option if you are running a transformer or RNN on an ultra-slow, ultra-wide 100 Hz neuromorphic computer like the brain and have tight latency constraints.
But of course that isn’t the only or most sensible option on a GPU. Instead you can run a much smaller transformer/RNN over a stream of image patches rather than over the entire image at once, which then naturally exploits weight sharing very much like CNNs do. Ultimately vision transformers and CNNs both map to matrix multiplication, which always involves weight sharing: the same weight matrix is reused for every patch or position in the batch. The interesting flip side is that a brain-like architecture (a massive RNN) doesn’t naturally map to matrix multiplication at all and thus can’t easily exploit GPU acceleration.
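To make the patch argument concrete, here is a rough sketch in plain Python. The image size, patch size, and feature width are made-up numbers, and the helper names are hypothetical. It is not how any particular vision transformer is implemented; it only compares the parameter count of one unshared layer over the whole image against one small weight matrix reused for every patch.

```python
import random

# All sizes are illustrative assumptions, not taken from any real model.
H = W = 32   # grayscale image height and width
P = 8        # patch side length
D = 16       # output features per patch

def dense_param_count(h, w, p, d):
    """Weights for one unshared layer: every pixel connects to every output feature."""
    n_patches = (h // p) * (w // p)
    return (h * w) * (n_patches * d)

def shared_param_count(p, d):
    """Weights for one patch projection that is reused for every patch."""
    return (p * p) * d

def extract_patches(image, p):
    """Split a 2D list into flattened, non-overlapping p x p patches."""
    patches = []
    for r in range(0, len(image), p):
        for c in range(0, len(image[0]), p):
            patches.append([image[r + i][c + j] for i in range(p) for j in range(p)])
    return patches

def matmul(a, b):
    """Naive matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

random.seed(0)
image = [[random.random() for _ in range(W)] for _ in range(H)]
# A single weight matrix, shared by all patches.
weights = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(P * P)]

patches = extract_patches(image, P)   # 16 patches, 64 pixels each
features = matmul(patches, weights)   # one matmul applies the same weights to every patch

print(dense_param_count(H, W, P, D))  # 262144
print(shared_param_count(P, D))       # 1024
```

With these toy numbers the shared version needs 256 times fewer weights for the same output shape, and the whole thing is a single matrix multiplication over the stack of patches, which is the kind of workload a GPU handles well.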
Oh, I see what you’re saying now. Thanks for clarifying.