Ahh, this is fairly simple and obvious to anyone coming from computer graphics: the closer you get to low-level physics (cellular automata), the more effective parameter sharing is. Conceptualize the space of all physics approximation functions, organized by spatiotemporal scale. At the finest, most detailed scale, physics is uniform across space and time, so a small amount of code/params describes all of spacetime. As you move up the approximation/abstraction tree to larger spatial and temporal scales, code complexity and specificity increase, with language models sitting at the top.
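To make the fine-scale end of that spectrum concrete, here is a minimal sketch (Python/NumPy, with toy sizes of my own choosing, not anything from the thread): a cellular-automaton update where one tiny local rule is shared across every cell and every timestep, so a handful of lines of "physics" generates an arbitrarily large spacetime.

```python
import numpy as np

def ca_step(grid: np.ndarray) -> np.ndarray:
    """One step of Conway's Game of Life: the same local rule is
    applied at every cell, i.e. total parameter sharing."""
    # Count the 8 neighbors of every cell via shifted copies of the grid.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Birth on exactly 3 neighbors, survival on 2 or 3.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(grid.dtype)

# The "physics" fits in a few lines; the state can be arbitrarily large and
# evolve for arbitrarily many steps with no new parameters anywhere.
grid = (np.random.rand(128, 128) < 0.2).astype(np.uint8)
for _ in range(100):
    grid = ca_step(grid)
```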
At the lowest levels of vision, all the features are spatially invariant, so the natural connection matrix is highly compressible through simple weight sharing, but that compressibility diminishes exponentially as you advance up the abstraction levels.
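A rough illustration of how much that buys you at the lowest level (a sketch with made-up layer sizes, not numbers from the thread): the same local feature detectors, either stored once per spatial position or shared across all of them.

```python
# Hypothetical early-vision layer: 64x64 single-channel input, 16 feature maps,
# 3x3 local receptive fields.
H = W = 64
C_in, C_out, K = 1, 16, 3

# Fully connected ("unshared") map from every input pixel to every output unit.
fc_params = (H * W * C_in) * (H * W * C_out)

# Convolutional version: one 3x3 filter bank shared across all spatial positions.
conv_params = C_out * C_in * K * K

print(f"fully connected: {fc_params:,} weights")   # ~268 million
print(f"convolutional:   {conv_params:,} weights")  # 144
```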
But this would apply to the visual cortex as well right? So it doesn’t explain the discrepancy.
It does of course apply to the visual cortex, so I don't understand your comment. Essentially the answer is #2 in the OP's list: CNNs are like the visual cortex but highly compressed through weight sharing, which is easy for a von Neumann machine but isn't really feasible for a slow neuromorphic computer like the brain.
The OP is mistaken about vision transformers; they can also exploit parameter sharing, just in a different way.
Can you expand on this? How do vision transformers exploit parameter sharing in a way that is not available to standard LLMs?
Consider a vision transformer (or more generally an RNN) that predicts the entire image at once, and thus has hidden states larger than the image due to depth, bottleneck layers, etc. That obviously wouldn't exploit weight sharing at all, but it is really the only option if you are running a transformer or RNN on an ultra-slow, ultra-wide ~100 Hz neuromorphic computer like the brain under tight latency constraints.
But of course that isn't the only, or the most sensible, option on a GPU. Instead you can run a much smaller transformer/RNN over a stream of image patches rather than the entire image at once, which then naturally exploits weight sharing very much like a CNN does. Ultimately, vision transformers and CNNs both map to matrix multiplication, which always involves weight sharing. The interesting flip side is that a brain-like architecture (a massive RNN) doesn't naturally map to matrix multiplication at all, and thus can't easily exploit GPU acceleration.
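A minimal sketch of that point (NumPy; the shapes and names are my own, chosen to look ViT-like): the same patch-projection weights are reused for every patch, so embedding the whole patch sequence collapses into a single matrix multiplication, which is exactly the shared-weight, matmul-friendly structure a GPU wants.

```python
import numpy as np

# Hypothetical sizes: a 224x224 image cut into 16x16 patches, embedded to 384 dims.
H = W = 224
P = 16                            # patch size
D = 384                           # embedding dimension
n_patches = (H // P) * (W // P)   # 196 patches

image = np.random.rand(H, W)

# Cut the image into flattened patches: (n_patches, P*P).
patches = (
    image.reshape(H // P, P, W // P, P)
         .transpose(0, 2, 1, 3)
         .reshape(n_patches, P * P)
)

# One shared projection matrix, reused for every patch (the weight sharing).
W_embed = np.random.randn(P * P, D) * 0.02

# Applying the shared weights to all patches is a single matmul.
tokens = patches @ W_embed          # (n_patches, D)

# A per-patch loop gives the same result; the matmul just batches the sharing.
assert np.allclose(tokens[0], patches[0] @ W_embed)
```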
Oh, I see what you’re saying now. Thanks for clarifying.