beren comments on The surprising parameter efficiency of vision models

beren 9 Apr 2023 19:17 UTC
4 points
0
The OP is mistaken about visual transformers; they can also exploit parameter sharing, just in a different way.
Can you expand on this? How do vision transformers exploit parameter sharing in a way that is not available to standard LLMs?
- jacob_cannell 9 Apr 2023 19:55 UTC
  6 points
  1
  Consider a vision transformer (or, more generally, an RNN) that predicts the entire image at once, and thus has hidden states larger than the image because of depth, bottleneck layers, and so on. That obviously wouldn’t exploit weight sharing at all, but it is really the only option if you are running a transformer or RNN on an ultra-slow, ultra-wide 100 Hz neuromorphic computer like the brain and have tight latency constraints.
  
  But of course that isn’t the only or most sensible option on a GPU. You can instead run a much smaller transformer/RNN over a stream of image patches rather than over the entire image at once, which naturally exploits weight sharing, very much like CNNs. Ultimately, vision transformers and CNNs both map to matrix multiplication, which always involves weight sharing (the same weight matrix is reused for every patch or position it multiplies). The interesting flip side is that a brain-like architecture (a massive RNN) doesn’t naturally map to matrix multiplication at all and thus can’t easily exploit GPU acceleration.
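
To make the weight-sharing point in the reply above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not anything from the thread: the image size, patch size, embedding width, and the helper names `patchify` and `matmul` are all hypothetical. It shows that projecting every patch with one small matrix is a single matrix multiplication that reuses the same weights for each patch, and it compares the parameter count with a dense layer that maps the whole image to the same number of outputs.

```python
import random

def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0
    return [
        [image[top + i][left + j] for i in range(patch) for j in range(patch)]
        for top in range(0, h, patch)
        for left in range(0, w, patch)
    ]

def matmul(a, b):
    """Naive matrix product of a (n x k) and b (k x m), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

random.seed(0)
H = W = 32   # assumed image size
P = 8        # assumed patch size
D = 16       # assumed embedding width per patch

image = [[random.gauss(0.0, 1.0) for _ in range(W)] for _ in range(H)]
patches = patchify(image, P)  # one row of P*P pixels per patch

# One small weight matrix, shared by every patch.
shared_w = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(P * P)]

# A single matmul applies the same weights to all patches at once.
embedded = matmul(patches, shared_w)

# Equivalent per-patch loop: the identical matrix is reused each time.
looped = [matmul([p], shared_w)[0] for p in patches]

num_patches = len(patches)
shared_params = (P * P) * D                # weights in the shared projection
dense_params = (H * W) * (num_patches * D) # dense layer over the whole image, same output size

print(num_patches, shared_params, dense_params)
```

With these assumed sizes the shared projection needs 256 times fewer weights than the whole-image dense layer, while producing an output of the same shape. The per-patch loop and the single matrix product give the same result, which is the sense in which the patch-stream approach maps onto matrix multiplication.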