Experts in MoE transformers are just smaller MLPs[1] within each of the dozens of layers, and when processing a given prompt they can be thought of as instantiated on top of each of the thousands of tokens. Each expert does only a single step of computation, not big enough to implement much of anything meaningful. There are only vague associations between individual experts and any coherent concepts at all.
For example, in DeepSeek-V3, which is an MoE transformer, there are 257 experts in each of layers 4-61[2] (so about 15K experts in total), and each expert consists of two 2048×7168 matrices, about 30M parameters per expert, out of the model's 671B parameters overall.
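To make the shape of one expert concrete, here is a minimal sketch in plain PyTorch using the dimensions quoted above; the class name, variable names, and the choice of ReLU as the nonlinearity are my own placeholders, not taken from the DeepSeek-V3 implementation:

```python
import torch
import torch.nn as nn

HIDDEN = 7168        # model hidden size
EXPERT_INNER = 2048  # expert intermediate size

class Expert(nn.Module):
    """One MoE expert: a small two-matrix MLP applied to a single token's vector."""
    def __init__(self):
        super().__init__()
        self.up = nn.Linear(HIDDEN, EXPERT_INNER, bias=False)    # 7168 -> 2048
        self.down = nn.Linear(EXPERT_INNER, HIDDEN, bias=False)  # 2048 -> 7168

    def forward(self, x):
        # a single step of computation: matrix multiply, nonlinearity, matrix multiply
        return self.down(torch.relu(self.up(x)))

expert = Expert()
params_per_expert = sum(p.numel() for p in expert.parameters())
print(f"params per expert: {params_per_expert:,}")  # 2 * 2048 * 7168 = 29,360,128 (~30M)
print(f"number of experts: {58 * 257:,}")           # 257 experts in each of layers 4-61 (~15K)
```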
I have to admit I was on the bad side of the Dunning–Kruger curve haha. I thought I understood it, but actually I understood so little that I didn't know what I needed to understand.
Multilayer perceptrons: multiplication by a big matrix, followed by a nonlinearity, followed by multiplication by another big matrix.
Section 4.2 of the report, “Hyper-Parameters”.
Oops, you’re right! Thank you so much.