Experts in MoE transformers are just smaller MLPs[1] within each of the dozens of layers, and when processing a given prompt they can be thought of as instantiated on top of each of the thousands of tokens. Each expert does only a single step of computation, not big enough to implement much of anything meaningful. There are only vague associations between individual experts and any coherent concepts at all.
For example, in DeepSeek-V3, which is an MoE transformer, there are 257 experts in each of layers 4-61[2] (so about 15K experts in total), and each expert consists of two 2048×7168 matrices, about 30M parameters per expert, out of the model's 671B parameters overall.
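To make the shape of one expert concrete, here is a minimal sketch in plain PyTorch using the dimensions quoted above; the class name, variable names, and the choice of ReLU as the nonlinearity are my own placeholders, not taken from the DeepSeek-V3 implementation:

```python
import torch
import torch.nn as nn

HIDDEN = 7168        # model hidden size
EXPERT_INNER = 2048  # expert intermediate size

class Expert(nn.Module):
    """One MoE expert: a small two-matrix MLP applied to a single token's vector."""
    def __init__(self):
        super().__init__()
        self.up = nn.Linear(HIDDEN, EXPERT_INNER, bias=False)    # 7168 -> 2048
        self.down = nn.Linear(EXPERT_INNER, HIDDEN, bias=False)  # 2048 -> 7168

    def forward(self, x):
        # a single step of computation: matrix multiply, nonlinearity, matrix multiply
        return self.down(torch.relu(self.up(x)))

expert = Expert()
params_per_expert = sum(p.numel() for p in expert.parameters())
print(f"params per expert: {params_per_expert:,}")  # 2 * 2048 * 7168 = 29,360,128 (~30M)
print(f"number of experts: {58 * 257:,}")           # 257 experts in each of layers 4-61 (~15K)
```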
I have to admit I was on the bad side of the Dunning–Kruger curve haha. I thought I understood it, but actually I understood so little that I didn't know what I needed to understand.
Multilayer perceptrons: multiplication by a big matrix, followed by a nonlinearity, followed by multiplication by another big matrix.
Section 4.2 of the report, “Hyper-Parameters”.
Oops, you’re right! Thank you so much.