What are the rumors? I’m only aware of MoE.
Yes, the main rumor is that it’s a mixture-of-experts. This is already quite a difference from a single Transformer.
We presume that these experts are mostly built from Transformer components (with possible additions and modifications we don’t know about), but we don’t know how independent those experts are: whether they share a sizeable common initial computation and then branch off from it, or whether it’s something else entirely, with some kind of dynamic sparse routing through a single network, and so on… I think it’s unlikely to be “just take a bunch of GPT-3s, run an appropriate subset of them in parallel, and combine the results”.
There is a huge diversity of techniques combining MoE motifs with motifs associated with Transformers; see, e.g., this collection of references: https://github.com/XueFuzhao/awesome-mixture-of-experts
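To make the “dynamic sparse routing through a single network” idea concrete, here is a minimal sketch of the standard top-k routed MoE layer (in the GShard/Switch-Transformer style) in PyTorch. All the names, sizes, and the dense dispatch loop are my own illustrative choices; none of this is a claim about how GPT-4 is actually built.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """A sparsely-routed mixture-of-experts layer (GShard/Switch style).

    Each "expert" is a Transformer-style feed-forward block; a learned router
    sends every token to its top-k experts. Purely illustrative; not a claim
    about GPT-4's actual architecture.
    """
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (batch, seq, d_model)
        logits = self.router(x)                  # (batch, seq, num_experts)
        weights, indices = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the chosen k
        out = torch.zeros_like(x)
        # Dense loop over experts for readability; production systems dispatch
        # tokens to experts in parallel and add load-balancing losses,
        # capacity limits, etc.
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = indices[..., slot] == e   # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoELayer()
tokens = torch.randn(2, 16, 512)
print(layer(tokens).shape)  # torch.Size([2, 16, 512])
```

Note that in this sketch the experts share everything outside the layer (attention, embeddings, the router itself), which is one way “experts” can be far from independent models.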
So we really don’t know; these rumors are only enough to make some partial guesses.
If we survive for a while, all this will eventually become public knowledge, and we’ll probably eventually understand how the magic of GPT-4 is possible.