As I understand it, Google’s proposed model is a MoE model, and I’ve heard MoE models achieve poorer understanding for equivalent parameter count than classical transformer models do.
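For intuition on why raw parameter count flatters sparse models, here is a minimal sketch of a top-k mixture-of-experts feed-forward layer (not taken from any particular Google paper; the layer sizes, expert count, and `k` are arbitrary placeholders). Only `k` of the experts run per token, so most of the layer's parameters sit idle on any given forward pass, which is part of why a sparse model's headline parameter count is not directly comparable to a dense model's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative top-k mixture-of-experts feed-forward layer."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=64, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.gate(x)                    # (tokens, n_experts)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Each token is routed to only k experts, so the parameters actually
        # exercised per token are ~k/n_experts of the layer's total count.
        for slot in range(self.k):
            for e in topk_idx[:, slot].unique():
                mask = topk_idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

layer = TopKMoE()
total = sum(p.numel() for p in layer.parameters())
active = sum(p.numel() for p in layer.experts[0].parameters()) * layer.k
print(f"total params: {total:,}; params touched per token: ~{active:,}")
```

With these placeholder sizes the layer holds ~134M parameters but only ~4M are touched per token, so comparing it to a dense model of "equivalent parameter count" overstates its effective capacity.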
It might be more useful to discuss Google’s dense GPT-like LaMDA-137b instead, because there’s so little information about Pathways or MUM. (We also know relatively little about the Wu Dao series of competing multimodal sparse models.) Google papers refuse to name LaMDA when they use it, for unclear reasons (it’s not like they’re fooling anyone), but they’ve been doing interesting OA-like research with it: eg “Program Synthesis with Large Language Models”, “Finetuned Language Models Are Zero-Shot Learners”, or text style transfer.