I think the Engadget article failed to capture the relevant info, so I'm just putting my preliminary thoughts down here. I expect my thoughts to change as more info is revealed/translated.
Loss on the dataset (for cross-entropy, this is measured in bits per token or per character, or equivalently reported as perplexity) is a more important metric than parameter count, in my opinion.
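To make the units concrete, here is a minimal sketch (my own illustration with made-up numbers, not anything from the article) of how the same cross-entropy loss can be reported as bits per token, perplexity, or bits per character:

```python
import math

# Hypothetical numbers purely for illustration.
loss_nats_per_token = 2.0                           # mean cross-entropy in nats/token
bits_per_token = loss_nats_per_token / math.log(2)  # same loss expressed in bits/token
perplexity = math.exp(loss_nats_per_token)          # equivalently 2 ** bits_per_token

# To compare models with different tokenizers, people often renormalize to
# bits per character using the corpus's average tokens-per-character ratio.
tokens_per_char = 0.25                              # hypothetical: ~4 characters per token
bits_per_char = bits_per_token * tokens_per_char

print(f"{bits_per_token:.2f} bits/token, perplexity {perplexity:.1f}, "
      f"{bits_per_char:.2f} bits/char")
```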
However, I think parameter count does matter at least somewhat, because it is a signal for:
* the amount of resources available to the researchers (it is very expensive to do very large runs; see the rough numbers sketched below)
* the amount of engineering capacity the project has access to (it is difficult to write code that functions well at that scale; even getting a working 1.7T-parameter training loop to run is nontrivial)
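As a back-of-the-envelope illustration of that first point (all numbers here are my own rough assumptions, not figures from the article):

```python
# All figures are rough assumptions for illustration only.
params = 1.7e12                    # ~1.7T parameters
fp16_weight_bytes = 2 * params     # weights alone, in half precision
# Mixed-precision Adam training commonly needs ~16 bytes/param
# (fp16 weights + fp16 grads + fp32 master weights + two fp32 moments).
training_state_bytes = 16 * params
accelerator_bytes = 80e9           # one hypothetical 80 GB accelerator

print(f"weights alone: ~{fp16_weight_bytes / 1e12:.1f} TB")
print(f"training state: ~{training_state_bytes / 1e12:.1f} TB")
print(f"accelerators needed just to hold the training state: "
      f"~{training_state_bytes / accelerator_bytes:.0f}")
```

Even before thinking about throughput, the weights and optimizer state have to be sharded across hundreds of devices, which is a large part of why the training loop itself is hard to get right.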
I expect more performance numbers at some point, on the standard set of benchmarks.
I also expect to be very interested in how they release/share/license the model (if at all), and who is allowed access to it.
If I understood correctly, the model was trained on Chinese text and was probably quite expensive to train.
Do you know whether these Chinese models usually get “translated” to English, or whether there is a “fair” way of comparing models that were (mainly) trained on different languages (I’d imagine that even the tokenization might be quite different for Chinese)?
In my experience, I haven’t seen a good “translation” process—instead models are pretrained on bigger and bigger corpora which include more languages.
GPT-3 was trained on data that was mostly English, but (AFAICT) it is also able to generate other languages.
For some English-dependent benchmarks (SuperGLUE, Winogrande, LAMBADA, etc.), I expect a model trained primarily on non-English corpora would do worse.
Also, yes, I would expect the tokenization to be quite different for a largely different corpus.
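As a rough illustration of the tokenization point (my own sketch, assuming the Hugging Face `transformers` GPT-2 tokenizer; the Chinese sentence is just an approximate translation I wrote for the example):

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

english = "The model was trained on a very large corpus."
chinese = "该模型是在一个非常大的语料库上训练的。"  # roughly the same sentence in Chinese

for text in (english, chinese):
    n_tokens = len(tok.encode(text))
    print(f"{len(text):3d} chars -> {n_tokens:3d} tokens")

# I'd expect roughly one token per short English word here, but ~2-3 tokens
# per Chinese character, since GPT-2's byte-level BPE (fit on mostly English
# text) falls back to raw UTF-8 bytes for characters it rarely saw.
```

So even a "per-token" loss means something different depending on the tokenizer, which is part of why per-character (or per-byte) numbers are easier to compare across languages.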