This was my impression too, and I’m glad someone else said it. When I try out past examples (from a week ago) of ChatGPT getting things wrong, I very often observe that it is correct now. Of course, annoyingly, people often report on GPT-4 capabilities when they actually tried GPT-3.5, but still, I feel like it has improved. Is it a crazy possibility that OpenAI continually trains GPT-4 and periodically swaps out the deployed model? As far as I can tell, the only source stating that GPT-5 is in training is the Morgan Stanley report, but what if it is actually not GPT-5, but rather a continually trained GPT-4, that is running on those GPUs?
Relatedly: is “reverse distillation” (ie, generating a model with more parameters from a smaller one) possible for these big transformer models? (I guess you can always stack more layers at the end, but surely that simple method has some drawbacks.) It would be useful to stay on the scaling curves without restarting from scratch with a larger model.
Relatedly: is “reverse distillation” (ie, generating a model with more parameters from a smaller one) possible for these big transformer models?
Yes. This fits under a couple of terms: hot-starting, warm initialization with model surgery a la OA5, slow weights vs fast weights / meta-learning, tied weights, etc. It’s also a fairly common idea in Neural Architecture Search, where you try to learn a small ‘cell’ or ‘module’ (either just the architecture or the weights as well) cheaply and then stack a bunch of them to get your final SOTA model, and these approaches can be combined, eg. SMASH. An example of using this to train very large models is “M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining”, Lin et al 2021. It seems appealing but raises questions about efficiency & bias: are you really still on the same scaling curve as the ‘true’ large model, given that the smaller model you are training almost by definition has a different (worse) scaling curve? And might you not be sabotaging your final model by hardwiring the weaknesses of the small initial model into it, rendering the approach penny-wise pound-foolish?
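To make the “stack more layers” / warm-starting idea concrete, here is a minimal sketch (assuming PyTorch; the function name `grow_depth`, the growth factor, and the hyperparameters are illustrative placeholders, not the OA5 or M6-10T recipe) of initializing a deeper Transformer encoder by duplicating the trained layers of a smaller one instead of starting from random weights:

```python
# Minimal sketch: grow a trained Transformer encoder to 2x depth by duplicating its layers,
# so the bigger model starts from the small model's weights rather than random initialization.
import copy
import torch
import torch.nn as nn

def grow_depth(small: nn.TransformerEncoder, growth_factor: int = 2) -> nn.TransformerEncoder:
    """Return a deeper encoder whose layers are copies of the small model's layers."""
    old_layers = list(small.layers)
    new_layers = []
    for _ in range(growth_factor):
        # Duplicate the full stack; appending or interleaving copies are both plausible choices.
        new_layers.extend(copy.deepcopy(layer) for layer in old_layers)
    big = nn.TransformerEncoder(
        encoder_layer=copy.deepcopy(old_layers[0]),  # template only; replaced below
        num_layers=len(new_layers),
    )
    big.layers = nn.ModuleList(new_layers)  # overwrite with the warm-started stack
    return big

# Usage: pretend `small` has already been (pre)trained for a while.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
small = nn.TransformerEncoder(layer, num_layers=6)
big = grow_depth(small, growth_factor=2)  # 12 layers, warm-started from the 6-layer model
x = torch.randn(4, 128, 256)
assert big(x).shape == x.shape
```

Whether the grown model lands back on the ‘true’ large model’s scaling curve is exactly the open question above; in practice such depth-grown models typically need further training to adapt the duplicated layers.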