Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.
Wow, that’s good, right?
Yes. How good is up for debate, but it’s definitely good.
But how good can it be, realistically? I will be very surprised if all these details aren't leaked within the next week. Maybe they will try to plant several false leaks to muddy things a bit.
It could leak when OAI employees take an offer to work at another lab.
Strong agreement here. I find it unlikely that most of these details will still be concealed after 3 months or so: for that to happen, no one could infer any of these details and there could be no leak, and that combination seems unlikely.
Regarding the original thread, I do agree that OpenAI’s move to conceal the details of the model is a Good Thing, as this step is risk-reducing and creates / furthers a norm for safety in AI development that might be adopted elsewhere. Nonetheless, the information being concealed seems likely to become known soon, in my mind, for the general reasons I outlined in the previous paragraph.
You can definitely infer quite a bit from the paper and authors by section, but there is a big difference between a plausible informed guess and knowing. For most purposes, weak inferences are not too useful. ‘Oh, this is Chinchilla, this is VQ-VAE, this is Scaling Transformer...’ For example, the predicting-scaling part (and Sam Altman singling out the author for praise) is clearly the zero-shot hyperparameter work, but that’s not terribly helpful, because the whole point of scaling laws (and the mu work in particular) is that if you don’t get it right, you’ll fall off the optimal scaling curves badly when you try to scale up 10,000x to GPT-4 (never mind the GPT-5 OA has in progress), and you probably can’t just apply the papers blindly; you need to reinvent whatever he invented since and accumulate the same data, with no guarantee you’ll succeed. Not a great premise on which to spend $1b or so. If you’re a hyperscaler not already committed to the AI arms race, this is not enough information, or reliable enough, to move the needle on your major strategic decision. Whereas if they had listed exact formulas or results (especially the negative results), it might have been enough of a roadmap to kickstart another competitor a few months or years earlier.
By the zero-shot hyperparameter work do you mean https://arxiv.org/abs/2203.03466 “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer”? I’ve been sceptical of NTK-based theory, seems I should update.
Is there even enough training data for GPT-5? (Assuming its goal is to 50x or 100x GPT-4)
Not public data, at least.
Yep, but of course the common opinion on Hacker News is that this is horrible.
something something silver linings...