Keep in mind that if, hypothetically, there were major compute efficiency tricks to be had, they would likely not be shared publicly. So the absence of publicly known techniques is not strong evidence in either direction.
Also, in general I start from a prior of being skeptical of papers claiming their models are comparable to or better than GPT-4. It’s very easy to mislead with statistics—for example, human preference comparisons depend very heavily on the task distribution and on how discerning the raters are. I have not specifically looked deeply into Llama 405B though.
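To make the task-distribution point concrete, here is a toy sketch with made-up win rates (purely hypothetical numbers, not drawn from any actual eval): the same pair of models can land on either side of a 50% aggregate preference rate depending only on how the eval set is weighted across task types.

```python
# Toy illustration with hypothetical per-task preference rates for "model A vs. baseline".
per_task_win_rate = {"coding": 0.60, "creative_writing": 0.45}  # P(raters prefer model A)

def overall_win_rate(task_mix):
    """Aggregate win rate under a given distribution over task categories."""
    return sum(task_mix[task] * per_task_win_rate[task] for task in task_mix)

# Same models, different eval mixes:
print(overall_win_rate({"coding": 0.8, "creative_writing": 0.2}))  # 0.57 -> "beats" the baseline
print(overall_win_rate({"coding": 0.2, "creative_writing": 0.8}))  # 0.48 -> "loses" to the baseline
```

The per-task numbers never change; only the weighting does, which is why a headline win rate by itself tells you little without knowing the task mix and the raters.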
That’s true, though I do think there are various proxies that make at least the extreme end of this kind of thing relatively easy to rule out for currently deployed models (for example, the compute-purchase and allocation decisions of the major cloud providers that host some of these models, as well as staffing and various other signals).
I do think organizations that claim parity with GPT-4 or Sonnet are almost always overstating things. My experience with Llama 405B suggests it is not at the level of Claude 3.5 Sonnet, but it does seem to be at the level of the original GPT-4, though I am not confident in that since I haven’t played around with GPT-4 much recently.