My prior, not having looked too carefully at the post or the specific projects involved, is that any claims that an open source model is 90% as good as GPT-4, or indistinguishable from it, are probably hugely exaggerated or otherwise not a fair comparison. In general in ML, confirmation bias and overclaiming are very common, and as a base rate the vast majority of papers that claim some kind of groundbreaking result end up never having any real impact.
Also, I expect the facets of capabilities progress most relevant to existential risk to be especially strongly constrained by base model quality. I would agree that open source is probably better at squeezing performance out of small models, but my model is that this matters less for existential-risk-relevant capabilities progress (cf. the bitter lesson).
This comment has gotten lots of upvotes, but has anyone here tried Vicuna-13B?
I have. It seems pretty good (not obviously worse than ChatGPT 3.5) at short conversational prompts, but I haven't tried technical or reasoning tasks.
Are you implying that it is close to GPT-4 level? If so, that is clearly wrong. Especially with regard to code: everything (except maybe StarCoder, which was released literally yesterday) is worse than GPT-3.5, and much worse than GPT-4.
I’ve tried StarCoder recently, though, and it’s pretty impressive. I haven’t yet tried to really stress-test it, but at the very least it can generate basic code with a parameter count way lower than Copilot’s.
Does the code it writes actually work? ChatGPT can usually write a working Python module on its first try, and can make adjustments or fix bugs if you ask it to. All the local models I've tried so far could not stay coherent for something that long; in one case a model even tried to close a couple of Python blocks with curly braces. Maybe I'm just using the wrong settings.
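For anyone who wants to run the same kind of check locally, here is a minimal sketch of what I have in mind, using the Hugging Face transformers generate API. The bigcode/starcoder model id, the prompt, and the sampling settings are my own assumptions for illustration, not something established in this thread:

```python
# Minimal sketch: ask StarCoder for a small, self-contained Python function
# and eyeball whether the completion is coherent and runnable.
# Assumes the bigcode/starcoder checkpoint and the standard transformers generate() API.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "bigcode/starcoder"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# A short coding prompt; in my experience local models drift more on longer modules.
prompt = "def merge_sorted_lists(a: list[int], b: list[int]) -> list[int]:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.2,  # lower temperature tends to keep code generations more coherent
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If the "wrong settings" hypothesis is right, sampling temperature and top_p are the first knobs I would expect to matter for keeping longer code completions coherent.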