Yeah. I think it is fairly arguable that open data sets aren’t required for open source; it’s not the form you’d prefer to modify it in as a programmer, and they’re not exactly code to start with, Shakespeare didn’t write his plays as programming instructions for algorithms to generate Shakespeare-like plays. No one wants a trillion tokens that take ~$200k to ‘compile’ as their starting point to build from after someone else has already done that and made it available. (Hyperbolic; but the reasons someone wants that generally aren’t the same reasons they’d want to compile from source code.) Open datasets are nice information to have, but lack a reasonable reproduction cost or much direct utility beyond explanation.
Llama’s state as not open source is much more strongly made by the restrictions on usage, as noted in the prior discussion. There is a meaningful distinction between open dataset and closed dataset, but I’d put the jump between Mistral and Llama to be from ‘open source with hidden supply chain’ to ‘open weights with restrictive licensing’, where the jump between Mistral and RedPajama is more ‘open source with hidden supply chain’ to ‘open source with revealed supply chain’.
I agree that the reasons someone wants the dataset generally aren’t the same reasons they’d want to compile from source code. But there’s a lot of utility for research in having access to the dataset even if you don’t recompile. Checking whether there was test-set leakage for metrics, for example, or assessing how much of LLM ability is stochastic parroting of specific passages versus recombination. And if it was actually open, these would not be hidden from researchers.
And supply chain is a reasonable analogy—but many open-source advocates make sure that their code doesn’t depend on closed / proprietary libraries. It’s not actually “libre” if you need to have a closed source component or pay someone to make the thing work. Some advocates, those who built or control quite a lot of the total open source ecosystem, also put effort into ensuring that the entire toolchain needed to compile their code is open, because replicability shouldn’t be contingent on companies that can restrict usage or hide things in the code. It’s not strictly required, but it’s certainly relevant.
Yeah. I think it is fairly arguable that open data sets aren’t required for open source; it’s not the form you’d prefer to modify it in as a programmer, and they’re not exactly code to start with, Shakespeare didn’t write his plays as programming instructions for algorithms to generate Shakespeare-like plays. No one wants a trillion tokens that take ~$200k to ‘compile’ as their starting point to build from after someone else has already done that and made it available. (Hyperbolic; but the reasons someone wants that generally aren’t the same reasons they’d want to compile from source code.) Open datasets are nice information to have, but lack a reasonable reproduction cost or much direct utility beyond explanation.
Llama’s state as not open source is much more strongly made by the restrictions on usage, as noted in the prior discussion. There is a meaningful distinction between open dataset and closed dataset, but I’d put the jump between Mistral and Llama to be from ‘open source with hidden supply chain’ to ‘open weights with restrictive licensing’, where the jump between Mistral and RedPajama is more ‘open source with hidden supply chain’ to ‘open source with revealed supply chain’.
I agree that the reasons someone wants the dataset generally aren’t the same reasons they’d want to compile from source code. But there’s a lot of utility for research in having access to the dataset even if you don’t recompile. Checking whether there was test-set leakage for metrics, for example, or assessing how much of LLM ability is stochastic parroting of specific passages versus recombination. And if it was actually open, these would not be hidden from researchers.
And supply chain is a reasonable analogy—but many open-source advocates make sure that their code doesn’t depend on closed / proprietary libraries. It’s not actually “libre” if you need to have a closed source component or pay someone to make the thing work. Some advocates, those who built or control quite a lot of the total open source ecosystem, also put effort into ensuring that the entire toolchain needed to compile their code is open, because replicability shouldn’t be contingent on companies that can restrict usage or hide things in the code. It’s not strictly required, but it’s certainly relevant.