I think the compiled binary analogy isn’t quite right. For instance, the vast majority of modifications and experiments people want to run are possible (and easiest) with just access to the weights in the LLM case.
As in, if you want to modify an LLM to be slightly different, access to the original training code or dataset is mostly unimportant.
(Edit: unlike the software case where modifying compiled binaries to have different behavior isn’t really doable without the source code.)
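To make that concrete, here's a minimal sketch of modifying a model with only its released weights, using the Hugging Face transformers and peft libraries. The checkpoint name and the one-step training snippet are illustrative assumptions, not a recommendation:

```python
# Sketch: adapting a released model using only its weights.
# The checkpoint is an assumed example; any open-weights causal LM works.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

checkpoint = "mistralai/Mistral-7B-v0.1"  # illustrative, not prescriptive
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# LoRA adapters change the model's behavior; the original training
# corpus and training code never enter the picture.
model = get_peft_model(
    model, LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batch = tokenizer("Your own fine-tuning text goes here.", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss  # causal-LM loss
loss.backward()
optimizer.step()
```

Nothing above needs anything beyond the weights themselves, which is the point: the experiments people actually run start from the released checkpoint.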
The vast majority of uses of software are via changing configuration and inputs, not modifying code and recompiling. (Though lots of Software as a Service doesn’t even let you change configuration directly.) But software is not open in this sense unless you can recompile it, because otherwise you don’t actually have full access to what was used to build it.
The same is the case for what Facebook calls open-source LLMs: releasing the weights doesn’t give you full access to what was used to build the model.
The vast majority of ordinary uses of LLMs (e.g. when using ChatGPT) are via changing and configuring inputs, not modifying code or fine-tuning the model. This still seems analogous to ordinary software, in my opinion, making Ryan Greenblatt’s point apt.
(But I agree that simply releasing model weights is not fully open source. I think these things exist on a spectrum. Releasing model weights could be considered a form of partially open sourcing the model.)
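For contrast, the “changing and configuring inputs” mode of use looks like this; a hedged sketch with the OpenAI Python client, where the model name and parameters are just examples:

```python
# Sketch: ordinary LLM use is configuring inputs, not touching the model.
# Assumes OPENAI_API_KEY is set in the environment; model name is an example.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",          # whichever model you happen to use
    temperature=0.2,              # "configuration"
    messages=[                    # "inputs"
        {"role": "system", "content": "Answer in one sentence."},
        {"role": "user", "content": "What does 'open weights' mean?"},
    ],
)
print(response.choices[0].message.content)
```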
I agree that releasing model weights is “partially open sourcing”, in much the same way that freeware is “partially open sourcing” software, or that a restrictive licence with code availability is.
But that’s exactly the point: you don’t get to call something X because it’s kind of like X; it needs to actually fulfill the requirements in order to get the label. What is being called Open Source AI doesn’t actually fulfill them.
I don’t think that’s accurate. The data is the code; the model is just the binary format the code gets compiled to.
Yeah. I think it is fairly arguable that open datasets aren’t required for open source: a dataset isn’t the form you’d prefer to modify as a programmer, and it isn’t exactly code to start with. Shakespeare didn’t write his plays as programming instructions for algorithms to generate Shakespeare-like plays. No one wants a trillion tokens that take ~$200k to ‘compile’ as their starting point once someone else has already done that and made the result available. (Hyperbolic; but the reasons someone wants the dataset generally aren’t the same reasons they’d want to compile from source code.) Open datasets are nice information to have, but they lack a reasonable reproduction cost and offer little direct utility beyond explanation.
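For what it’s worth, the ~$200k figure is roughly what a back-of-envelope estimate gives, using the standard ~6·N·D rule of thumb for training FLOPs; every number below is an illustrative assumption:

```python
# Back-of-envelope 'compile' cost for a trillion-token training run.
# All figures are assumptions for illustration, not measurements.
params = 7e9                       # assumed model size (a 7B model)
tokens = 1e12                      # a trillion training tokens
train_flops = 6 * params * tokens  # common ~6*N*D training-FLOPs estimate

gpu_peak = 3.12e14                 # assumed A100 bf16 peak, FLOP/s
utilization = 0.4                  # assumed model FLOPs utilization
price = 2.00                       # assumed $ per GPU-hour

gpu_hours = train_flops / (gpu_peak * utilization) / 3600
print(f"~{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * price:,.0f}")
# -> roughly 93,000 GPU-hours, on the order of $190k
```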
The case that Llama is not open source is made much more strongly by the restrictions on usage, as noted in the prior discussion. There is a meaningful distinction between an open dataset and a closed dataset, but I’d describe the jump from Mistral to Llama as going from ‘open source with a hidden supply chain’ to ‘open weights with restrictive licensing’, whereas the jump from Mistral to RedPajama is more ‘open source with a hidden supply chain’ to ‘open source with a revealed supply chain’.
I agree that the reasons someone wants the dataset generally aren’t the same reasons they’d want to compile from source code. But there’s a lot of research utility in having access to the dataset even if you never recompile: checking whether there was test-set leakage inflating benchmark metrics, for example, or assessing how much of an LLM’s ability is stochastic parroting of specific passages versus recombination. If the model were actually open, these wouldn’t be hidden from researchers.
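As a sketch of the first of those checks, assuming you actually had the corpus and the benchmark in hand, a simple verbatim n-gram overlap scan (the 13-token window is a common but arbitrary choice):

```python
# Sketch: flag benchmark test items whose n-grams appear verbatim in the
# training corpus. File formats and helper names are hypothetical.
def ngrams(text: str, n: int = 13) -> set[str]:
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def find_leakage(test_items: list[str], corpus_docs, n: int = 13) -> list[str]:
    corpus_grams: set[str] = set()
    for doc in corpus_docs:          # stream the corpus one document at a time
        corpus_grams |= ngrams(doc, n)
    return [item for item in test_items if ngrams(item, n) & corpus_grams]

# Illustrative usage (load_benchmark / stream_corpus are hypothetical helpers):
# contaminated = find_leakage(load_benchmark("test.jsonl"),
#                             stream_corpus("train.jsonl"))
```

None of this is possible when the dataset is withheld, which is exactly the kind of thing “actually open” should cover.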
And supply chain is a reasonable analogy, but many open-source advocates make sure that their code doesn’t depend on closed or proprietary libraries. It’s not actually “libre” if you need a closed-source component, or need to pay someone, to make the thing work. Some advocates, including those who built or control quite a lot of the open-source ecosystem, also put effort into ensuring that the entire toolchain needed to compile their code is open, because replicability shouldn’t be contingent on companies that can restrict usage or hide things in the code. That isn’t strictly required, but it’s certainly relevant.