Ann comments on “Open Source AI” isn’t Open Source

Ann 15 Feb 2024 13:25 UTC
3 points
2
Have you met Mistral, Phi-2, Falcon, MPT, etc.? There are plenty of freely remixable models out there; some even link to their datasets and the recipes used to process them (though I wouldn’t be surprised if some relevant detail got left out because no one has yet established that it matters).

Though I’m reasonably sure the Llama license doesn’t prevent viewing the source (though of course not the training data), modifying it, understanding it, and remixing it. It’s a less open license than others, but Facebook didn’t just do a free-as-in-beer release of a compiled black box that you put on your computer and can never change; research was part of the purpose, and research needs to be able to do those things. It’s not the best open source license, but I’m not sure that being a good example of something is required to meet the definition.
- Davidmanheim 15 Feb 2024 13:57 UTC
  12 points
  8
  “Freely remixable” models don’t generally have open datasets used for training. If you know of one, that’s great, and would be closer to open source. (Not Mistral. And Phi-2 uses synthetic data from other LLMs; I don’t know what they released about the methods used to generate or select the text, but it’s not open.)
  But the entire point is that weights are not the source code for an LLM; they are the compiled program. Yes, they’re modifiable via LoRA and similar, but that’s not open source! Open source would mean I could replicate it from the ground up. For Facebook’s models, at least, the details of the training methods, the RLHF training they do, and where they get the data are all secrets. But they call it “Open Source AI” anyway.
  - Ann 15 Feb 2024 14:14 UTC
    3 points
    0
    Oh, I do; they’re just generally not the best available or most popular among hobbyists. Some I can find quickly enough are Pythia and OpenLLaMA, and some of the RedPajama models Together.ai trained on their own RedPajama dataset (which is freely available and described). (Also the already-mentioned Falcon and MPT, as well as StableLM. You might have to get into the weeds to find out how much of the data processing step is replicable.)
    
    (It’s going to be expensive to replicate any big pretrained model though, and the process is possibly not deterministic enough to do it perfectly, especially since datasets sometimes change when unsafe data is removed, the data-processing recipes include random selection and shuffling from the datasets, etc. In smaller cases, though, people who fine-tuned with the same recipe, whether coincidentally or intentionally, have gotten identical model weights.)
    - Davidmanheim 15 Feb 2024 15:49 UTC
      2 points
      0
      Thanks. RedPajama definitely looks like it fits the bill, but it shouldn’t need to bill itself as making “fully-open, reproducible models,” since that’s what “open source” is already supposed to mean. (Unfortunately, the largest model they have is 7B.)