Oh, I do, they’re just generally not the best available or the most popular with hobbyists. Some I can name off the top of my head are Pythia and OpenLLaMA, plus the RedPajama models that Together.ai trained on their own RedPajama dataset (which is freely available and documented). (Also the previously mentioned Falcon and MPT, as well as StableLM, though you might have to get into the weeds to find out how much of their data-processing step is actually replicable.)
(It’s going to be expensive to replicate any big pretrained model, though, and possibly not deterministic enough to do perfectly: datasets sometimes change over time as unsafe data is removed, the data-processing recipes involve random selection and shuffling from the source datasets, and so on. There are smaller-scale examples, though, where people who fine-tuned with the same recipe, coincidentally or intentionally, ended up with identical model weights.)
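To make the shuffling point concrete, here’s a minimal Python sketch (purely illustrative, not taken from any of the above projects’ pipelines): unless the exact seed and resulting data order are recorded as part of the recipe, two runs of the “same” preprocessing won’t feed the model examples in the same order, and bit-identical weights are basically off the table.

```python
# Illustrative only: shows why a data-processing recipe needs a pinned seed
# to be reproducible. The document names and seed values are made up.
import random

documents = [f"doc_{i}" for i in range(10)]  # stand-in for a pretraining corpus

def shuffled_copy(docs, seed):
    rng = random.Random(seed)  # fixed seed -> deterministic shuffle order
    out = list(docs)
    rng.shuffle(out)
    return out

run_a = shuffled_copy(documents, seed=1234)
run_b = shuffled_copy(documents, seed=1234)
run_c = shuffled_copy(documents, seed=None)  # unseeded -> order generally differs

print(run_a == run_b)  # True: same recorded seed reproduces the same order
print(run_a == run_c)  # almost certainly False: the "same" recipe diverges
```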
Thanks—RedPajama definitely looks like it fits the bill, but it shouldn’t need to bill itself as making “fully-open, reproducible models,” since that’s what “open source” is already supposed to mean. (Unfortunately, the largest model they offer is 7B.)