The "source data" is allegedly 8 trillion tokens. You can't just distribute that like it's source code.
The "binary" is the transformers python code, which in this case is essentially llamav2.
Now, the documentation for this model is inexcusably poor. Hot-dropping random code on a git repo without one drop of human language would be similarly "open source," but it's bad practice, and unfortunately this is the standard in AI Research Land.
I'm not sure open source applies to actual models. Models aren't human readable, so it's closer to a binary blob. It would apply to the training code and possibly data set.
Llama2 is a binary blob pre-trained model that is useful and is licensed in a fairly permissive way, and that's fine.
The code is open source, the model is a data file that the open source code operates on. It's similar to engine recreations for old games (OpenRCT, OpenTTD) that use original, proprietary assets to play the games with their open source engines.
Similar to those games, anyone is also able to distribute their own open data files if they so wish.
It's unlikely anyone actually will start training an open source AI model from scratch because doing so costs insane amounts of money, but the same can be said about the many hours of work recreating game assets can take for open source game engines.
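The engine-vs-assets split above can be sketched in miniature. This is a toy, hypothetical example (a one-neuron "model" and a JSON weights file, nothing resembling any real model format): the point is just that the code is open source while the weights are an interchangeable data file.

```python
import json
import math

# Hypothetical minimal "engine": open source code that runs inference
# over whatever weights file you point it at. The weights themselves
# are just data, like game assets for an open source game engine.

def load_weights(path):
    """Weights are plain data, not code; here, a tiny JSON file."""
    with open(path) as f:
        return json.load(f)

def forward(weights, x):
    """A toy one-neuron 'model': dot product, bias, sigmoid."""
    w, b = weights["w"], weights["b"]
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Anyone can distribute (or swap in) their own openly licensed weights file:
with open("weights.json", "w") as f:
    json.dump({"w": [0.5, -0.25], "b": 0.1}, f)

print(forward(load_weights("weights.json"), [1.0, 2.0]))
```

The engine never needs to know how the weights were produced, which is exactly why the "open engine, proprietary assets" comparison holds.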
The inference and training software used to run the model are open source, the concrete model -- that is, the thing for which the weights are the object code -- is not.
The concrete model is free-to-use closed source, which is better than an undisclosed blob hiding behind a SaaS service, but still not open source.
It's also good that the inference and training code are open source, even though the training data and configuration is not.
Even the checkpoints are provided - for free! All you have to do is ask.
Someone at Facebook spent a ton of money to train a state of the art model, open sourced the code and even provides checkpoints free of charge, and you still complain? The level of entitlement is off the charts…
I agree with you to the extent that, yeah, technically it's not open source because the data is not known. But for these foundation models like Llama, the model structure is obviously known, and pretty sure (didn't check) the hyperparameters used to train the model are known too. The remaining unknown, the data, is pretty much the same for all foundation models: CommonCrawl etc. So replicating Llama once you know all that is a mechanical step, and so it isn't really closed source in a sense. Though probably some new term, "open something", is more appropriate.
The real sauce is the data you fine-tune these foundation models on: RLHF, specific proprietary data for your subfield, etc. The model definition (basically the Transformer architecture plus a bunch of tricks to get it to scale) is mostly all published material, and the hyperparameters used to train the model are less accessible but also part of the published literature; so the data, and (probably) the niche field you apply it to, become the key. Gonna be fun times!
That's a bit like saying FOSS is open source only if a copy of the programmers is supplied with the code.
Everything that has a source, has another source that has produced that source.
The algorithms behind creating LLMs are all published papers for all to read, the libraries (like TensorFlow) are themselves FOSS projects, and the data... is the open web for the most part.
The Wikipedia dump alone is more than enough to get a very decent LLM shaped up.
How an LLM is produced IS NO SECRET. It's just that to produce one you need millions (or, for the more sophisticated ones, billions) in data center fees / power / GPUs to train the model. So even if the training scripts were included, you still couldn't make a Llama model yourself at home.
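The "millions in GPU fees" claim can be sanity-checked with a back-of-envelope calculation using the commonly cited estimate of roughly 6 FLOPs per parameter per training token. The model size, token count, throughput, and price below are all assumed round numbers for illustration, not quotes from any vendor or paper:

```python
# Rough training-cost estimate using the ~6 FLOPs/parameter/token rule.
# All figures below are illustrative assumptions.

params = 70e9        # a 70B-parameter model
tokens = 2e12        # ~2 trillion training tokens
flops = 6 * params * tokens

gpu_flops_per_s = 150e12   # assumed sustained throughput per GPU (~150 TFLOP/s)
gpu_hour_price = 2.0       # assumed $/GPU-hour on rented hardware

gpu_hours = flops / gpu_flops_per_s / 3600
cost = gpu_hours * gpu_hour_price

print(f"{flops:.1e} FLOPs, ~{gpu_hours:,.0f} GPU-hours, ~${cost:,.0f}")
```

Under these assumptions the total lands in the low millions of dollars, which is why nobody retrains this at home even with the scripts in hand.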
>to make the code useful you need data and that data might be free but not open sourced under any license
To use something like Meta's open-sourced Llama 2 model you don't need the data. The model is self-contained: it's a lossy, compressed form of all the data it was trained on.
The weights allow you to continue its training with new data of your choosing.
> The source code would include everything needed to train that model and reproduce it.
You know these models are trained on internet scrape which contains copyrighted content, so the dataset can't be open sourced. It's either this or bad models.
The code will by default auto-download it during the build process. It's about 800 kbytes, which seems very reasonable for something that will reduce the generated code size by gigabytes for a large codebase.
Note that the open-source-ness is dubious... They say it was trained using an optimizer which isn't open source, and that the results are significantly better ('more generalizable') than using the open source one. In my view, if the code that makes a binary blob isn't open source, then the project isn't open source...
Evidence so far would suggest that it's closed source.
The archive downloaded by the installer script does contain some source code, but it's mostly "generic" TensorFlow code, with some Python stubs that call off to native libraries (as you'd expect). It seems like all of the ML Compute stuff is contained within pre-compiled libraries (with some header files provided), but no source code.
I could be wrong here, and it might be that the intention is to open source the ML Compute components, but I don't think that's been done yet.
I did not expect such a response to my post. Thank you for making the issue; I updated it with an MIT licence.
In my defense, this is a repo I cobbled together on the side for my AI group. I did not expect it to get a lot of attention outside the group. Also, the code I've provided is more for educational purposes; I didn't really see people using it in other open source projects. With respect to the data, I provided links to the public sources.
You're right. Either way, it's impossible to recreate Llama 2 without the data set, so perhaps "free to use model" is a better description than "open source model".
Horrible framing. AI is not learning from code. The model is a function. The AI is a derivative work of its training material. They built a program based on open source code and failed to open source it.
They also built a program that outputs open source code without tracking the license.
This isn't a human who read something and distilled a general concept. This is a program that spits out a chain of tokens. It's more akin to a human who copied some copyrighted material verbatim.
> Model weights can't really be described as source code though. The equivalence isn't exact, but I'd describe the weights more as the compiled binary, with the training data & schedule being the source
I think this is a really interesting discussion! I see where you're coming from, but I'm minded to disagree in part.
For one, I think it's possible to release model weights under a liberal licence, yet train on proprietary data. (ChatGPT is trained on oodles of proprietary data, but that doesn't limit what OpenAI do with the model). Normally, obviously, the binary is a derivative work of the source.
Also, the GPL defines source code as 'the preferred form for modification'. I don't disagree that model weights are a black box. But we've seen loads of fine tuning of LLaMA, so we don't always need to train models from scratch.
Ideally, of course, having both unencumbered training data and model weights would be perfect. But in the interim, given I don't have that million dollars, I'll settle for the latter.