The "source data" is allegedly 8 trillion tokens. You can't just distribute that like it's source code.
The "binary" is the transformers python code, which in this case is essentially llamav2.
Now, the documentation for this model is inexcusably poor. Hot-dropping random code on a git repo without one drop of human language would be similarly "open source," but it's bad practice, and unfortunately this is the standard in AI Research Land.
I'm not sure open source applies to actual models. Models aren't human readable, so it's closer to a binary blob. It would apply to the training code and possibly data set.
Llama2 is a binary blob pre-trained model that is useful and is licensed in a fairly permissive way, and that's fine.
The code is open source, the model is a data file that the open source code operates on. It's similar to engine recreations for old games (OpenRCT, OpenTTD) that use original, proprietary assets to play the games with their open source engines.
Similar to those games, anyone is also able to distribute their own open data files if they so wish.
It's unlikely anyone actually will start training an open source AI model from scratch because doing so costs insane amounts of money, but the same can be said about the many hours of work recreating game assets can take for open source game engines.
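The engine-vs-assets split above can be sketched in miniature. This is a toy, hypothetical example (a one-neuron "model" and a JSON weights file, nothing resembling any real model format): the point is just that the code is open source while the weights are an interchangeable data file.

```python
import json
import math

# Hypothetical minimal "engine": open source code that runs inference
# over whatever weights file you point it at. The weights themselves
# are just data, like game assets for an open source game engine.

def load_weights(path):
    """Weights are plain data, not code; here, a tiny JSON file."""
    with open(path) as f:
        return json.load(f)

def forward(weights, x):
    """A toy one-neuron 'model': dot product, bias, sigmoid."""
    w, b = weights["w"], weights["b"]
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Anyone can distribute (or swap in) their own openly licensed weights file:
with open("weights.json", "w") as f:
    json.dump({"w": [0.5, -0.25], "b": 0.1}, f)

print(forward(load_weights("weights.json"), [1.0, 2.0]))
```

The engine never needs to know how the weights were produced, which is exactly why the "open engine, proprietary assets" comparison holds.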
The inference and training software used to run the model are open source, the concrete model -- that is, the thing for which the weights are the object code -- is not.
The concrete model is free-to-use closed source, which is better than an undisclosed blob hiding behind a SaaS service, but still not open source.
It's also good that the inference and training code are open source, even though the training data and configuration is not.
Even the checkpoints are provided - for free! All you have to do is ask.
Someone at Facebook spent a ton of money to train a state of the art model, open sourced the code and even provides checkpoints free of charge, and you still complain? The level of entitlement is off the charts…
I agree with you to the extent that, yeah, technically it's not open source because the data is not known. But for these foundation models like Llama, the model structure is obviously known, and pretty sure (didn't check) the hyperparameters used to train the model are known too. The remaining unknown, the data, is pretty much the same for all foundation models: CommonCrawl etc. So replicating Llama once you know all that is a mechanical step, and so it isn't really closed source in a sense. Though probably some new term, "open something", is more appropriate.
The real sauce is the data you fine-tune these foundation models on: RLHF, specific proprietary data for your subfield, etc. The model definition (basically the Transformer architecture plus a bunch of tricks to get it to scale) is mostly all published material, and the hyperparameters used to train the model are less accessible but also part of the published literature; so the data, and (probably) the niche field you apply it to, become the key. Gonna be fun times!
That's a bit like saying FOSS is open source only if a copy of the programmers is supplied with the code.
Everything that has a source, has another source that has produced that source.
The algorithms behind creating LLMs are all published papers for all to read, the libraries (like TensorFlow) are themselves FOSS projects, and the data... is the open web for the most part.
The Wikipedia dump alone is more than enough to get a very decent LLM shaped up.
How an LLM is produced IS NO SECRET. It's just that to produce one you need millions (or, for the more sophisticated ones, billions) in data center fees / power / GPUs to train the model. So even if the training scripts were included, you still couldn't make a Llama model yourself at home.
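The "millions in GPU fees" claim can be sanity-checked with a back-of-envelope calculation using the commonly cited estimate of roughly 6 FLOPs per parameter per training token. The model size, token count, throughput, and price below are all assumed round numbers for illustration, not quotes from any vendor or paper:

```python
# Rough training-cost estimate using the ~6 FLOPs/parameter/token rule.
# All figures below are illustrative assumptions.

params = 70e9        # a 70B-parameter model
tokens = 2e12        # ~2 trillion training tokens
flops = 6 * params * tokens

gpu_flops_per_s = 150e12   # assumed sustained throughput per GPU (~150 TFLOP/s)
gpu_hour_price = 2.0       # assumed $/GPU-hour on rented hardware

gpu_hours = flops / gpu_flops_per_s / 3600
cost = gpu_hours * gpu_hour_price

print(f"{flops:.1e} FLOPs, ~{gpu_hours:,.0f} GPU-hours, ~${cost:,.0f}")
```

Under these assumptions the total lands in the low millions of dollars, which is why nobody retrains this at home even with the scripts in hand.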
>to make the code useful you need data and that data might be free but not open sourced under any license
To use something like Meta's open-sourced Llama 2 model you don't need the data. The model is self-contained: it's a lossy, compressed form of all the data it was trained on.
The weights allow you to continue its training with new data of your choosing.
> The source code would include everything needed to train that model and reproduce it.
You know these models are trained on internet scrape which contains copyrighted content, so the dataset can't be open sourced. It's either this or bad models.
The code will by default auto-download it during the build process. It's about 800 kbytes, which seems very reasonable for something that will reduce the generated code size by gigabytes for a large codebase.
Note that the open-source-ness is dubious... They say it was trained using an optimizer which isn't open source, and that the results are significantly better ('more generalizable') than using the open source one. In my view, if the code that makes a binary blob isn't open source, then the project isn't open source...
Evidence so far would suggest that it's closed source.
The archive downloaded by the installer script does contain some source code, but it's mostly "generic" TensorFlow code, with some Python stubs that call off to native libraries (as you'd expect). It seems like all of the ML Compute stuff is contained within pre-compiled libraries (with some header files provided), but no source code.
I could be wrong here, and it might be that the intention is to open source the ML Compute components, but I don't think that's been done yet.
I did not expect such a response to my post. Thank you for making the issue; I updated it with an MIT licence.
In my defense, this is a repo I cobbled together on the side for my AI group. I did not expect it to get a lot of attention outside the group. Also, the code I've provided is more for educational purposes; I didn't really see people using it in other open source projects. With respect to the data, I provided links to the public sources.
You're right. Either way, it's impossible to recreate Llama 2 without the data set, so perhaps "free to use model" is a better description than "open source model".
Horrible framing. AI is not learning from code. The model is a function. The AI is a derivative work of its training material. They built a program based on open source code and failed to open source it.
They also built a program that outputs open source code without tracking the license.
This isn't a human who read something and distilled a general concept. This is a program that spits out a chain of tokens. It's more akin to a human who copied some copyrighted material verbatim.
> Model weights can't really be described as source code though. The equivalence isn't exact, but I'd describe the weights more as the compiled binary, with the training data & schedule being the source
I think this is a really interesting discussion! I see where you're coming from, but I'm minded to disagree in part.
For one, I think it's possible to release model weights under a liberal licence, yet train on proprietary data. (ChatGPT is trained on oodles of proprietary data, but that doesn't limit what OpenAI do with the model). Normally, obviously, the binary is a derivative work of the source.
Also, the GPL defines source code as 'the preferred form for modification'. I don't disagree that model weights are a black box. But we've seen loads of fine tuning of LLaMA, so we don't always need to train models from scratch.
Ideally, of course, having both unencumbered training data and model weights would be perfect. But in the interim, given I don't have that million dollars, I'll settle for the latter.