I think they are dodging unclear legal issues surrounding certain steps of the model-building process while being as open as possible with the components given that constraint, allowing downstream users to make their own legal risk vs. effort choices.
The data that the models are built from comes from the public. It'll be a long fight if Big Tech decides that the public's data, however referenced in whatever model, is no longer the public's.
They're fully welcome to keep proprietary the data that is already proprietary. But so far, that's not been their source.
I agree that high-quality datasets, usually proprietary, are key to good model performance. Microsoft's Phi-2 model, for example, is punching way above its weight thanks to being fed high-quality textbooks and encyclopedias. But that's not a total showstopper: since many textbooks come from just a few publishing houses, deals can be arranged, and heavyweights like Microsoft can do that. And a comparatively small model might actually be less likely to reproduce content verbatim even if prompted.
I am more curious whether researchers and private individuals will be permitted to continue uploading models to Huggingface.
The model creators/maintainers will eventually - and probably already do - include bad actors who will train a model to get around whatever safeguards the big players attach to their models.
This is like asking the New York Times to carefully include fact checking metadata to ensure there's no fake news on Reddit. Sure, you'll have fact checking for the NYT, but it's not the source of the problem.
That’s not the same as giving the model to someone and allowing them to build AI-powered tools with it, or to develop alternative models (which is what they’re trying to stifle). It’s less about transparency and more about putting the tools in as many hands as possible.
> More to the point it's clear from watching the activity in the open source community at least that many of them don't want aligned models. They're clambering to get all the uncensored versions out as fast as they can. They aren't that powerful yet, but they sure ain't getting any weaker.
There's a simple explanation for this. Getting the models, which a small startup cannot afford to develop and train itself, is the only way to move forward. To get investment, or before spending their own money, they need at least a proof of concept. Besides, working models are a good learning resource.
The actual quality of the currently public models is why they decided to release them.
The worst case scenario of the currently public models is why they want to take it slow.
It's like the lottery in one of the episodes of the TV show Sliders: you probably won't win, but if you do win, you die.
Unfortunately, most people are really bad at grokking probability and, by extension, risk, especially in scenarios like this.
> Bad actors will have enough money/lack scruples to train their own models or to steal your best ones regardless of how "impact conscious" your company is.
Indeed, totally correct.
But this is also on the list titled "why AI alignment is hard and why we need to go slow because we have not solved these things yet".
Saying "it doesn't matter if we keep this secret, someone else will publish anyway" is exactly the same failure mode as "it doesn't matter if we keep polluting, someone else will pollute anyway".
So people are pretty down on the licensing for this. I agree it is less than ideal, and most certainly isn't open source.
But it is a tremendous step forward. Previously, this type of data was extremely hard to find, and even if you had the budget, it was a long, slow process finding someone who would sell you something to test hypotheses about what models could work.
With this you can work on building models that work, demo them as much as you like, and find a source that lets you train for commercial outcomes. That's very useful. Less useful than properly open data, but useful nonetheless.
People are building and releasing models. There's active research in the space. I think that's great! The attitude I've seen in open models is "use this if it works for you" vs any attempt to coerce usage of a particular model.
To me that's what closed source companies (MSFT, Google) are doing as they try to force AI assistants into every corner of their product. (If LinkedIn tries one more time to push their crappy AI upgrade, I'm going to scream...)
Despite the ethical concerns, it would be very useful if Figma offered an option for large organisations to train an organisation-level model that does not share data with the rest of the world. I believe the opt-in rate would be much higher.
The people who use these open models are doing it because they find them useful. That's already plenty of benefit for them. The "ecosystem play" of benefiting from volunteers' mods to open models is certainly a benefit for the model trainer. This fact doesn't eliminate the benefit of people being able to use good models.
As I understand it they have the input data, but next up they are creating the model. I could make a joke about drawing an owl ... but that would be a bit mean. I am really glad people are working on this.
I wonder... who is paying? Will there be restrictions like ethics clauses and suchlike? Not necessarily a bad thing if they do. Will there be restrictions on commercial use?
Because, more than making a tool, I wanted to strike up a conversation about what unfettered access to models like this will mean and how we should handle it.
Legal butt-covering in two forms, I'd guess. They said a lot in detail about the early models' training data sources, which could be used against them in spurious but costly legal attacks. And the old models aren't quite as heavily aligned/censored as the new ones.