I think they are dodging unclear legal issues surrounding certain steps of the model-building process while being as open as possible with the components given that constraint, allowing downstream users to make their own legal risk vs. effort choices.
The data that the models are built from comes from the public. It'll be a long fight if Big Tech decides that the public's data, however referenced in whatever model, is no longer the public's.
They're fully welcome to keep proprietary the data that is already proprietary. But so far, that's not been their source.
I agree that high-quality datasets, usually proprietary, are key to good model performance. Microsoft's Phi-2 model, for example, is punching way above its weight thanks to being fed high-quality textbooks and encyclopedias. But that's not a total showstopper: since many textbooks come from just a few publishing houses, deals can be arranged, and heavyweights like Microsoft can do that. And a comparatively small model might actually be less likely to reproduce content verbatim even if prompted.
I am more curious whether researchers and private individuals will be permitted to continue uploading models to Huggingface.
The model creators/maintainers will eventually - and probably already do - include bad actors who will train a model to get around whatever safeguards the big players attach to their models.
This is like asking the New York Times to carefully include fact checking metadata to ensure there's no fake news on Reddit. Sure, you'll have fact checking for the NYT, but it's not the source of the problem.
That’s not the same as giving the model to someone and allowing them to build AI-powered tools with it, or to develop alternative models (which is what they’re trying to stifle). It’s less about transparency and more about putting the tools in as many hands as possible.
> More to the point it's clear from watching the activity in the open source community at least that many of them don't want aligned models. They're clambering to get all the uncensored versions out as fast as they can. They aren't that powerful yet, but they sure ain't getting any weaker.
There's a simple explanation for this. Getting the models, which a small startup cannot afford to develop and train itself, is the only way to move forward. To get investment, or before spending their own money, they need at least a proof of concept. Besides, working models are a good learning resource.
The actual quality of the currently public models is why they decided to release them.
The worst case scenario of the currently public models is why they want to take it slow.
It's like the lottery in one of the episodes of the TV show Sliders: you probably won't win, but if you do win, you die.
Unfortunately, most people are really bad at grokking probability and, by extension, risk, especially in scenarios like this.
> Bad actors will have enough money/lack scruples to train their own models or to steal your best ones regardless of how "impact conscious" your company is.
Indeed, totally correct.
But this is also on the list titled "why AI alignment is hard and why we need to go slow because we have not solved these things yet".
Saying "it doesn't matter if we keep this secret, someone else will publish anyway" is exactly the same failure mode as "it doesn't matter if we keep polluting, someone else will pollute anyway".
So people are pretty down on the licensing for this. I agree it is less than ideal, and most certainly isn't open source.
But it is a tremendous step forward. Previously, this type of data was extremely hard to find, and even if you had the budget, it was a long, slow process finding someone who would sell you something to test hypotheses about what models could work.
With this you can work on building models that work, demo them as much as you like, and find a source that lets you train for commercial outcomes. That's very useful. Less useful than properly open data, but useful nonetheless.
People are building and releasing models. There's active research in the space. I think that's great! The attitude I've seen in open models is "use this if it works for you" vs any attempt to coerce usage of a particular model.
To me that's what closed source companies (MSFT, Google) are doing as they try to force AI assistants into every corner of their product. (If LinkedIn tries one more time to push their crappy AI upgrade, I'm going to scream...)
Despite the ethical concerns, it would be very useful if Figma offered an option for large organisations to train an organisation-level model that does not share data with the rest of the world. I believe the opt-in rate would be much higher.
The people who use these open models are doing it because they find them useful. That's already plenty of benefit for them. The "ecosystem play" of benefiting from volunteers' mods to open models is certainly a benefit for the model trainer. This fact doesn't eliminate the benefit of people being able to use good models.
As I understand it they have the input data, but next up they are creating the model. I could make a joke about drawing an owl ... but that would be a bit mean. I am really glad people are working on this.
I wonder... who is paying? Will there be restrictions like ethics clauses and suchlike? Not necessarily a bad thing if they do. Will there be restrictions on commercial use?
Because, more than making a tool, I wanted to strike up a conversation about what unfettered access to models like this will mean and how we should handle it.
Legal butt-covering in two forms, I'd guess. They said a lot in detail about the early models' training data sources, which could be used against them in spurious but costly legal attacks. And the old models aren't quite as heavily aligned/censored as the new ones.