Google pays publishers to test AI tool that scrapes sites to craft new content (www.adweek.com)
5 points by vincent_s on 2024-02-28 | 72 comments




A friend told me, even a few years before ChatGPT, that he used AI to generate the daily horoscopes in his online newspaper, using far simpler models.

Nobody ever noticed.


Well, horoscopes do tend to be general enough that they might as well apply to nearly everybody, so I'd believe it.

Didn't they notice the accuracy of the predictions went down?

/s


I did. I was supposed to be frozen solid underwater, and burned to a crisp. None of that happened.

On average so...

one day i had a fortune cookie tell me something nasty like that

"complete loss, no prosperity, utter ruin. " !?


The AI might as well look like this

   old_horoscopes[rand() % N]
and nobody will ever notice.

Content “mixers” that stitched together copy and rearranged sentences were popular in black hat SEO for years. Early GPT models have been used to generate SEO and marketing copy since GPT-2.

This has been going on for a long time. The same low-quality sites that you ignored 2 years ago are just getting slightly better content that you’ll never read.


Text generation using word/sentence/phrase-level ad-libs is so simple it's a common homework assignment for first-year programmers. You really don't need much.

There are even tools like Tracery that let you construct grammars and vocabularies as a form of textual generative art.
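
For illustration, here is a toy Tracery-style expander in Python (a sketch of the idea, not the actual Tracery library; the grammar contents are made up):

    import random

    # A toy grammar: "#symbol#" placeholders are expanded recursively,
    # Tracery-style, by picking a random rule for each symbol.
    GRAMMAR = {
        "origin": ["Today, #mood# energy surrounds your #topic#."],
        "mood": ["radiant", "uncertain", "bold", "quiet"],
        "topic": ["career", "friendships", "finances", "creative side"],
    }

    def expand(symbol, grammar):
        rule = random.choice(grammar[symbol])
        # Replace each "#name#" placeholder with a recursive expansion.
        while "#" in rule:
            start = rule.index("#")
            end = rule.index("#", start + 1)
            name = rule[start + 1:end]
            rule = rule[:start] + expand(name, grammar) + rule[end + 1:]
        return rule

    print(expand("origin", GRAMMAR))  # e.g. "Today, bold energy surrounds your finances."

A few dozen lines like this will happily churn out endless horoscope-grade copy.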


20 years ago (HTML was still king, JavaScript hardly used) I worked on a horoscope product. The data provider had all the daily/weekly/monthly horoscopes for the coming year ready. Whatever their process was (manual, I assume), it must have been straightforward.

"Handmade" or "manmade" is about to have a whole new relevance on the internet.

I'm sure I'm in the minority, but the mere suggestion that some purportedly creative "content" (ugh, that word sounds like saying "sausage meat") is AI-generated makes me completely lose interest. Soon enough that might make me lose interest in whole swathes of the internet, but I can't deny that's been an ongoing process for a while now anyway.


Nothing wrong with AI generated content if it's info I need and it's factually correct.

The problem I have is that it's usually junk fluff pieces with way too much text padded in for SEO.

I feel anything computer-generated should be short bullet points or comparison tables, not long-form text, as it's just not good at that.

For example, think of a blog post where the author walks you through their thought process, trial and error, and ultimate solution. That is very human and relatable, something an LLM can approximate or copy but never make genuine. In that case I'd prefer just the raw solution, not a fake and deceptive walkthrough.


I fully expect the norm to become some LLM layer that takes such fluff and condenses it into fact-dense text, maybe as a browser plugin.

If that happens the next generation of SEO will be creating content designed to bubble up through that layer.

Of course! But at least there will be less text to skim through.

Except translated into other languages with zero effort by the publisher.

I agree with the sentiment; at the same time, commercial drivers will fill the pipes with low-quality content quickly, because $MONEY. The literate, subtle, and artsy are once again buried in commercial content. The only way forward is rebuilding chains of editorial recommendation. Ordinary browsing will become a junk-fest. It appears Pandora's box has been opened once again.

What was really jarring recently was to browse literate, subtle, artsy content only to be repeatedly interrupted by genuinely offensive and/or borderline-scam ads, then dropped back into the chosen content. For real.


Interesting, for me this is only true as long as the AI generated content is worse.

There are some things where I really care about the human experience, but for much content all that matters to me is whether it's a joy to ride.


Yeah, right now most human-written content isn't that good either. Quality writing has largely been abandoned in favor of verbosity and formality; it offends no one and often lacks substance or that human touch. I'm guessing AI content will be about as flavorless. But time will tell.

Yeah. It is difficult to see AI supplanting humans for the things you go outside for, but any human involvement on the internet has always just been an implementation detail.

After their recent over-censorship of Gemini, I look forward to Google's fair, even-handed, completely unbiased approach to the delicate topic of race relations and tensions as the world's population slowly seems to descend into more hatred for each other over silly differences. /s

agreed, but i see this as an opportunity for creative people to make art and communicate things that a computer can't. i'm skeptical that an AI can write a great novel about matters of the heart, one that feels like it's alive (not without trying). i think, in the age of generic AI content, some people will be seeking authenticity even more than we do now.

Creative people hate this trend and don't work that way (see possible optimizations via the grateful computer overlords). Source: I know a lot of artists and writers.

The creative classes already trusted this stuff the first time around, in 2008 and after, and it ate their lunch.


i think you're missing what i said.

i'm a writer. the only thing i can do is keep making my art with insight into the human condition, an insight that an ai can try to mimic but will never be able to replace. and i think that people will seek out human-created art because they're bored by AI mimicry; they'll want a real connection with a piece of art or a novel, etc.

the real ones will continue to make their art regardless of market success, a payday, or technological changes


I can really understand the fear of having your paychecks stop. Many artists will continue to be artists, but some found jobs doing what they love, and there is a shakeup going on. Everything will settle at some point, but the transition will be rough for many.

you're right. it's hard for me to be sympathetic towards that though, because i chose not to make writing my day job.

but my partner is a journalist and we're not over here freaking out. perpetual layoffs have always been a problem in media during our lifetime. we'll adapt.


And new ones are created. I started selling AI art but never once sold a piece of human-made art. Art is evolving, and so must artists.

Yea all creative people are the same. No creative person would ever be interested in new mediums of expression.

The truly creative are just looking to do the same things over and over lol.


So as long as creative people can overturn the momentum of the largest corporations sinking billions of financial and political capital into making AI synonymous with computing, we'll be OK? /s


If the content is actually intended to be informative, then it's a poisoned well unless it's been manually vetted for hallucination. If it's entertainment though, then if it works it works.

We may be in the minority, but there are plenty of people who share our opinion on this.

Whole swaths of the internet became uninteresting to me a long time ago. AI promises to make the rest of it worthless to me as well.

The fact that AI models are being trained on public-facing web content without the consent of the authors has already driven me and others to remove all our content from the public-facing web, as there is no other way to protect ourselves. That eliminates a large part of the value of the web right there.


What strategies did you use? Other than quitting social media, blocking crawlers and enforcing an "email-wall" is what I just started to do with my personal website. I feel that it's hard to stay on top of "pretty please" robots.txt requests, especially with new, undeclared ones. I prefer word-of-mouth diffusion of links, skipping the algorithmic promotion game altogether. I decided that I won't ever rely on my online writings, recordings, or photography as main income again (I did that once: it worked, but it felt gradually poisoned by the need to build engagement.) But it feels weird that we got to the point where I have to shield a blog from bulimic information machines that pollute valuable knowledge, and I wonder what the right strategy is.
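
For reference, the "pretty please" approach looks roughly like this in robots.txt (GPTBot, Google-Extended, and CCBot are a few of the publicly documented AI-crawler user agents; the list goes stale quickly, which is exactly the problem):

    # Politely ask known AI-training crawlers to stay out.
    # Compliance is entirely voluntary, and new or undeclared bots won't match these rules.
    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # Everyone else (regular search engines, feed readers, etc.)
    User-agent: *
    Allow: /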

> What strategies did you use?

I see no other strategies that could possibly work. All it takes is one crawler to get through or ignore your defenses and your work has been used to help train a model. At that point, the horse is out of the barn. Any partial defense seems the same as no defense to me on this point.

But it depends on what you're trying to protect. My concern isn't actually that my content may get reproduced through an LLM. I just don't want any of my work to be used to develop or improve these models, period.


Yep, we agree on the ethical stance. It's not that the uniqueness of my work is lost or anything like that. It's the choice not to feed them.

You might be in the minority in the context of this site's bubble but I don't think you're in the minority overall.

The best solution I've seen so far is to make the publisher liable for "generative" content if they can't point to a human creator. Also, deny any copyright protection unless you can point to a transformative human creator. There might be people "faking" the creation, but then it's not a corp, just a human.

I was thinking just the other day that a recognisable 'AI Free' badge on sites that want to assure people that none of their content is AI generated could gain a lot of traction.

If the badge ever gained traction, every garbage AI site would slap one on.

The web has been headed down this road for years. We've all seen reputable blogs sink to publishing "Here are our favourite smart speakers for Xmas", to keep the lights on. What is that but a soulless piece of data-driven sponsored content pushing products with the highest affiliate payout rates?

Google only cares about these types of sites now because they are the ideal customer for Google Ads. They churn out content constantly and advertise heavily, with Google getting a cut of every click and impression.

And let's face it, YouTube is basically QVC for electronics, toys and makeup, all multi-billion $ industries that Google is happy to slap ads on top of.


The era of the artisanal internet is upon us! Everything AI, and a couple of folks still doing it like in the good old days, haha.

I haven't touched the spice in years, but I foresee some kind of content Butlerian Jihad very soon. It's already started with music, where YouTube is filled with AI-generated 'songs' from e.g. Sia, so much so that I stopped using YouTube for anything but recordings of live shows, and I guess soon that will be enshittified too. Bye, YouTube.

Maybe it's very little in the scheme of things, but they're not getting a 'view' or click from me.


What happens once the stipend goes away? What happens when the tool stops being “free”? I know these publishers are desperate for lifelines at the moment but I hope they are thinking ahead to the time that Google no longer needs them or their content.

More like: what if, long term, Google cuts out the publishers? I mean, when someone googles X, just show or generate the article right then. They only need the publishers now to get a human feedback loop.

They still want social and Kagi traffic. But the lack of Google traffic will turn on them.

There is exactly one form of organization that wants the sort of Gleichschaltung that comes from this sort of tool, and it is not funded voluntarily.

Google is doing the thing that it punishes other publishers for doing? Whaaaaa? /s

I don't really see how this isn't just content theft at this point. Pointing at "inspiration sites" and just rewording their content feels pretty scummy at best.

At what point are content creators and publishers going to be paid, or given the option to block, the AI scraping tools that so blatantly use their material to generate income for other companies and publishers?


You wouldn't download a car.

> Pointing at "inspiration sites" and just rewording their content

Sounds like Reddit, minus the AI part. Any time someone there might be on the verge of an original thought, it gets shot down with something akin to "[citation needed]", followed by strong social pressure for the user to stop participating if they can't manage to just reword "inspiration sites".

What's special about AI here?


Scope

What kind of scope?

Why are they funding this? I don't want to read AI-generated news.

There are some sites that don't use AdSense.

I guess Google wants ad farms with AdSense copying their content?


There is no way this behaviour would be illegal, right?

> I don't want to read AI-generated news.

I didn't think you were right, but you are:

> create aggregated content more efficiently by indexing recently published reports generated by other organizations, like government agencies and neighboring news outlets, and then summarizing and publishing them as a new article.


Would anyone? They don't care. It's so beyond the pale now it's almost gone into satire.

What is being scraped:

> using factual content from public data sources—like a local government’s public information office or health authority.

edit: what is also being scraped:

> like government agencies and neighboring news outlets

I yield to the pessimists.


> I yield to the pessimists.

To give you an optimistic perspective: I don't think it qualifies as pessimism to point out obvious issues with some of the "approaches" currently developing in this field. If enough people do it, maybe something will change for the better?


I don't think this is a new problem. It was solved with copyright laws, and you can now enjoy original output. The law just needs to catch up (and hopefully quickly).

I wonder who curates the AI blocklist, and how skipping sites will affect the internal bias.

Can't wait for Google to die out. Glad to see they are on their way.

Ha Ha! This is the exact sentiment HN had about Meta/Facebook about 18 months ago.

This entire thread is a reminder of how out of touch and naive most HN commentary is when it strays outside its very narrow field of expertise.


Facebook usage/image has dropped. Fights with Congress. The average person is moving to other socials. Mark's planned run for president is dead. The metaverse stalled. Still, they have plenty of money, ad revenue, and other social networks they control.

IBM is still around. It took Yahoo forever to die. The corpses of AltaVista and Ask Jeeves still make money.



> It’s hard to argue that stealing people’s work supports the mission of the news. This is not adding any new information to the mix.

Not so different from current publications. Most human-written news is also based on repackaging Twitter and press agencies; few journalists actually gather original content. It should be OK to copy news information from any source without consent; it's factual data, not creative work.


Ironically I had a side project that essentially summarized stories from RSS feeds and reposted them on my own blog. I provided full attribution and links to the original articles.

Applied for AdSense and was denied as a low-quality website. Not that I expected to make much anyway.

Now basically the same thing is an official Google project.
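
For the curious, the whole thing was roughly this shape (a simplified sketch, not the actual code; it assumes the feedparser library and fakes the "summary" step by keeping the first couple of sentences):

    import feedparser  # pip install feedparser

    # Placeholder feed URLs for illustration only.
    FEEDS = [
        "https://example.com/news/rss",
        "https://example.org/blog/feed.xml",
    ]

    def naive_summary(text, sentences=2):
        # Stand-in for a real summarizer: keep the first few sentences.
        parts = text.replace("\n", " ").split(". ")
        return ". ".join(parts[:sentences]).strip()

    def build_posts(feeds):
        posts = []
        for url in feeds:
            feed = feedparser.parse(url)
            for entry in feed.entries[:5]:
                posts.append({
                    "title": entry.get("title", "Untitled"),
                    "summary": naive_summary(entry.get("summary", "")),
                    "source": entry.get("link", url),  # full attribution + link back
                })
        return posts

    if __name__ == "__main__":
        for post in build_posts(FEEDS):
            print(post["title"])
            print(post["summary"])
            print("Source:", post["source"])
            print()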


It's cool when they do it, it's a problem when you do it

Podcast: Why Google Is Shit Now

This week we go long on Google News and AI, and why Google Search is worse now. We also discuss a phone spy tool that can monitor billions.

https://www.404media.co/podcast-why-google-is-shit-now-404-m...


This seems like a great way to poison the well. But maybe Google already plans on search dying out and they are scrambling to find the next big thing?

I could see this going horribly wrong, but honestly an LLM that scrapes local government data sources, police blotters, etc., highlights important or interesting information, and makes it understandable sounds pretty great.
