
Various bits of the lowercase semantic web (and its uppercase Semantic Web cousin) show through on Google results - mostly Google Rich Snippets, such as review metadata. These mostly require human intervention to enable on a site-by-site basis (sometimes by the site owner, sometimes by humans at Google itself).

The deeper problem with the semantic web is that it's metadata, which tends not to be visible to visitors. Because of that, it's easy to forget, overlook, or leave incomplete, and it lets non-whitehat SEOers do invisibly the kind of over-optimisation they would otherwise have to do in full view of the human audience.

I wish it wasn't true, but Cory Doctorow's "metacrap" portmanteau is unfortunately still accurate ( http://www.well.com/~doctorow/metacrap.htm ). We have made several small metadata improvements over the years, but not much progress in substantially overturning Doctorow's original concerns.

Human beings are not great sources of accurate and up-to-date metadata, so we rely on scripts and services to fill in the blanks we leave (publish times being recorded automatically by WordPress, for example).

Converting human content into metadata ends up being an automated, hands-off affair, which means those automated tools need to be able to parse and extract information and meaning from human prose. This is very much what Google has been doing since its inception.

It's flawed because it relies on automated interpretation of prose, but there isn't a viable non-flawed way of getting the same information without imposing academic-like constraints on the Web.
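
(A minimal sketch of the "fill in the blanks" idea, assuming BeautifulSoup and dateutil and a made-up HTML snippet: prefer the explicit timestamp a CMS like WordPress recorded, and only fall back to fuzzily guessing a date out of the prose - the fallback being exactly the flawed, automated-interpretation part.)

  # Prefer explicit CMS-recorded metadata; fall back to guessing from prose.
  from bs4 import BeautifulSoup
  from dateutil import parser as dateparser

  html = """
  <html><head>
    <meta property="article:published_time" content="2014-06-03T09:15:00Z">
  </head><body><p>Posted on June 3rd, 2014 by someone.</p></body></html>
  """

  soup = BeautifulSoup(html, "html.parser")
  meta = soup.find("meta", attrs={"property": "article:published_time"})

  if meta and meta.get("content"):
      # The reliable path: a timestamp the CMS recorded automatically.
      published = dateparser.parse(meta["content"])
  else:
      # The flawed path: fuzzy-parse a date out of the human prose and hope.
      published = dateparser.parse(soup.get_text(), fuzzy=True)

  print(published)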




As always, one should look at Metacrap (http://www.well.com/~doctorow/metacrap.htm) when discussing the semantic web.

- Certain kinds of implicit metadata is awfully useful, in fact. Google exploits metadata about the structure of the World Wide Web: by examining the number of links pointing at a page (and the number of links pointing at each linker), Google can derive statistics about the number of Web-authors who believe that that page is important enough to link to, and hence make extremely reliable guesses about how reputable the information on that page is.

This sort of observational metadata is far more reliable than the stuff that human beings create for the purposes of having their documents found. It cuts through the marketing bullshit, the self-delusion, and the vocabulary collisions.

in short, engineering triumphs over data entry.
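
(A toy sketch of that "observational" link metadata in code: a simplified PageRank-style power iteration over a made-up link graph, in Python. It's an illustration of the idea, not Google's actual ranking.)

  # Reputation derived from who links to whom, not from self-description.
  links = {
      "a.example": ["b.example", "c.example"],
      "b.example": ["c.example"],
      "c.example": ["a.example"],
      "d.example": ["c.example"],
  }

  damping = 0.85
  pages = list(links)
  rank = {p: 1.0 / len(pages) for p in pages}

  for _ in range(50):  # power iteration until (roughly) converged
      new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
      for page, outgoing in links.items():
          share = rank[page] / len(outgoing)
          for target in outgoing:
              new_rank[target] += damping * share
      rank = new_rank

  for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
      print(f"{page}: {score:.3f}")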


I'm still waiting for a comprehensive rebuttal to Cory Doctorow: http://www.well.com/~doctorow/metacrap.htm

Just to clarify - I don't think his arguments demolish all aspects of the case for 'the semantic web' (however ill-defined that term is) but if he's right then it severely circumscribes the kind of content that will ever have useful metadata.

At the same time, we are getting better at inferring context without needing metadata. There is so much more data coming from this source (i.e. the "sod metadata, let's guess" methodology) than from 'intentional' semantic sources.

So - semantic markup will never be of use unless the content is coming from a source where the metadata already exists. It will largely be useful to 'database'-style sites rather than 'content'-style sites. Think directories and lists rather than blogs and articles.

(Question to front-end types. Are people still agonising over section vs aside, dl/dt vs ul/li under the impression that it makes any damn difference? Angels dancing on the head of a pin...)


Semantic Web proposals are all vulnerable to the more general "metacrap" problem:

https://people.well.com/user/doctorow/metacrap.htm

Or, as we call it in these modern times, SEO.

There are workarounds for these issues, but they are societal and institutional, and right now the threat from (and rewards for) creating and disseminating misinformation and disinformation continues to rise steadily.


Have you seen this classic piece from 2001? http://www.well.com/~doctorow/metacrap.htm

It argues quite convincingly why the semantic web is useless.


I'm not sure why you were downvoted. "Semantic Web" was the first thing that came to my mind after reading the first couple paragraphs of the article. I thought he was going to head that direction as well. I was sorely disappointed!

There are surely diminishing returns for doing increasingly sophisticated things with the contents of HTML tags to parse and understand webpages, using inbound links to rank them, etc.

Cory Doctorow's essay, "Metacrap," does a great job of listing the reasons a Semantic Web-style metadata attempt will always fail when left to the "public" to implement. One thing that the old human-run Yahoo! and the Open Directory Project did get right is the quality of results, but since updates are made at the speed of humans, they seem to be pretty much impossible to keep current.

Perhaps there is some neat way to use everyone's browsing histories to create a semantic link between content on the web. But that will never happen because of (extremely valid) privacy concerns.

Well, shame on the author for writing such a myopic rant piece containing no new ideas or proposals.


Well, thereby hangs a tale. But the short answer is that, by and large, Google seems to be aggressively uninterested in scraping that kind of structured/semantic data from the public Web. In fact it even seems to have been active in trying to prevent that kind of data from finding its way into webpages at all: remember the arcane but nasty scrap about metadata in HTML5? Well, this is what that was all about.

Hardly. The promise is still there, but there are barriers in place to get there.

One of the most useful aspects of the semantic web is how it enhances the search for information. Some web citizens have become conditioned to see Google as the pinnacle of what we can achieve through search, but we can do a lot better. Let's use an example to illustrate this. Imagine a presidential election is taking place and you want to understand the positions of the candidates on topics that matter to you - say foreign policy, including their proclivity for war. By allowing search over a richer set of metadata, you can more easily access information about the positions of these candidates, without the distortions of Google's ranking algorithms. Think of it as treating the information on the web as a database you can query directly. That's the main promise of the semantic web.
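
(To make that concrete: a hedged sketch of "querying the web like a database" against Wikidata's public SPARQL endpoint, using the SPARQLWrapper Python library. The query - holders of the office "President of the United States" and their parties - is just a stand-in for the kind of structured question described above, and the P/Q identifiers are quoted from memory, so verify them before relying on this.)

  from SPARQLWrapper import SPARQLWrapper, JSON

  # Public endpoint over Wikidata's structured data; a custom agent string
  # is polite and sometimes required.
  sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                         agent="semantic-web-sketch/0.1")
  sparql.setQuery("""
      SELECT ?personLabel ?partyLabel WHERE {
        ?person wdt:P39 wd:Q11696 .   # position held: President of the United States
        ?person wdt:P102 ?party .     # member of political party
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
      }
      LIMIT 20
  """)
  sparql.setReturnFormat(JSON)

  results = sparql.query().convert()
  for row in results["results"]["bindings"]:
      print(row["personLabel"]["value"], "-", row["partyLabel"]["value"])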


Citation needed. To me, all this semantic web stuff sounds exactly like meta tags in the old days - and the big difference between google vs lycos/yahoo/etc. was that it ignored meta tags completely.

Wow, this feels like something out of the mid-2000s.

The problem with the semantic web is that the incentives are mostly wrong. People make websites to be viewed by people. Doing the semantic stuff is work that can be difficult to get right, and if it doesn't bring you visitors, why would (most) people bother?

Sure, you can get fancy tools that use the data (let's ignore the scaling issues that many of them have), but fancy tools separate the data from its context, further reducing incentives (for many content creators). If they ever did take off, we would have massive spam/quality problems, because we would have separated the data from the website and all its visual indicators of how spammy it is - which is perfect for dark SEO and other spammers/phishers.

For that matter, just look at metadata on the web in general and what a mess that is. <meta name=description (or keywords) - spammers took over and nobody uses them. <link rel="next" - I think old Opera is the only thing to ever do anything with that metadata.

The only metadata systems that have ever worked are the ones where the site author gets something out of it: e.g. Technorati tags, <link>s to RSS feeds, Facebook Open Graph, various Google things, etc. Or, on the other side, when metadata is their whole reason for being, like https://wikidata.org and maybe some GLAM stuff. Everyone making arbitrary metadata out of the goodness of their hearts, and having it be of consistent quality and meaning, is a pipe dream.
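
(For what it's worth, here's a rough Python sketch of harvesting exactly the metadata that did get adopted - feed autodiscovery <link>s and Open Graph tags - using requests and BeautifulSoup. The URL is a placeholder.)

  import requests
  from bs4 import BeautifulSoup

  resp = requests.get("https://example.com/", timeout=10)  # placeholder URL
  soup = BeautifulSoup(resp.text, "html.parser")

  # Feed autodiscovery: authors add this because feed readers reward it.
  for link in soup.find_all("link"):
      rels = link.get("rel") or []
      kind = (link.get("type") or "").lower()
      if "alternate" in rels and ("rss" in kind or "atom" in kind):
          print("feed:", link.get("href"))

  # Open Graph: added because link previews on social sites reward it.
  for meta in soup.find_all("meta"):
      prop = meta.get("property") or ""
      if prop.startswith("og:"):
          print(prop, "=", meta.get("content"))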

Not to mention the negative incentive of obliterating the walled garden, which, as much as it sucks, is something the corporate overlords like a lot.


This is one of the problems, and it might be the biggest problem with the semantic web, but it's not the only problem. Non-experts are bad at producing remixable content.

* A significant number of people do not really use file names. https://jayfax.neocities.org/mediocrity/gnome-has-no-thumbna...

* Twitter offers no official way to make bold or italic text, but some people make it work using math symbols. They either don't know, or don't care, about all the accessibility and search tooling this breaks.

* Try downloading a C program off the internet and compiling it on anything other than the original developer's computer.

In none of these cases does anybody benefit from the lack of usable metadata, yet nonetheless the metadata doesn't exist.


Semantic Web lost itself in fine details of machine-readable formats, but never solved the problem of getting correctly marked up data from humans.

In the current web and apps people mostly produce information for other people, and this can work even with plain text. Documents may lack semantic markup, or may even have invalid markup, and have totally incorrect invisible metadata, and still be perfectly usable for humans reading them. This is a systemic problem, and won't get better by inventing a nicer RDF syntax.

In language translation, attempts at building rigid, formal grammar-based models have failed, while throwing lots of text at machine learning has succeeded. The Semantic Web is most likely doomed in the same way. GPT-3 already seems to have more awareness of the world than anything you can scrape from any semantic database.


"Metadata" works all the time. SMTP is chock full of metadata. HTTP is chock full of metadata.

Doctorow appears to be talking about a very specific kind of metadata: industry standard XML, in situations where people aren't actually exchanging data. The kind of thing that was pushed by the Semantic Web project, which failed pretty badly.

I was involved in Semantic Web, and I've got my own reasons why I thought it failed. Some of them align with Doctorow's reasoning; others don't. It sure didn't help that Semantic Web was neither semantic nor web; bad naming doesn't have to be a problem but it invited difficulty in agreeing on what they were talking about.

And some of it was that Big Data happened along just when they were supposed to be getting going. In theory Semantic Web stuff should have obviated a lot of the data cleaning that goes into Big Data, and made a lot more information available, but Big Data gave sexier results than Semantic Web, and people just didn't want to work on the latter. Why grind out standards when the neurons can just, ya know, figure it out, kinda sorta maybe?

The fact is that data standards work all the time. "Metadata" as Doctorow means it is just one narrow way of defining standards, and that too exists all over the place. Any time people need to exchange data, they set up a standard. It's boring and frustrating and people hate it, and the results are usually bad, but when people need to, it works.

What Semantic Web hoped for was that people would just voluntarily come up with standards and throw their data out there in hopes that somebody else would use it, and that their competitors would use the same standard. Doctorow was right that that was hard, for all kinds of reasons, and rarely actually happens -- especially not with the mechanisms that XML supplies, even supplemented with RDF and OWL.

But in a lot of ways, this essay comes down to Standards Are Hard. Which is true and non-controversial.


Ah yes, the semantic web. Remember when that was going to change everything? All it required was for everyone on the web to meticulously format their data into carefully structured databases. I can't imagine why it never gained much traction.

As an interesting side note, part of the reason for Google's success is that, through the PageRank algorithm, they were able to extract highly important implicit data on the relative popularity, authoritativeness, and context of links, rather than being forced to rely only on explicit data.

I will make the claim that in the future devices and systems which work on implicit context and metadata are going to be more successful at these sorts of high level pseudo-cognitive capabilities than anything which is dependent on people changing the way they do everything.


Why is "metacrap" a problem for the semantic web, but not for data-Wikipedia?

The Shirky article is a well-known strawman.

Thanks for the pointer to the last one, I'll read it when I get a chance.


The semantic web relies on people not lying. Unfortunately, meta tags were instantly filled with seo spam as soon as they were implemented. It's a trusted client approach to data integrity.

The semantic web is now integrated into the web and for the most part it's invisible. Take a look at the timeline given in this post: https://news.ycombinator.com/item?id=3983179

Some of those startups exited for hundreds of millions, providing, for example, the metadata in the right hand pane of Google search.

The new action buttons in Gmail, adopted by Github, are based on JSON-LD: https://github.com/blog/1891-view-issue-pull-request-buttons...

JSON-LD, which is a profound improvement on and compatible with the original RDF, is the only web metadata standard with a viable future. Read the reflections of Manu Sporny, who overwhelmed competing proposals and bad standards with sheer technical power: http://manu.sporny.org/2014/json-ld-origins-2/
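
(For a concrete flavour, a hedged sketch of the schema.org-style JSON-LD behind those action buttons: an EmailMessage carrying a ViewAction, built as a plain Python dict. Field names follow schema.org's ViewAction as I remember it and the URL is made up, so treat this as illustrative rather than copy-paste markup.)

  import json

  # Illustrative only: the pattern used for the Gmail/GitHub "view" buttons.
  action_markup = {
      "@context": "http://schema.org",
      "@type": "EmailMessage",
      "description": "View the pull request on GitHub",
      "potentialAction": {
          "@type": "ViewAction",
          "name": "View Pull Request",
          "target": "https://github.com/example/repo/pull/1",  # made-up URL
      },
  }

  # Ships inside a <script type="application/ld+json"> block in the email or page.
  print(json.dumps(action_markup, indent=2))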

There's really no debate any more. We use the technology borne by the "Semantic Web" every day.


Well, imagine you're right. What then? Once you have all the magical, microformatted metadata what is it exactly you do with it?

The canonical semantic web idea is that you build ontologies and search them. Now you have a new problem. How do you know which metadata is trustworthy? You'd be a fool to think people aren't going to attempt to game the system. If I search for "all articles about Lucy Liu", how is it that your semantic web search engine filters articles that are really about Lucy Liu from those that are merely marked up to make them appear so? You've just substituted a problem that requires as much intelligence as adding the metadata in the first place. Hell, what am I saying? Trust is a far harder problem. Even Google haven't cracked that one yet.


Ah, I think we might be talking about different things. I think the larger promise of the semantic web is a categorically different thing from adding a bit of metadata to pages to capture basic things like author, content type, description, etc.

It's the latter that I think is clearly valuable, in order for us to have competition for the likes of Google and Facebook. It lowers the barrier to creating competing search engines, modern RSS readers, and even things like distributed social networks.


I got about halfway down, and it suddenly started sounding like the Semantic Web reincarnated as an API service. This idea crops up about once every 10 years (in some form or another), and it runs into fairly predictable problems. There are two good essays that I recommend that people read before getting too excited about how well machines can interoperate without humans in the loop.

Shirky's "The Semantic Web, Syllogism, and Worldview" http://www.shirky.com/writings/herecomeseverybody/semantic_s...

Doctorow's "Metacrap: Putting the torch to seven straw-men of the meta-utopia" http://www.well.com/~doctorow/metacrap.htm

Doctorow talks about problems with metadata, but these problems might apply equally to APIs and the API vocabulary discussed in the article. Specifically:

2.1 People lie
2.2 People are lazy
2.3 People are stupid
2.4 Mission: Impossible -- know thyself
2.5 Schemas aren't neutral
2.6 Metrics influence results
2.7 There's more than one way to describe something

The fundamental problems are that (1) getting people to agree on things is a surprisingly difficult and political problem that can never be solved once and for all, and (2) people have incentives to lie. If you invent a generalized way to look up any weather forecasting API, somebody is going to realize that they can make money gaming the system somehow. PayPal is really in the business of fraud detection, and Google is in the business of fighting against blackhat SEO (and click fraud).

So take your automated API discovery utopia, and explain to me what happens when blackhats try to game the system and pollute your vocabulary for profit. Tell me what will happen when 6 vendors implement an API vocabulary, but none of them quite agree on the corner cases. This is the hard part.

