Well, thereby hangs a tale. But the short answer is that, by and large, Google seems to be aggressively uninterested in scraping that kind of structured/semantic data from the public Web. In fact it even seems to have been active in trying to prevent that kind of data from finding its way into webpages at all: remember the arcane but nasty scrap about metadata in HTML5? Well, this is what that was all about.
I'm pretty sure you're misinterpreting it. Google did not simply write a web scraper that pulls a <business_hours> or a <dc:business_hours> tag out of the web. They wrote a web scraper that super, super intelligently examines the HTML and looks for "anything that looks like business hours"; maybe it's in a table, maybe it's days of the week separated by <br>s, maybe it's in <div>s or <span>s with suggestive CSS class names, maybe it's just in a pile of other HTML. The exact promise of the Semantic Web was that we could just load up a page and get a <business_hours> out of it. Google had to extract the "semantics" with everything but the "semantic web", because the "semantic web" is a no-show. Throwing a crapton of machine learning and humans at extracting semantically useful information from a page is precisely what the Semantic Web isn't.
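To make that contrast concrete, here's a deliberately crude sketch of the heuristic hunt (purely illustrative: the regex, class names and sample HTML are made up, and this is nowhere near what Google actually runs):

    # With real semantic markup you'd read one tag; without it you go hunting
    # for "anything that looks like business hours" in arbitrary HTML.
    import re
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    HOURS_PATTERN = re.compile(
        r"(mon|tue|wed|thu|fri|sat|sun)[a-z]*\.?\s*[-]?\s*.*?\d{1,2}(:\d{2})?\s*(am|pm)",
        re.IGNORECASE,
    )

    def scrape_business_hours(html: str) -> list[str]:
        """Return text fragments that merely *look* like opening hours."""
        soup = BeautifulSoup(html, "html.parser")
        candidates = []
        # Check tables, <br>-separated lines, and divs/spans with suggestive class names.
        for el in soup.find_all(["td", "li", "div", "span", "p"]):
            text = el.get_text(" ", strip=True)
            if HOURS_PATTERN.search(text):
                candidates.append(text)
        return candidates

    print(scrape_business_hours(
        '<div class="biz-info">Mon-Fri 9am - 5pm<br>Sat 10am - 2pm</div>'
    ))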
Which is why it is bizarrely un-self-aware when Semantic Web advocates almost inevitably cite that as their biggest success. It isn't. It's their biggest failure.
A lot has happened to HTML since I started with web development in the '90s, but I'm still waiting for the semantic web to catch on ... One problem is the rule of least resistance: people will usually do as little markup as possible. Adding schema itemprop etc is a lot of work for very little gain. The only effect is that it's now easier for Google to scrape your web page and show the data in the search result, instead of sending the user to your site. You offer a free lunch, and your biggest competitor takes it all, then sells it for profit. For this reason, sites like Facebook and LinkedIn make a lot of effort to make it hard or impossible to scrape their data. There are also a lot of other sites that keep hoarding their data, walling it off from the rest of the world. This is why the semantic web has a very hard time, even though it's such a great idea.
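For a sense of what "adding schema itemprop etc" actually means for an author, here is an illustrative snippet (the page is made up; openingHours is a real schema.org property): a few extra attributes the visitor never sees, which a scraper can then lift with one line.

    # The author's side: invisible-to-visitors attributes sprinkled into the markup.
    from bs4 import BeautifulSoup

    annotated = """
    <div itemscope itemtype="https://schema.org/LocalBusiness">
      <span itemprop="name">Joe's Diner</span>
      <span itemprop="openingHours" content="Mo-Fr 09:00-17:00">Open weekdays 9-5</span>
    </div>
    """

    # The scraper's side: the "free lunch" is now a one-liner.
    soup = BeautifulSoup(annotated, "html.parser")
    print(soup.find(attrs={"itemprop": "openingHours"})["content"])  # Mo-Fr 09:00-17:00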
Various bits of the lowercase semantic web (and its related uppercase Semantic Web cousin) show through on Google results - mostly Google Rich Snippets, such as reviews metadata. These mostly require human intervention to enable on a site-by-site basis (sometimes by the site owner, sometimes by humans at Google itself).
The deeper problem of the semantic web is that it's metadata, which tends not to be visible to visitors. Because of that, it's easy to forget, overlook, or leave incomplete, and it nefariously lets non-whitehat SEOers do their over-optimisation invisibly rather than in full view of the human audience.
I wish it wasn't true, but Cory Doctorow's portmanteau "metacrap" is unfortunately still accurate ( http://www.well.com/~doctorow/metacrap.htm ). We have made several small metadata improvements over the years, but not much progress in substantially overturning Doctorow's original concerns.
Human beings are not great sources of accurate and up-to-date metadata, so we rely on scripts and services to fill in the blanks that we leave (publish times being recorded by WordPress, for example).
Converting human content into metadata ends up being an automated, human-hands-off affair, which means that those automated tools need to be able to parse and extract information / meaning from human prose. Very much what Google has been doing since inception.
It's flawed because it relies on automated interpretation of prose. But there isn't a viable non-flawed method of getting the same information without imposing academic-like constraints on the Web.
Google never had a problem working with content that was blended with its visual presentation. They literally don't care. They crawl anything, even PDFs and other non-HTML content.
The semantic web was born from a belief that it would enable a peer-to-peer web that never materialized. Accessibility was better solved by ARIA annotations a long time ago, and the semantic web was never going to properly solve it.
> Google info boxes[...] have nothing to do with the semantic web and everything to do with Google throwing a crapton of machine learning and humans at the problem of parsing distinctly non-semantic HTML until they cracked the problem
Wow, this feels like something out of the mid-2000s.
The problem with the semantic web is that the incentives are mostly wrong. People make websites to be viewed by people. Doing the semantic stuff is work that can be difficult to get right. If it doesn't bring you visitors, why would (most) people bother?
Sure, you can get fancy tools that use the data (let's ignore the scaling issues that many of them have), but fancy tools separate the data from the context, further reducing incentives (for many content creators). If they ever did take off, we would have massive spam/quality problems, because we have now separated the data from the website and all the visual indicators of how spammy it is, which is perfect for dark SEO and other spammers/phishers.
For that matter, just look at metadata on the web in general and what a mess that is. <meta name=description (or keywords) - spammers took over and nobody uses them. <link rel="next" - I think old Opera is the only thing to ever do anything with that metadata.
The only metadata systems that have ever worked are the ones where the site author gets something out of it: e.g. Technorati tags, <link> to RSS feeds, Facebook Open Graph, various Google things, etc. Or on the other side, when that is their whole reason for being, like https://wikidata.org and maybe some GLAM stuff. Everyone making arbitrary metadata out of the goodness of their heart, and having it be of consistent quality and meaning, is a pipe dream.
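Open Graph is a nice concrete example of the "author gets something out of it" case: the payoff is a share card, and the consuming side is trivial. A rough sketch (illustrative only; the tag names follow ogp.me):

    # Reading Open Graph metadata: the kind that stuck, because authors benefit.
    from bs4 import BeautifulSoup

    def read_open_graph(html: str) -> dict[str, str]:
        soup = BeautifulSoup(html, "html.parser")
        og = {}
        for meta in soup.find_all("meta"):
            prop = meta.get("property", "")
            if prop.startswith("og:"):
                og[prop] = meta.get("content", "")
        return og

    print(read_open_graph(
        '<meta property="og:title" content="Example"/>'
        '<meta property="og:image" content="https://example.com/card.png"/>'
    ))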
Not to mention the negative incentive of obliterating the walled garden, which, as much as it sucks, is something the corporate overlords like a lot.
Ah yes, the semantic web. Remember when that was going to change everything? All it required was for everyone on the web to meticulously format their data into carefully structured databases. I can't imagine why it never gained much traction.
As an interesting side note, part of the reason for the success of Google is that, through the PageRank algorithm, they were able to extract highly important implicit data on the relative popularity, authoritativeness, and context of links rather than being forced to rely only on explicit data.
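For illustration, here is a minimal power-iteration sketch of that idea - links as implicit votes, no explicit metadata required. It is heavily simplified (toy graph, no personalization) and nothing like Google's production system:

    def pagerank(links: dict[str, list[str]], damping: float = 0.85, iters: int = 50) -> dict[str, float]:
        """Toy PageRank: a page matters if pages that matter link to it."""
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                if not outlinks:  # dangling page: spread its rank evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / len(pages)
                else:
                    for target in outlinks:
                        new_rank[target] += damping * rank[page] / len(outlinks)
            rank = new_rank
        return rank

    # "c" collects links from both "a" and "b", so it ends up ranked highest.
    print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))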
I will make the claim that in the future devices and systems which work on implicit context and metadata are going to be more successful at these sorts of high level pseudo-cognitive capabilities than anything which is dependent on people changing the way they do everything.
There was/is typically zero incentive to semantically encode data. As shown by things like IBM's Watson, you're better off extracting sentence semantics with NLP than by hoping someone has encoded the data for you.
And yet, adoption of SemWeb tech is growing. Google and Yahoo both rolling out their respective "rich snippets" type interfaces a couple of years ago helped a bit. Those acts raised awareness of the value of SemWeb tech and gave it a bit of a kick. I saw a post[1] earlier today that mentioned some research showing that 25% of the HTML pages out there have embedded structured data (microformats, RDFa).
- Certain kinds of implicit metadata is awfully useful, in fact. Google exploits metadata about the structure of the World Wide Web: by examining the number of links pointing at a page (and the number of links pointing at each linker), Google can derive statistics about the number of Web-authors who believe that that page is important enough to link to, and hence make extremely reliable guesses about how reputable the information on that page is.
This sort of observational metadata is far more reliable than the stuff that human beings create for the purposes of having their documents found. It cuts through the marketing bullshit, the self-delusion, and the vocabulary collisions.
I'm not sure why you were downvoted. "Semantic Web" was the first thing that came to my mind after reading the first couple paragraphs of the article. I thought he was going to head that direction as well. I was sorely disappointed!
There are surely diminishing returns for doing increasingly sophisticated things with the contents of HTML tags to parse and understand webpages, using inbound links to rank them, etc.
Cory Doctorow's essay, "Metacrap," does a great job of listing the reasons a Semantic Web-style metadata attempt will always fail when left to the "public" to implement. One thing that the old human-run Yahoo! and the Open Directory Project did get right was the quality of results, but since updates are made at the speed of humans, these seem to be pretty much impossible to keep current.
Perhaps there is some neat way to use everyone's browsing histories to create a semantic link between content on the web. But that will never happen because of (extremely valid) privacy concerns.
Well, shame on the author for writing such a myopic rant piece containing no new ideas or proposals.
This will eventually mean anyone engaged in getting data off the web for machine processing (i.e. scraping) is probably looking at needing to run headless browsers and use OCR to get the information out in the future. I kind of wonder if Google etc. do this already to extract information based on layout.
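A rough sketch of what that might look like, assuming Playwright for the headless browser and Tesseract (via pytesseract) for the OCR; the URL is just a placeholder:

    # Render the page in a headless browser, screenshot it, then OCR the pixels.
    from playwright.sync_api import sync_playwright  # pip install playwright
    import pytesseract                               # pip install pytesseract (needs Tesseract installed)
    from PIL import Image                            # pip install pillow

    def scrape_by_pixels(url: str) -> str:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url)
            page.screenshot(path="page.png", full_page=True)
            browser.close()
        return pytesseract.image_to_string(Image.open("page.png"))

    print(scrape_by_pixels("https://example.com"))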
Somewhere along the way we need a better way to exchange data between websites before this comes along and sets us back by a couple of decades. The semantic web stuff (sadly) hasn't really worked, but something of that ilk is needed to get us to the future.
I think the people really pushing for the Semantic Web kind of gave up. You hardly ever hear that term anymore.
I guess the value proposition of "You can add a whole bunch of complexity to your webpage that won't affect what people see so robots can scrape your page easier" didn't really resonate with developers. Also, the proposals I saw were much too granular and focused on people writing scientific papers on the web. It wasn't a good mesh for the "garbage" web, which is like 99% of everything.
One of the biggest barriers to the semantic web is the barrier to entry. Scraping web pages is hard. Parsing HTML (which probably doesn't validate) is hard. Extracting semantic meaning from parsed HTML is hard.
Even once you've piled on the libraries and extracted the bit of information that you need, what do you do with that data? You process it a bit and store it in some kind of data structure. But at this point, you could have just pinged the website's API and gotten the same data (and more) in a data structure.
It turns out it's a heck of a lot easier to return a blob of JSON than it is to process text in markup on a page. And smaller, as well: JSON often takes up far less space than the corresponding markup for the same content. That's a big deal when you're processing a very large amount of information.
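A small illustrative comparison of the two paths (the markup and the API endpoint are both hypothetical):

    # Path 1: scrape markup and rebuild a data structure by hand.
    import json
    from bs4 import BeautifulSoup

    html = '<ul class="store"><li data-sku="42"><b>Widget</b> $9.99</li></ul>'
    soup = BeautifulSoup(html, "html.parser")
    items = [
        {"sku": li["data-sku"], "name": li.b.get_text(), "price": li.get_text().split("$")[-1]}
        for li in soup.select("li[data-sku]")
    ]

    # Path 2: ask a (hypothetical) API for the same thing, already structured:
    #   import requests
    #   items = requests.get("https://example.com/api/products").json()

    print(json.dumps(items))  # the JSON is also smaller than the markup it was dug out of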
There's the promise that AI will someday make this easier: if you eliminate the parse-and-process-into-a-data-structure step and just assume that an AI will do that part for you, you're in good shape. But that's nowhere near being a practical reality for virtually all developers, and APIs eliminate this step for you.
Even if you use something like HTML microdata, there are very few consumers of the data. Some browsers give you access to it, but that doesn't make it extremely useful: if you generated the data on the server side, why not just make it into a useful interface? Or expose the data as raw data to begin with? Going through the extra effort to use these APIs is a redundant step for most use cases.
The financial incentives have become stronger for building walled gardens than a semantically open web. The semantic data has been more useful to the giants that monetize it than to the millions of small publishers who are supposed to abide by the rules and maintain it. The issue is even bigger if you are listing valuable goods - from products, to jobs, to real estate/rental listings - as part of your marketplace or business. Aggregators like Google can scrape and circumvent you by taking away your users earlier in the acquisition chain, so why bother giving them your product graph?
I wonder what things would look like if the 'semantic web' (the actual web3?) had taken off and we had regular and rich machine-readable metadata for just about everything, rather than having to rely on what is largely subpar scraping and 'AI' systems.
People have pointed out recently how Google search seems to struggle as sites on the internet turn more and more into apps rather than standardized documents, and users just go and search on reddit instead. Having a standard to encode semantics seems honestly necessary at this point if you want to keep things interoperable.
> Data on the web will only be "semantic" if that is the default, and with this technique it will be.
Not going to work unless imposed by some external force. The semantics of the web can more practically be extracted with neural nets, but it's a long tail and there are errors. Lots of good work recently in parsing tables, document layouts and key-value extraction. LayoutLM and its kin come to mind.[1]
Not going to happen. The reasons for the Semantic Web never taking off were never technical. Websites already spend a lot of money on technical SEO and would happily add all sorts of metadata if only it helped them rank better. Of course, many sites’ metadata would blatantly “lie”, and hence the likes of Google would never trust it.
Re exposing an entire database of static content: again, reality gets in the way. Websites want to keep control over how they present their data. Not to mention that many news sites segregate their content as public and paywalled. Making raw content available as a structured and queryable database may work for the likes of Wikipedia or arxiv.org. But it's not likely to be adopted by commercial sites.