I can't help feeling like there is a lot of tool blaming happening when the wrong tools were used in the first place. Wikipedia is pretty easy to scrape for general blocks of text (I'm the author of an IRC bot which did link previewing, including Wikipedia), but if you need specific, machine-readable passages which aren't going to change sentence structure over the years, then you really should be getting that information from a proper API which catalogues it. Even if it means having to build your own backend process which polls the websites for the 20 respective OSs individually so you can compile your own API.

Using an encyclopedia which is constantly being updated and is written to be read by humans as a stable API for machines is just insanity in my honest opinion.
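
For what it's worth, the polling approach doesn't have to be heavyweight. Here's a rough Python sketch of the "compile your own API" idea; the source URLs and the trivial parsing are placeholders for whatever each OS actually publishes, not real endpoints:

    import json
    import time
    import urllib.request

    # Placeholder sources: map each OS to wherever it publishes its
    # latest version. These URLs are hypothetical.
    SOURCES = {
        "debian": "https://example.org/debian/latest",
        "freebsd": "https://example.org/freebsd/latest",
        # ... one entry per OS you track
    }

    def fetch_latest(url):
        """Fetch a source and return the version string it reports."""
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode().strip()

    def poll_once():
        """Poll every source and compile the results into one document."""
        return {name: fetch_latest(url) for name, url in SOURCES.items()}

    if __name__ == "__main__":
        while True:
            with open("versions.json", "w") as f:
                json.dump(poll_once(), f, indent=2)  # serve this as your API
            time.sleep(6 * 60 * 60)  # re-poll every six hours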



Agreed. It's some mix of the XY problem plus the self-entitlement of "if I had the idea, then it should work."

Yet the classic HNer mistakes this for inherent weaknesses in the underlying platform, which they then need to share lest someone have something good to say about the platform. And they'll often be using words like "terrible", "garbage", and "I hate..."


Please, no one said Wikipedia was terrible. You're taking statements out of context. The original comment said:

> Wikipedia's markup is just terrible for trying to do any sort of scraping or analysis.

I'd like to emphasize the "for trying to do any sort of scraping or analysis." Should we instead lie and say it's wonderful for scraping?

It's not an insult, it's the truth. If you want to build an app that automatically parses Wikipedia, it will not be easy.


But again, that's the wrong tool for the job, so of course it's not going to be well suited. When it's that obviously the wrong tool, saying it's terrible is still kind of silly. It's like saying hammers are terrible at driving screws or cars make terrible trampolines.

> Even if it means having to build your own backend process which polls the websites for the 20 respective OSs individually so you can compile your own API

One caveat there: a page like that for MacOS doesn't exist. Scraping Wikipedia may be insane, but it's often the best option. You can scrape MacRumors or something, but then you're still just parsing a site meant to be read by humans. You also still risk those 20 OS websites changing as much as Wikipedia does.


Indeed, but I was thinking of endpoints that have remained relatively static because they are auto-generated or have a history of being scraped. Some Linux distros have pages like that (even if it's just a mirror list).

But my preferred solution would be using whatever endpoint the respective platform uses for notifying their users of updates.
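
Debian is a concrete example: its mirrors publish a machine-readable Release file that apt itself consumes, made of plain "Field: value" lines. A minimal Python sketch, assuming the standard dists/stable/Release layout:

    import urllib.request

    # Debian mirrors publish a plain-text Release file with "Field: value"
    # lines (Suite, Codename, Version, Date, ...) that apt itself reads.
    RELEASE_URL = "https://deb.debian.org/debian/dists/stable/Release"

    def stable_version():
        with urllib.request.urlopen(RELEASE_URL, timeout=10) as resp:
            for line in resp.read().decode().splitlines():
                if line.startswith("Version:"):
                    return line.split(":", 1)[1].strip()
        raise RuntimeError("no Version field in Release file")

    print(stable_version())  # prints the current stable point release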

This strikes me as a solved problem, but even if you can't find a ready-to-use API, I'd probably sign up to a few mailing lists, update my own endpoint manually, and offer 3rd party access for a modest subscription.

Either way, scraping an encyclopedia for an English phrase to parse strikes me as the worst of all the possible solutions.


At the very least, parsing the Release History table seems way better than looking for a particular phrase in the text.
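
Something like pandas.read_html gets most of the way there; the caveat is that the table's position and column headers are assumptions about the current article layout, and they'll break whenever editors restructure it:

    import pandas as pd

    # read_html pulls every <table> from the page; "match" narrows the
    # result to tables whose text contains the given string. Requires
    # lxml or bs4 under the hood.
    tables = pd.read_html("https://en.wikipedia.org/wiki/MacOS",
                          match="Version")
    history = tables[0]     # assume the first match is the release history
    print(history.tail(1))  # assume the newest release is the last row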

> I can't help feeling like there is a lot of tool blaming happening when the wrong tools were used in the first place.

Well, let's be fair: it's a bit surprising that a series of clear, readable key/value pairs in that Wikipedia "MacOS" infobox table can't be delivered by their API as JSON key/value pairs.

Using their API I can get JSON back, but the page content is one big blob sandwiched in there. With the xmlfm format[1], that same blob has some nice-looking "key = value" pairs, too. Funny enough, those pairs for some reason exclude the "latest release" key.

Anyway, is there any case where a <tr> containing two columns in the Wikipedia infobox table doesn't hold a key/value pair? That just seems like such a valuable source of data to make available in simple JSON format.

[1] https://en.wikipedia.org/w/api.php?action=query&prop=revisio...
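
For what it's worth, the key/value pairs are recoverable client-side by parsing the wikitext that API call returns. A rough Python sketch using the third-party mwparserfromhell library (the "infobox" template-name check is an assumption about how the article is currently structured):

    import json
    import urllib.parse
    import urllib.request

    import mwparserfromhell  # pip install mwparserfromhell

    # Fetch the raw wikitext of the MacOS article via the same API.
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": "MacOS",
        "format": "json",
        "formatversion": "2",
    })
    url = "https://en.wikipedia.org/w/api.php?" + params
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)

    wikitext = (data["query"]["pages"][0]
                    ["revisions"][0]["slots"]["main"]["content"])

    # The infobox is just a template with |key=value parameters, so the
    # pairs fall out once the wikitext is parsed.
    code = mwparserfromhell.parse(wikitext)
    for template in code.filter_templates():
        if template.name.strip().lower().startswith("infobox"):
            infobox = {str(p.name).strip(): str(p.value).strip()
                       for p in template.params}
            print(json.dumps(infobox, indent=2))
            break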

