I've also had to parse Wikitext. The fact that there are 54 parsers in various states of disrepair listed here (and I have written a 55th) is not because people really like reinventing this wheel; it's because the complete task is absolutely insurmountable, and everyone needs a different piece of it solved.
The moment a template gets involved, the structure of an article is not well-defined. Templates can call MediaWiki built-ins that are implemented in PHP, or extensions that are implemented in Lua. Templates can output more syntax that depends on the surrounding context, kind of like unsafe macros in C. Error-handling is ad-hoc and certain pages depend on the undefined results of error handling. The end result is only defined by the exact pile of code that the site is running.
If you reproduce that exact pile of code... now you can parse Wikitext into HTML that looks like Wikipedia. That's probably not what you needed, and if it was, you could have used a web scraping library.
It's a mess and Visual Editor has not cleaned it up. The problem is that the syntax of Wikitext wasn't designed; like everything else surrounding Wikipedia, it happened by vague consensus.
It's even worse. The syntax of MediaWiki markup (parser functions, etc.) varies depending on the extensions installed in a given instance. And there is no way to obtain a list of parser functions and extension tags (the latter of which look like HTML tags, but parse differently). There is literally no way to reliably parse MediaWiki markup offline.
[0] can give you a parse tree -- but only of a top-level page; you can't recursively expand templates with it. And even that parse tree only tells you which templates should be expanded. After expanding templates you have to parse the whole thing again to actually interpret the formatting markup.
Oh, and expand strip markers: [1]. Some extensions which define their own tags want to output raw, unfiltered HTML. But tags are parsed in the template expansion stage and raw HTML isn't allowed in the formatting stage, so what do the extensions do? They output "strip markers" that are later replaced by the raw HTML.
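To make the parse-then-expand dance concrete, here's a rough sketch against the stock action API (my own illustration, to be clear, not whatever [0] refers to; the endpoint and parameter names are the standard ones, the User-Agent string is made up):

    # Pass 1 only tells you *which* templates exist; pass 2 expands them,
    # and you still have to parse the expanded wikitext again afterwards.
    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"

    def api_get(**params):
        params.update(format="json", formatversion="2")
        url = API + "?" + urllib.parse.urlencode(params)
        req = urllib.request.Request(url, headers={"User-Agent": "wikitext-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # Pass 1: parse tree of a top-level page; templates show up as <template> nodes.
    tree = api_get(action="parse", page="PHP", prop="parsetree")
    print(tree["parse"]["parsetree"][:300])

    # Pass 2: let the server expand templates; the result is wikitext, not HTML,
    # so the formatting markup still has to be interpreted in another pass.
    expanded = api_get(action="expandtemplates", text="{{As of|2010|7|5}}", prop="wikitext")
    print(expanded["expandtemplates"]["wikitext"])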
Most alternative parsing libraries (at least all the ones I looked at, and that includes Wikimedia's own Parsoid!) don't bother implementing all that complexity. Which means they can often be tripped up by sufficiently tricky markup, and that turns out to be quite a low bar.
Parsing MediaWiki markup properly is insanity on a level comparable only to TeX and Unix shell scripts. Even PHP is saner.
Most people who praise the fact that Wikipedia and other software is written in PHP have never actually taken a good look at the source code and tried to do anything with it.
Care to discuss the design and architecture of the MediaWiki parser and template system, and what its history and future look like?
I'm pretty confident I don't want this. I'd like to be proven wrong, but there have been maybe a couple of cases among hundreds where a perfectly working UI was actually improved instead of fucking broken by "getting a new look".
What I would actually like to see improved is how the data is represented internally. Let's face it: it has been a long time since Wikipedia became the largest and most accessible knowledge base in the world. It was made simple, which is part of why it is successful. MediaWiki is pretty much a blogging engine; article content and structure are restricted only by the guidelines, not by the engine. This is good, and at the early stages it was the only possible solution, since there is so much that can be useful in an encyclopedia article.
However, we now know a lot of patterns in how the data represented in different articles is similar. But wikitext is still basically just human language with a few markup elements. Sure, there are templates, which are slowly improving over time, but it's still pretty much a collection of free-form blog posts. There is almost no way to separate data from representation (i.e. to have a table's contents as a CSV, with a template assigned to render it). There have been some attempts to make this possible, but querying (or modifying) data automatically is still very much a pain. A lot of it requires parsing, and (perhaps even more annoyingly) pretty simple parsing -- but you have to do that work for every single problem you have. Instead of, you know, just querying data I know is there. In fact, when you start parsing templates it becomes very clear very quickly that they are vastly underused. Exactly the same data, like the years someone lived in bio articles, is still represented differently even in very elaborate and well-maintained articles.
Very little has been done in this regard over the last 10 years, I feel. And improvement does require a lot of work: looking for possible templates, promoting generic data representation among contributors, modifying wikitext to allow for it, improving the API. If they are looking for ways to spend money, I'd rather it be this, because making a lot of the data machine-accessible would be literally as huge as making Wikipedia human-accessible was 15-20 years ago. Sometimes I need to open 500 articles to read a single sentence in each of them, and some python script could have done it for me in a second (a couple of seconds maybe, depending on how Wikipedia stores and represents the data).
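For illustration, this is the kind of throwaway script I mean (a sketch using the standard action API; the titles and the birth_date parameter name are just examples, and the fragility of the regex is exactly the point):

    # Fetch raw wikitext for a few articles and fish one field out of the infobox.
    # Other articles spell the same fact differently, so every new question means
    # writing another ad-hoc parser like this one.
    import json
    import re
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"

    def fetch_wikitext(titles):
        params = urllib.parse.urlencode({
            "action": "query", "prop": "revisions", "rvprop": "content",
            "rvslots": "main", "titles": "|".join(titles),
            "format": "json", "formatversion": "2",
        })
        req = urllib.request.Request(API + "?" + params,
                                     headers={"User-Agent": "wikitext-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        for page in data["query"]["pages"]:
            yield page["title"], page["revisions"][0]["slots"]["main"]["content"]

    for title, text in fetch_wikitext(["Alan Turing", "Ada Lovelace"]):
        m = re.search(r"\|\s*birth_date\s*=\s*(.+)", text)
        print(title, "->", m.group(1).strip() if m else "no birth_date parameter")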
Please reread, for many purposes! I love Wikipedia.
The wiki markup is extremely complicated, and being user created, it is also inconsistent and error prone. I believe the MediaWiki parser itself is something like a single 5000-line PHP function! None of the alternative parsers I've tried are perfect. There is a ton of information encoded in the semi-structured markup, but it's still not easy to turn that into actual structured data. That's where the problem lies.
It's worse. The MediaWiki PHP code doesn't implement a proper scanner and parser; it's a bunch of regexes around which the code has grown more or less organically. Silent compensation for mismatched starting and ending tokens abounds, and causes problems for all consumers of the markup, in the same way that lenient HTML parsers do. The difference is that Wikipedia, as the sole channel for editing the markup, could have easily rejected syntax errors with helpful messages instead of silently compensating.
If it was anything else, I'd say "who cares," but this is "the world's knowledge" -- we absolutely should care about the format it's stored in. I'm glad to see people tackling this problem.
I'm actually curious why PHP was chosen instead of Rust or Go, given that the parsing team wasn't familiar with the language. I understand that MediaWiki is written in PHP, but it sounds like they were already comfortable with language heterogeneity.
They claim,
> The two wikitext engines were different in terms of implementation language, fundamental architecture, and modeling of wikitext semantics (how they represented the "meaning" of wikitext). These differences impacted development of new features as well as the conversation around the evolution of wikitext and templating in our projects. While the differences in implementation language and architecture were the most obvious and talked-about issues, this last concern -- platform evolution -- is no less important, and has motivated the careful and deliberate way we have approached integration of the two engines.
Which is I suppose a compelling reason for a rewrite if you're understaffed.
I'd still be interested in writing it in Rust and then writing PHP bindings. There's even a possibility of running a WASM engine in the browser and skipping the roundtrip for evaluation.
There are lots of attempts to write new Wikipedia parsers that just do "the useful stuff", like getting the text. They all fail, for the simple reason that some of the text comes from MediaWiki templates.
E.g.
about {{convert|55|km|0|abbr=on}} east of
will turn into
about 55 km (34 mi) east of
and
{{As of|2010|7|5}}
will turn into
As of 5 July 2010
and so on (there are thousands of relevant templates). It's simply not possible to get the full plain text without processing the templates, and the only system that can correctly and completely parse the templates is MediaWiki itself.
Yes, it's a huge system entirely written in PHP, but you can make a simple command-line parser with it pretty easily (though it took me quite a while to figure out how). The key points are to pull in MediaWiki's command-line bootstrap at the start of the script and then use the Parser class. You get HTML out, but it's simple and well-formed (to get the text, start with the top-level p tags).
To get it to process templates, get a Wikipedia dump, extract the templates, and use the mwdumper tool to import them into your local MediaWiki database.
I don't know if this is the best or "right" way to do it, but it's the only way I've found that actually works.
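For what it's worth, once the local wiki (with the templates imported) is running, you can also drive it over HTTP instead of a PHP command-line script. A rough sketch of that variant; api.php and action=parse are stock MediaWiki, and the localhost URL is just an example:

    # Ask the local MediaWiki to parse a snippet, then pull the text out of the
    # <p> tags of the resulting HTML (a loose approximation of "top-level p tags").
    import json
    import urllib.parse
    import urllib.request
    from html.parser import HTMLParser

    LOCAL_API = "http://localhost/w/api.php"

    class ParagraphText(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_p = False
            self.paragraphs = []
        def handle_starttag(self, tag, attrs):
            if tag == "p":
                self.in_p = True
                self.paragraphs.append("")
        def handle_endtag(self, tag):
            if tag == "p":
                self.in_p = False
        def handle_data(self, data):
            if self.in_p:
                self.paragraphs[-1] += data

    params = urllib.parse.urlencode({
        "action": "parse",
        "text": "about {{convert|55|km|0|abbr=on}} east of",
        "contentmodel": "wikitext",
        "disablelimitreport": "1",
        "format": "json",
        "formatversion": "2",
    })
    with urllib.request.urlopen(LOCAL_API + "?" + params) as resp:
        html = json.load(resp)["parse"]["text"]

    extractor = ParagraphText()
    extractor.feed(html)
    print("\n".join(p.strip() for p in extractor.paragraphs))  # about 55 km (34 mi) east of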
> Given the complexity of Wikipedia's deployment compared to a typical MediaWiki installation, it really wouldn't be much effort to hook into a parser in say, Java rather than PHP...
No doubt the incremental complexity for Wikipedia would be small in relative terms. I assume that argument would support a variety of proposals.
A solid scanner and parser in C/C++ would benefit a broader audience though. All the major scripting languages can be extended in C/C++. In fact, the ragel-based parser I mentioned earlier [1] was built to be used from within Ruby code.
Yeah, I understand that. I'm re-purposing the data and it's my job to decide how that works.
But this could be easier. What I hate about Wikimedia's format is templates. They are not very human-editable (try editing a template sometime; unless you're an absolute pro, you will break thousands of articles and be kindly asked never to do that again) and not very computer-parseable. They're just the first thing someone thought of that worked with MediaWiki's existing feature set and put the right thing on the screen.
Replacing templates with something better -- which would require a project-wide engineering effort -- could make things more accessible to everyone.
FWIW, I do make the intermediate results of my parser downloadable, although to consider them "released" would require documenting them. For example: [1]
The description is a little vague and hand-wavey. Here's a concrete example:
A lot of Wikipedia sites have scripts embedded in the wikitext which automatically generate or transform information on a page, e.g. automatically performing unit conversions to generate text like "I would walk 500 miles (804.67 km)", performing date math to automatically generate and update a person's age based on their birthdate, or querying structured data from Wikidata [1] to display in an infobox. One example of these scripts is the {{convert}} [2] template on the English Wikipedia.
Initially, these scripts were written in MediaWiki template logic [3], and were maintained individually on each wiki. This quickly proved unmaintainable, and some of those scripts were rewritten in Lua using the Scribunto extension [4], but these were still per-wiki, and there were frequently issues where different wikis would copy scripts from each other and introduce their own incompatible features.
The WikiFunctions project is an attempt to centralize development of these scripts, much like how Wikimedia Commons [5] centralizes hosting of freely licensed images and other media.
Wikipedia faces unique challenges, however, that many web applications do not face. Almost the entire rendered page is user-generated content produced from hand-written wiki markup that must be parsed and rendered, and that uses complex nested templates. MediaWiki is basically a Turing-complete application platform. The Wikipedia page for Barack Obama transcludes 199[0] unique templates, with 585 total non-unique invocations, and that's not counting the templates transcluded by those templates. Some of Wikipedia's templates are so complex that they had to be rewritten in Lua.
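If you want to sanity-check numbers like this yourself, the action API will list what a page transcludes. A quick sketch (note the API also lists templates pulled in indirectly, so the count won't match the footnoted figures exactly):

    # Count the unique templates transcluded by the Barack Obama article.
    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"
    params = urllib.parse.urlencode({
        "action": "parse",
        "page": "Barack Obama",
        "prop": "templates",
        "format": "json",
        "formatversion": "2",
    })
    req = urllib.request.Request(API + "?" + params,
                                 headers={"User-Agent": "template-count-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        templates = json.load(resp)["parse"]["templates"]
    print(len(templates), "unique templates transcluded (including indirect ones)")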
I don't think Wikipedia is really a typical case. Most websites probably don't have the CPU burden Wikipedia faces.
> The available parsers are not very robust and not very complete, because the wiki syntax is extremely convoluted and there is no formal spec. Second, the wiki syntax includes a kind of macro system. Without actually executing those macros you don't get the complete page as you see it online. The only way to get the complete and correct page content, to my knowledge, is to install the mediawiki site and import the data.
Precisely this makes Wikipedia a pain to work with for text mining. I thought I had found a great option in the Freebase WEX dump [1] of Wikipedia, which is in pure XML, but it has issues of its own with duplicated text etc., due to all the silliness in the original MediaWiki markup. If I try to extract the article texts again, I may go for the DBpedia long abstract dumps [2].
I am not sure what people really do when they use Wikipedia articles for research and applications. I assume they just do their best and try to get the cleanest possible text out of that enormous mess of markup (did I mention that the specification is the implementation itself?). If anyone has a good way to get raw text without any markup out of Wikipedia, I will gladly send you a postcard expressing my gratitude. It just makes me sad that we have an enormous, well-curated resource and we are stuck in the mud because of a stupid engineering decision in the early history of MediaWiki.
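To be concrete about what "doing their best" usually looks like: asking the API for a plain-text rendering via the TextExtracts extension, which is lossy (tables and infoboxes disappear), and that lossiness is rather the point. A minimal sketch:

    # Plain-text extract of one article via prop=extracts (TextExtracts extension).
    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "extracts",
        "explaintext": "1",
        "titles": "MediaWiki",
        "format": "json",
        "formatversion": "2",
    })
    req = urllib.request.Request(API + "?" + params,
                                 headers={"User-Agent": "plaintext-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        page = json.load(resp)["query"]["pages"][0]
    print(page["extract"][:500])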
The annoying thing about Wiktionary (for this purpose) is that it's explicitly not machine readable, so the entries can only really be handled as free-form HTML. Parsing the Wikicode, despite heavy use of standard templates, would be very hard to do robustly.
And also that it says that Wikipedia articles are in an impenetrable, UNPARSABLE format that you can't run scripts against. This is absurd. If it's unparsable, how does MediaWiki manage? Because it IS parsable! Oh my, in order to work with the content you need to either use the MW parser or write your own. Well, literally everything is like that. Just because there aren't 5 different MW parser engines out there doesn't mean it's unparsable.
I looked into this topic in some detail a while ago, because I copied Wikipedia's approach for an in-house CMS editor. I looked through the code, did loads of performance experiments, read through related forum posts, etc.
The wiki staff themselves admitted that the current parser is inefficient, partly because of PHP, partly because the underlying grammar was not designed to be efficient, and partly because the parser itself was built up over time and wasn't easy to optimise.
My approach was to write the parser and a matching "markdown inspired" format at the same time, optimised for speed. A handful of small tweaks to the syntax was all it took to largely eliminate backtracking and achieve nearly linear parsing time in most cases. If I remember correctly, I had it down to about 1-5 ms for a typical 64 KB page, and HTML generation was another 5-10 ms depending on various factors.
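To give a flavour of the kind of tweak I mean, here is a toy example with a hypothetical syntax (not the actual in-house format): if a bold marker simply toggles state instead of requiring a matching closer, the scanner never has to rewind and reinterpret an unmatched opener as literal text, so it stays single-pass:

    # Single-pass scanner for a toggle-style "**" bold marker (hypothetical syntax).
    def render_bold(src: str) -> str:
        out = []
        is_open = False
        i = 0
        while i < len(src):
            if src.startswith("**", i):
                out.append("</b>" if is_open else "<b>")
                is_open = not is_open
                i += 2
            else:
                out.append(src[i])
                i += 1
        if is_open:
            out.append("</b>")  # unterminated marker: close it rather than backtrack
        return "".join(out)

    print(render_bold("a **bold** word"))    # a <b>bold</b> word
    print(render_bold("a stray ** marker"))  # a stray <b> marker</b>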
What a lot of people are missing here is that wikis are not at all like typical "ERP" applications. The latter sometimes require dozens of API calls and thousands of database queries to generate just one kilobyte of output HTML. A wiki is very linear: a single 1-100 KB blob of text as input, a matching 1-100 KB blob of HTML as output. It all boils down to parsing and HTML generation efficiency; nothing else matters!
Most of these are special purpose hacks. Kiwi and Sweble are the most serious projects I'm aware of, that have tried to generate a full parse.
However, few of these projects are useful for upgrading Wikipedia itself. Even the general parsers like Sweble are effectively special-purpose, since we have a lot of PHP that hooks into the parser and warps its behaviour in "interesting" ways. The average parser geek usually wants to write to a cleaner spec in, well, any language other than PHP. ;)
Currently the Wikimedia Foundation is just starting a MediaWiki.next project. Parsing is just one of the things we are going to change in major ways -- fixing this will make it much easier to do WYSIWYG editing or to publish content in ways that aren't just HTML pages.
(Obviously we will be looking at Sweble carefully.)
If this sounds like a fun project to you, please get in touch! Or check out the "Future" portal on MediaWiki.org.
I disagree with most of this (for context, I'm a MediaWiki developer. I used to work for the WMF but don't anymore. My opinions are my own):
- Wikitext unparsable: wikitext is a bit insane, but there exists a parser called Parsoid. If we couldn't parse it, it would be impossible to make a visual editor.
- Most pages are redirects: not sure what the problem is.
- Boilerplate text: there is a template system to keep repeated text in only one place. Not sure what the issue is.
- He goes on a rant about no search without giving much context. There is a search feature based on Elasticsearch (older versions were based on Lucene directly). In my opinion it's a pretty decent search engine (especially compared to most sites that build their own search). I'm not sure what the actual complaint is.
- Complaints about Wikidata: this is more political than technical; however, "If wikidata was a company, it would not exist anymore, and you wouldn't have heard of it." seems patently false. Wikidata is pretty popular even outside of academia, and is used quite extensively.
- Category tree being a graph, not a tree: that's kind of unfortunate, but what exactly is the problem here? It's a problem on Commons, but I've never really seen how it's an issue in practice on Wikipedia. Complex categorization is being taken over by Wikidata anyway.
- Template ecosystem is complex: it could certainly be better, but the complexity here is a trade-off allowing more flexibility and allowing the system to evolve.
- Inclusionist vs. deletionist: no comment.
- UI design: perhaps a fair point here, although I do kind of like the stability of the current design. Most of the modern web sucks, imho.
- Moral failure: I don't want Wikipedia to fix the world. That's not its role. Its job is to document, not to partake.
- Visual editor not being enabled by default: I agree, although I think it was pushed too hard in the early days when there were still kinks, but it's long past time now.
To be clear, I definitely don't think Wikipedia is perfect; I just disagree with some of these specific criticisms.
I think of the MediaWiki wikitext specification and how that started as a simplification of HTML and was incrementally extended into a barely-computable disaster that, quite literally, set back a Wikipedia visual editor by about six years.