Is there a reasonable alternative? I've worked with RDF and LD and can agree with most of your points but would like to see if there is something better on the horizon.
RDF is a zombie format. It isn't entirely "dead", but after some wildly ambitious promises it hasn't lived up to any of them, and it seems only a few "true believers" are still working with it. This being a world of seven billion people, a few "true believers" can still be a few thousand people, but it's just not a technology I can say is vibrant, alive, or worth spending much time with. The core problem is that it solves the easy problems, and not even terribly well (I more-or-less endorse titanix2's comment elsewhere in the thread), while leaving the hard problems mostly untouched. The excitement it generated 10-15 years ago mostly came from mistaking a solution to the easy problem for a solution to the hard one. Every system I've personally seen use RDF has ripped it back out again at some point. Whether that's the fate in store for the current users named in the article I can't say, but it certainly wouldn't be a bad guess.
If the space intrigues you, I would suggest starting with a graph database instead.
Please don't bring RDF out of its coffin. It has been tried and failed because it's overly complex and verbose. It's terrible, and technologies which still use it are terrible to interact with to this day.
Does anyone actually use RDF? The W3C seems to love it, but I can't help but feel it's one of those technologies that just fell through the cracks and failed to gain any users.
Thank you, that's valuable info. I started exploring the tools last year and dug into RDF, feeling very uncomfortable with the mismatch between the buzz and reality: "hey, but everybody's using JSON!" That's not helping LD's adoption...
Yeah, I touched on RDF in my post. I have some experience with it, but it is pretty complex for just data. I think it is powerful, but the thing that makes the web work is simplicity.
Thanks for your comment on it by the way. I'm still in the phase of gathering what everyone thinks of it. I've noticed that RDF seems a bit polarizing. I have the suspicion that people who feel neutral about it also don't feel the need to chime in.
To be honest, I'm not 100% sure. The tooling that exists - at least in the open source space - seems to be vastly dominated by Java and academic endeavours. That being said, I think the ideas behind RDF are pretty awesome, and its simple structure makes full indexing, Datalog inference and layering possible.
I can't find much to corroborate this article's take. RDF is a stark counter-example - a standard from the W3C. It has endorsement from Tim Bray, one of XML's co-authors.
I've also experienced performance issues with RDF stores, but over the last few years performance has improved a lot, and I think in another 1-2 years there will be a bunch of stores able to handle reasonably large numbers of triples with good performance.
Aside from that, I think that although RDF might not be the 'holy grail', it is in fact quite usable for a lot of problem domains, and saying that it needs to be discarded is a bit harsh :)
Familiarity isn't nearly enough if you want to implement something.
Talking about RDF is absolutely meaningless without talking about Serialisation (and that includes ...URGH.. XML serialisation), XML Schema data-types, localisations, skolemisation, and the ongoing blank-node war.
The semantic web ecosystem is the prime example of "the devil's in the details". Of course you can explain the general idea of RDF to somebody who knows what a graph is: "It's like a graph, but the edges are also reified as nodes."
But that omits basically everything.
It doesn't matter whether SPARQL is learnable or not; it matters whether it's implementable, let alone implementable performantly. And that's really, really questionable.
Jena is okay-ish, but it's neither pleasant to use nor bug-free, although Java has the best RDF libs generally (I think that's got something to do with academic selection bias). RDF4J has 300 open issues, but they also contain a lot of refactoring noise, which isn't a bad thing.
C'mon, rdflib is a joke. It has a ridiculous ratio of 200 open issues to about one commit a month, it's buggy as hell, and it's for all intents and purposes abandonware.
rdflib.js is in-memory only, so it's nothing you could use in production for anything beyond simple stuff.
Also there's essentially ZERO documentation.
And none of those except for Jena even step into the realm of OWL.
> What are the alternatives?
Good question.
SIMPLICITY!
We have an RDF replacement running in production that's twice as fast, and 100 times simpler.
Our implementation clocks in at 2.5kloc, and that includes everything from storage to queries,
with zero dependencies.
By having something that's so simple to implement, it's super easy to port it to various programming languages,
experiment with implementations, and exterminate bugs.
We don't have triples, we have tribles (binary triples, get it, nudge nudge, wink wink).
That's 64 bytes in total, which fits into exactly one cache line on the majority of architectures.
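A minimal sketch of what that fixed-size layout could look like (the field order, names, and the Node.js Buffer representation are my assumptions, not the actual implementation):

    // Hypothetical trible layout: 16-byte entity id | 16-byte attribute id | 32-byte value.
    const crypto = require("crypto");

    function makeTrible(entityId, attributeId, value32) {
      if (entityId.length !== 16 || attributeId.length !== 16 || value32.length !== 32) {
        throw new Error("expected 16 + 16 + 32 bytes");
      }
      const trible = Buffer.alloc(64); // exactly one cache line on most machines
      entityId.copy(trible, 0);
      attributeId.copy(trible, 16);
      value32.copy(trible, 32);
      return trible;
    }

    // Example: random ids and a short inline value, zero-padded to 32 bytes.
    const value = Buffer.alloc(32);
    value.write("hello");
    console.log(makeTrible(crypto.randomBytes(16), crypto.randomBytes(16), value).length); // 64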
These tribles are stored in knowledge bases with grow-set semantics, so you can only ever append (on a meta level,
knowledge bases do support non-monotonic set operations). That's the only way you can get consistency with
open-world semantics, which is something the OWL people apparently forgot to tell pretty much everybody who wrote
RDF stores, as they all have some form of non-monotonic delete operation. Even SPARQL is non-monotonic with its
OPTIONAL operator...
Having a fixed-size binary representation makes this compatible with most existing databases,
and makes it almost trivial to implement covering indices and multiway joins.
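To make the covering-index point concrete, here's a toy sketch under my own assumptions about the layout: an index is just the same 64-byte tribles with the components reordered and kept sorted (none of this is the actual code).

    const crypto = require("crypto");

    // Reorder the components of a 64-byte trible according to [offset, length] slices.
    function reorder(trible, order) {
      return Buffer.concat(order.map(([off, len]) => trible.subarray(off, off + len)));
    }

    // Component orderings, assuming entity(16) | attribute(16) | value(32):
    const EAV = [[0, 16], [16, 16], [32, 32]];
    const AEV = [[16, 16], [0, 16], [32, 32]];
    const VEA = [[32, 32], [0, 16], [16, 16]];

    // A covering index is then just a sorted list of reordered fixed-size keys.
    function buildIndex(tribles, order) {
      return tribles.map((t) => reorder(t, order)).sort(Buffer.compare);
    }

    const tribles = Array.from({ length: 3 }, () => crypto.randomBytes(64));
    const eavIndex = buildIndex(tribles, EAV);
    const aevIndex = buildIndex(tribles, AEV);
    const veaIndex = buildIndex(tribles, VEA);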
By choosing UUIDs (or ULIDs, or Timeflakes, or whatever; the 16 bytes don't care) for subject and predicate, we completely
circumnavigate the issues of naming and schema evolution.
I've seen so many hours wasted by ontologists arguing about what something should be called.
In our case it doesn't matter: each consumer of the schema can choose its own name for an attribute in its own code.
And if you want to upgrade your schema, simply create a new attribute id, and change the name in your code to point to it instead.
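A hypothetical illustration of what that looks like in practice (the UUIDs and names below are made up):

    // The attribute is identified only by its UUID; names live in consumer code.
    const FIRSTNAME_ID = "550e8400-e29b-41d4-a716-446655440000"; // made-up attribute id

    // Consumer A binds it under one name...
    const schemaA = { givenName: FIRSTNAME_ID };
    // ...consumer B binds the very same attribute under another.
    const schemaB = { firstName: FIRSTNAME_ID };

    // "Upgrading the schema" is just minting a new attribute id and re-pointing the name:
    const FIRSTNAME_V2_ID = "6fa459ea-ee8a-4ca4-894e-db77e160355e"; // made-up attribute id
    const schemaA_v2 = { givenName: FIRSTNAME_V2_ID };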
If a value is larger than 32 bytes, we store a 256-bit hash in the trible and store the data itself in a separate blob store (in our production case S3, but for tests it's the file system; we're eyeing an IPFS adapter, but that's only useful if we open-source it).
Which means it also works nicely with binary data, something RDF never managed to do well. (We use it to mix machine learning models with symbolic knowledge.)
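A minimal sketch of that rule, assuming SHA-256 as the 256-bit hash and an in-memory Map standing in for the S3/file-system blob store (all guessed, not the actual implementation):

    const crypto = require("crypto");

    const blobStore = new Map(); // hash (hex) -> original bytes; stand-in for S3 or the file system

    // Values up to 32 bytes fit inline in the trible; larger values are hashed and stored out of band.
    function encodeValue(bytes) {
      if (bytes.length <= 32) {
        const inline = Buffer.alloc(32);
        bytes.copy(inline);
        return inline;
      }
      const hash = crypto.createHash("sha256").update(bytes).digest(); // 32 bytes
      blobStore.set(hash.toString("hex"), bytes);
      return hash;
    }

    // Whether the 32 bytes are inline data or a blob hash would normally be known
    // from the attribute's schema; this toy version just checks the blob store.
    function decodeValue(value32) {
      return blobStore.get(value32.toString("hex")) ?? value32;
    }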
We stole the context approach from JSON-LD, so you can define your own serialisers and deserialisers depending on the context they are used in.
So you might have a "legacyTimestamp" attribute which returns a util.datetime, and a "timestamp" which returns a JodaTime object.
However, unlike JSON-LD, these are not static transformations on the graph; they're done just in time through the interface that exposes the graph.
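Roughly like this, as a guess at the shape of such a context (the attribute id, the wire format, and the decoder choices are all invented; in JS the two decoders just return a Date versus a plain number):

    const TIMESTAMP_ID = "7d442f38-1d9a-4b57-9e0a-2f64c0a1b001"; // invented attribute id

    // Stored value: milliseconds since epoch, big-endian, in the 32-byte value slot.
    const raw = Buffer.alloc(32);
    raw.writeBigUInt64BE(BigInt(Date.now()), 0);

    // Two contexts binding the same attribute id to different names and decoders.
    const legacyContext = {
      legacyTimestamp: { id: TIMESTAMP_ID, decode: (v) => new Date(Number(v.readBigUInt64BE(0))) },
    };
    const modernContext = {
      timestamp: { id: TIMESTAMP_ID, decode: (v) => Number(v.readBigUInt64BE(0)) },
    };

    // Same stored bytes, decoded just in time through whichever context you use:
    console.log(legacyContext.legacyTimestamp.decode(raw)); // a JS Date
    console.log(modernContext.timestamp.decode(raw));       // plain epoch milliseconds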
We have two interfaces. One is based on conjunctive queries and looks roughly like this (JS as an example):
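(A sketch only: the original snippet isn't reproduced here, so the query shape, the `find`/`$` helpers, and the toy in-memory evaluator below are my own invention to keep the example self-contained.)

    // Toy in-memory tribles (entity, attribute, value), using strings instead of UUIDs.
    const tribles = [
      ["alice", "name", "Alice"],
      ["alice", "friend", "bob"],
      ["bob", "name", "Bob"],
    ];

    class Var { constructor(name) { this.name = name; } }
    const $ = (name) => new Var(name);

    // Evaluate a tree pattern { attr: constant | variable | nested pattern } at every entity.
    function find(pattern) {
      const results = [];
      const entities = [...new Set(tribles.map(([e]) => e))];
      for (const root of entities) matchTree(root, pattern, {}, (b) => results.push(b));
      return results;
    }

    function matchTree(entity, pattern, binding, emit) {
      const entries = Object.entries(pattern);
      const step = (i, b) => {
        if (i === entries.length) return emit(b);
        const [attr, expected] = entries[i];
        for (const [e, a, v] of tribles) {
          if (e !== entity || a !== attr) continue;
          if (expected instanceof Var) {
            step(i + 1, { ...b, [expected.name]: v });            // bind the variable
          } else if (typeof expected === "object") {
            matchTree(v, expected, b, (b2) => step(i + 1, b2));   // nested tree pattern
          } else if (expected === v) {
            step(i + 1, b);                                       // constant matches
          }
        }
      };
      step(0, binding);
    }

    // "Find the names of Alice's friends."
    console.log(find({ name: "Alice", friend: { name: $("friendName") } }));
    // => [ { friendName: 'Bob' } ]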
The other is based on tree walking: you get a proxy object that you can treat like any other object graph in your programming language,
and you can navigate it by traversing its properties, lazily creating a tree unfolding.
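A toy, self-contained take on that idea using a JS Proxy; the store layout and the `walk` entry point are my own, not the actual API:

    // Tiny in-memory store: entity id -> attribute -> value ({ ref } marks an entity reference).
    const store = new Map([
      ["alice", new Map([["name", "Alice"], ["friend", { ref: "bob" }]])],
      ["bob", new Map([["name", "Bob"]])],
    ]);

    function walk(entityId) {
      return new Proxy({}, {
        get(_target, attr) {
          const value = store.get(entityId)?.get(attr);
          // Follow references lazily, unfolding the graph into a tree on demand.
          return value && value.ref ? walk(value.ref) : value;
        },
      });
    }

    const alice = walk("alice");
    console.log(alice.friend.name); // "Bob", resolved lazily as properties are traversed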
Our schema description is also heavily simplified. We only have property restrictions and no classes.
For classes there's ALWAYS a counter example of something that intuitively is in that class, but which is excluded by the class definition.
At the same time, classes are the source of pretty much all computational complexity. (Can't count if you don't have fingers.)
We do have cardinality restrictions, but we restrict the range of each attribute to a single type. That way you can statically type-check queries and walks in statically typed languages. And remember, attributes are UUIDs and thus essentially free: simply create one attribute per type.
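For illustration, a guess at what such a schema declaration might look like (the shape, ids, and names are all invented):

    // Each attribute id is bound to exactly one value type and a cardinality.
    const schema = {
      personName:    { id: "3b1c0d2e-9a4f-4c1d-8e2a-000000000001", type: "string", cardinality: "one"  },
      personAge:     { id: "3b1c0d2e-9a4f-4c1d-8e2a-000000000002", type: "uint",   cardinality: "one"  },
      personFriend:  { id: "3b1c0d2e-9a4f-4c1d-8e2a-000000000003", type: "ref",    cardinality: "many" },
      // Same concept, different value type? Attribute ids are essentially free, so just mint another one:
      personAgeText: { id: "3b1c0d2e-9a4f-4c1d-8e2a-000000000004", type: "string", cardinality: "one"  },
    };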
In the above example you'll notice that queries are tree queries with variables. They're what's most common, and also what's compatible with the data structures and tools available in most programming languages (except for maybe Prolog). However, we do support full conjunctive queries over triples, which is what these tree queries get compiled to. We just don't want to step into the same impedance-mismatch trap that Datalog steps into.
Our query "engine" (much simpler, no optimiser for example), performs a lazy depth first walk over the variables and performs a multiway set intersection for each, which generalises the join of conjunctive queries, to arbitrary constraints (like, I want only attributes that also occur in this list). Because it's lazy you get limit queries for free. And because no intermediary query results are materialised, you can implement aggregates with a simple reduction of the result sequence.
The "generic constraint resolution" approach to joins also gives us queries that can span multiple knowledge bases (without federation, but we're working on something like that based on differential dataflow).
Multi-KB queries are especially useful since our default in-memory knowledge base is actually an immutable persistent data structure, so it's trivial and cheap to work with many different variants at the same time. These knowledge bases efficiently support all set operations, so you can do functional logic programming à la "Out of the Tar Pit" in pretty much any programming language.
Another cool thing is that our on-disk storage format is really resilient through its simplicity.
Because the semantics are append-only, we can store everything in a log file. Each transaction is prefixed with a
hash of the transaction and followed by the tribles of the transaction, and because of their constant size, framing is trivial.
We can lose arbitrary chunks of our database and still retain the data that was unaffected. Try that with your RDBMS;
you will lose everything. It also makes merging multiple databases super easy (remember: UUIDs prevent naming collisions, monotonic open-world semantics keep consistency, fixed-size tribles make framing trivial), you can simply `cat db1 db2 > outdb` them.
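A tiny illustration of why that concatenation is a valid merge under grow-set semantics: the merged knowledge base is just the set union of the tribles, so order doesn't matter and duplicates are harmless (hex strings stand in for 64-byte tribles here):

    const db1 = ["0a0a0a", "0b0b0b"];
    const db2 = ["0b0b0b", "0c0c0c"];

    // Concatenate, then (optionally) deduplicate; the result is the same whichever order you cat them in.
    const merged = [...new Set([...db1, ...db2])];
    console.log(merged); // [ '0a0a0a', '0b0b0b', '0c0c0c' ]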
Again, all of this in 2.5kloc with zero dependencies (we do have one on S3 in the S3 blob store adapter).
Is this the way to go? I don't know, it serves us well. But the great thing about it is that there could be dozens of equally simple systems and standards, and we could actually see which approaches are best, from usage.
The semantic web community is currently sitting on a pile of ivory, contemplating how best to steer the Titanics that are Protégé and the OWL API through the waters of computational complexity, without anybody ever stopping to ask whether that's REALLY been the big problem all along.
"I'd really love to use OWL and RDF, if only the algorithms were in a different complexity class!"
As somebody who worked with RDF and SPARQL for several years, I'd say none of these claims are actually true: RDF is very simple to work with (especially if you avoid the XML stuff), can be operated on by basic string-processing tools, and is conceptually pretty simple once you get into the right mindset. I think it's just suffering from bad documentation being over-exposed and good examples being under-exposed.