Why do we need database joins? (lemire.me)
28 points by timf | 2010-12-05 13:03:45 | 32 comments




The site is down for me.

Anyway, I don't think database joins are all that important. They make sense when structuring your schema - only store a unique piece of data once and then connect other data to it. This is a very sound design choice.

But in light of modern key/value stores, I don't think one needs to perform joins on the database anymore. Sure, it's theoretically a bit faster than performing the join in your app, but once you consider all the other implications, you quickly realise you don't need them.

Consider these two options. Pull data from a db and it comes back neatly tied together with all its meta-data. But the query takes almost a second, and the DB is always worried about consistency and stuff, so writes are kind of slow too.

Or: pull data from a db, and if you need some related data, you make another query based on the key stored in the first piece of data. Each query takes a fraction of a second; together they don't take as long as the full join query ... and you're only joining stuff when you actually need all of the data.

The added benefit is that in this situation the DB doesn't have to worry about almost anything and is thus much faster.

I hope that made some sense. I'm preparing a seminar on NoSQL, so I'd love to hear some input on my understanding of things :)


The problem is that updates become a monster problem if you have lots of connected records, since the same information is duplicated in many places across your data store. Failed updates can easily lead to inconsistencies.

It depends on the data of course, but sometimes it might be easier to just create a new entry rather than update an old one :)

> The site is down for me.

http://pastebin.com/Dmpyyv8Q


> But the query takes almost a second

I don't think the existence of a join is sufficient to cause slowness like this.

> and when the DB is always worried about consistency and stuff so writes are kind of slow too.

That has nothing to do with relational data per se, unless you're using foreign keys and calling that a join (which is actually fair). NoSQL stores typically do lots of work to ensure consistency, and many support multi-statement transactions on top of that. Unless we're talking about only keeping data in memory, but you can make a relational db do that too.

> If you need some related data, you make another query based on the key stored in the first piece of data. Each query takes a fraction of a second, together they don't take as long as the full join query

That is exactly how a join is executed under the hood: loop over rows in one table, look up a value in another table. However, the db won't force you to join based on a key. If all you're doing is looking up unique keys, there's no way multiple round trips will be faster unless all the data won't fit on one machine.
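
To make that concrete, here is a minimal sketch of what "joining in the app" amounts to, assuming a hypothetical key/value client kv with a get(key) method and an invented order/customer layout (this is not any particular store's API):

  # The same nested loop a database's nested-loop join performs,
  # except each lookup is a network round trip instead of an
  # in-process index probe.
  def orders_with_customers(kv, order_ids):
      results = []
      for order_id in order_ids:
          order = kv.get("order:%s" % order_id)                     # first query
          customer = kv.get("customer:%s" % order["customer_id"])   # the "join"
          results.append((order, customer))
      return results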


Down for me too, so I'm only responding to your comment.

>[using a DB join means that] the query takes almost a second

This is, at best, an overgeneralization and at worst outright FUD. I don't know what sort of data or database engine you're working with that might make this true, but it's definitely not a requirement that it be this way. The RDBMS backing my site is trivially able to return all needed data in fractions of a second for most cases regardless of the presence of joins. If it can't, that generally just means that I forgot to index something properly.

Moving past this, going with a simple key/value store instead of an RDBMS means giving up at least three of the ACID properties in favor of nothing more than a speed gain.

- Atomic: No transactions. If the system dies between two related commits...oops, your dataset is inconsistent.

- Isolated: No transactions, again. You cannot make guarantees about what information an object will have -- e.g., you cannot say "Every user will always have at least one entry in Pages, because we automatically create a non-deletable Home page for every user at signup." If you rely on that, you will eventually hit a user who is halfway through the signup process and has a user entry but no Page entry. (A transaction sketch for that case follows this list.)

- Durable: If your backing store has not actually flushed your write to disk and the system crashes...oops, you're hosed. I hope you weren't storing your Accounts Receivable data in that NoDB.
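
For contrast, a minimal sketch of that signup case wrapped in a transaction, using Python's built-in sqlite3 module and invented users/pages tables purely for illustration:

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
      CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
      CREATE TABLE pages (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
  """)

  def sign_up(conn, name):
      # Both inserts commit together or not at all; a crash halfway
      # cannot leave a user row without its Home page row.
      with conn:  # the connection acts as a transaction context manager
          cur = conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
          conn.execute("INSERT INTO pages (user_id, title) VALUES (?, ?)",
                       (cur.lastrowid, "Home"))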


>Moving past this, going with a simple key/value store instead of an RDBMS means giving up at least three of the ACID properties in favor of nothing more than a speed gain.

This is, at best, an overgeneralization and at worst outright FUD.

>The CouchDB file layout and commitment system features all Atomic Consistent Isolated Durable (ACID) properties.

http://couchdb.apache.org/docs/overview.html (to pick a single counterexample)

I have no idea what anti-NoSQL things you've been reading, but they've been wildly incorrect. And as to guarantees of what info an object has, schema-based NoSQL databases exist, just as schema-free RDBMS databases exist. Triggers frequently exist - you can automatically create a non-deletable (permissions exist too) home page for every user at signup.

You're entirely correct that the claim about DB joins taking a long time is similarly wildly incorrect, though I'm not certain that's what they're claiming. It sounds to me more like they're saying that pulling all related info on primary object X in your system takes longer in a highly-normalized RDBMS than in a document-based DB where it's all in the same location. Which is frequently correct.


I agree that for most requirements I have, an SQL database can deliver the data I need, quickly, unless I am missing an index or using an ORM tool.

For Atomic, put all the related stuff in a document and commit it.

Your issue with home pages has a couple of obvious solutions:

1. Deal with the fact that some users don't have a home page and consider them not yet signed up.

2. Create the home page first, before you create the user

3. Put the home page in the user document, or put the user document in the home page document.

Going with a NoSQL store gives you more than speed; it gives you simplicity and flexibility along some dimensions.

We run our SQL database with replication, and the replication is not synchronous. If we really lose our primary (bomb blast), the replica can be out of date and we can lose data. We are a very large company, and all the databases are configured like this. Many NoSQL solutions (Google) are configured more conservatively.


Ok, serious question. I have been writing SQL for 7 years now and can normalize schemas with the best of them, but I want to learn the proper techniques for creating scalable data sets like what one would store in GAE BigTable or Amazon SimpleDB...

Let's say I have a book review site. In my SQL world I have tables of 'users', 'authors', 'books', and 'reviews'. When a user creates a review, it creates a new row in 'reviews' which contains the userid, bookid, and rating. Throw some JOINs together and I can pull out rows that have the user data, author data, book data, and the rating all together.

What do you do in non-normalized world? When a user creates a review, just insert a row into a 'review' table that has user_id, user_name, author_name, book_title, book_isbn, book_etc, ..., rating ??

I can see how this would be a much more scalable table and fetching a review is just one row, but it just seems like so much duplicate data...? Or maybe I am thinking about it the wrong way?


You're not. Denormalization implies duplication (though not all normalizations are created equal). It also usually means that you have to replicate DB functionality in the client (e.g. writing data in two places yourself).

But denormalization is not only a means to "infiniscale" storage; you can use it even at smaller scale just to get better performance at certain things.

As usual, everything is a trade off.


You'd probably do the same thing, except you add columns users.reviews and books.reviews that are lists of ids. For example, Joe.reviews would contain a list of reviews by Joe. Essentially you're doing what a good database would do for you...

Your review table can still only contain the user_id, book_id, and rating.

While using an RDBMS, instead of using joins, you want to get the user and book as separate objects anyway. The three objects (user, book, and review) should be cached separately, so you can reuse them in the various places around your site. Of course, where you have a list of items to get users or books from, you will want to query your cache and DB using multi_get or IN (id_1, id_2, ...).

The idea behind these scalable data sets is that we end up caching our data differently than how we store the same data in a relational store. Our code then checks the cache first in one fashion and the relational store in another; why not query both the same way? We can "simplify" how we think about our data models, whether key/value, document, graph, or otherwise. We end up building these models on top of the RDBMS anyway, in our caching layer.
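
A rough sketch of that layout in Python, with an invented key/value client kv exposing get and multi_get (not any particular store's API), purely to illustrate the id-list approach:

  # Hypothetical documents:
  #   user:42   -> {"name": "Joe", "reviews": [7, 9]}
  #   book:12   -> {"title": "...", "author_id": 3, "reviews": [7]}
  #   review:7  -> {"user_id": 42, "book_id": 12, "rating": 5}
  def reviews_for_user(kv, user_id):
      user = kv.get("user:%s" % user_id)
      reviews = kv.multi_get(["review:%s" % rid for rid in user["reviews"]])
      books = kv.multi_get(["book:%s" % r["book_id"] for r in reviews])
      return list(zip(reviews, books))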


I think that his argument is a bit flawed. Just because you can come up with one argument for why joins don't work doesn't mean they are never the best solution. I know that for my day job they are the correct solution, because I need the data to be transactional and correct. I shudder to think how much effort it would be to port them to a key/value store and maintain the same levels of data integrity.

I don't know what 'I need the data to be transactional' means. One way to get isolation is to put a transaction in one document. Often this is easier and more flexible than trying to take the car to pieces every time you park it. It is also more efficient in network bandwidth.

Everyone wants their data to be correct for some definition of correct. Normalized data is incorrect if the expectation is that relationships are preserved as the author created them, rather than re-created as they would be given the current database state.

For example, say a financial transaction record has an account holder and their related address.

For a naive implementation, if the address on the record should stay as it was when the transaction was created then:

  * The normalized implementation is wrong 
  * The de-normalized implementation is right
If the address should be resolved at query time:

  * The normalized implementation is right
  * The de-normalized implementation is wrong
A lot of relational tools (foreign keys, unique keys) become more complex with temporal schemas, which address these kinds of issues.

Although I understand your point, I think you're confusing things by saying that you will get different answers depending on the degree of normalisation. Strictly speaking, if two databases differ only in their level of normalisation, then they will store the same information, just in different places. When you update the denormalised version, you will have to update the same values in multiple places.

If you don't want to update that value everywhere, then what you have isn't denormalisation, it is actually a different data model, which stores different information. Since you wouldn't be able to get to the "normalised" version you describe without losing information, it is probably wrong to call it "denormalised" in the first place.


I can only conclude that this is an ironic statement implying that we need database joins to keep a web server responsive...

Meh. If you don't have "database joins", meaning SQL queries that have the word JOIN in them, then you end up effectively replicating the join in your glue code somewhere (either on the insert side or on the query side). And if you put the join in the code, it is algorithmically fixed; there is no chance that the database can optimise this for you.

On the other hand, if you write your query in SQL on a moderately normalised database, you have the potential for the database itself to figure out what to denormalise, and what to parallelise, and any other sorts of optimisations it might do. It just has to know what the typical queries are.

There's a lot of power in using a declarative rather than an imperative paradigm.
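
As a rough illustration of that contrast, here is a sketch using Python's sqlite3 module and made-up authors/articles tables: the first version hard-codes the join strategy in the glue code, the second leaves it to the query planner.

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
      CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
      CREATE TABLE articles (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
  """)

  # Imperative: the nested-loop strategy is fixed in application code.
  def titles_by_author_app(conn):
      out = []
      for author_id, name in conn.execute("SELECT id, name FROM authors"):
          for (title,) in conn.execute(
                  "SELECT title FROM articles WHERE author_id = ?", (author_id,)):
              out.append((name, title))
      return out

  # Declarative: the planner picks the join strategy and can use indexes.
  def titles_by_author_sql(conn):
      return conn.execute(
          "SELECT a.name, ar.title"
          " FROM authors a JOIN articles ar ON ar.author_id = a.id").fetchall()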


I looked into using some kind of NoSQL solution at work, and one thing puzzles me. Assume a key/value database, where values are JSON documents or something similar.

Using such a database, with the key being customer account ID, and the value being a document that contains all the information about that customer's account would be convenient and useful. Information about the account would include things like name, address, email, status of the account, when their subscription to our service expires, re-billing information, all their purchases, and so on.

This would definitely be a lot more convenient than our current SQL-based solution for two of the three things we do with our customer data.

One of those things is display a customer's information for our customer support people when they are dealing with the customer on the phone, or working on a ticket in our help desk. With the NoSQL, it would be a simple get the JSON document for that customer, and display it nicely.

The second thing is providing information to the installed software on the end user's computer. Our software provides a service that can partly be done locally, and partly requires back end support. All the back end information needed to support a given customer's installed software would be in that JSON document, so again the key/value store would work great.

However, I'm at a loss as to how the third thing would be handled efficiently: reporting.

For instance, suppose I'm asked for a quick report on how well a new SKU has sold over the last week. With our SQL database, this is a trivial select from the "orders" and "line_items" tables. It is almost instantaneous. With the NoSQL solution, it would appear that I would have to go look at every single customer's entry.

All in all, we've got a couple dozen regular reports, many running hourly, and a few being interactive on-demand via a web interface, that are fairly straightforward and efficient in our SQL-based solution (some might involve joining 6 or 7 tables, but it is still straightforward and efficient).

Thus, at first glance, it would seem that going to NoSQL would be a nightmare for my reports. Many times an hour, it would appear that the database would get slammed by reports that would have to be accessing every freaking record in the database. That doesn't sound like something that's going to make me happy.

How do people deal with reporting when using NoSQL?


Why not keep a count running in the SKU's document?

That could work. However, what happens when I'm asked to break down my report by whether the purchasers of the SKU were new customers or returning customers? (Also, a count is not sufficient, since someone might want the report run for a specific date range).

I could keep two counts in the SKUs document, of course. It doesn't sound very efficient, though, if the procedure when asked for some new report is that I have to go modify the database document layout to specifically support that new report, then do some kind of back fill, and then modify the code that processes orders to fill in the new stuff as orders are processed.

As I said, I didn't get far into NoSQL, so maybe it is a lot easier than it seems to me now.


If an SQL (or any) database is fast, there are a few possibilities:

   1. There is an index
   2. The database is in memory and a scan is running in process
   3. The query result was precomputed
   4. The query was not that hard
With CouchDB you could create a view keyed on SKU and week, and reduce to a count.
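
CouchDB views are actually written in JavaScript, but the map/reduce logic being suggested is roughly the following, sketched here in Python over an invented list of order documents with week and line_items fields:

  from collections import defaultdict

  def sku_week_counts(order_docs):
      # "map": emit a ((sku, week), 1) pair per line item;
      # "reduce": sum the emitted values per key.
      counts = defaultdict(int)
      for doc in order_docs:
          for item in doc.get("line_items", []):
              counts[(item["sku"], doc["week"])] += 1
      return counts

In CouchDB itself this would live in a design-document view that the database maintains incrementally, so the report does not rescan every customer document each hour.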

I'll have to give couchdb a look.

Thought experiment: imagine you are going to port Bugzilla to a NoSQL solution. How would you do it? There are some pretty obvious relationships between bugs, authors, commenters and comments.

Imagine a bug that has 100 comments from as many commenters. A normalized solution with no joins is going to require at least 200 round trips to look up all comment and commenter information, so the need to denormalize is pretty obvious.

So let's say you denormalize all the way and save all relevant information in the bug document. Each bug has comments embedded in it, and each comment has all of the commenter's information in it. Now each bug requires one lookup, which is awesome, but what about the fact that commenter information can change? Sure, it's a somewhat rare event, but users can have thousands and thousands of comments. When a user changes their email address, do you tell them "this might take a few hours, in the meantime I hope you still have access to your old address"? What kind of load do those massive updates place on your system? Unlike reads, which are fairly steady and regular, these high-load queries will be sporadic. How do you plan for and accommodate the spikes?
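
Here is a rough sketch of the fan-out such an update implies, with an invented kv client and document shapes chosen purely to illustrate the point:

  # Hypothetical layout: bug documents embed full commenter info, and a
  # separate index lists every bug a user has commented on.
  #   bug:1       -> {"title": "...", "comments": [{"user_id": 7, "email": "...", "text": "..."}]}
  #   user_bugs:7 -> [1, 5, 42, ...]
  def change_email(kv, user_id, new_email):
      for bug_id in kv.get("user_bugs:%s" % user_id):
          bug = kv.get("bug:%s" % bug_id)
          for comment in bug["comments"]:
              if comment["user_id"] == user_id:
                  comment["email"] = new_email
          kv.put("bug:%s" % bug_id, bug)  # one read-modify-write per bug ever touched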


Even with millions of records where the user's email needs to be updated - that's not going to take longer than a couple of minutes.

Then again, another solution would be to keep an "invalidated" entities list. It will never be very long; just keep all unfinished updates there. The bug report page then goes up to 2 queries (bug + invalidated info list), plus any number of entries you need to request by their actual ids.


One idea is that if an email address is updated the bugs are not updated straight away (inconsistency). They are updated as load on the system allows. Then you control the spikes as you see fit, to match your performance requirements.

Another idea is that you keep a list of un-applied changes and post-process with those (fixing the inconsistency). Banks used to do this when I was at university (balance = overnight balance + today's transactions, plus an overnight batch to get the balance up to date).

If your original system was normalized and did not fit in memory, then the joins cost (disk_access_time * items), which could have been 2 seconds of disk I/O. The notion that a db round trip is worse than a disk I/O is just wrong.
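
A minimal sketch of the un-applied-changes idea, again with an invented kv client, resolving pending changes at read time in the same spirit as the overnight-balance example:

  def effective_document(kv, doc_id):
      # Base document plus any not-yet-applied changes, merged at read time.
      doc = kv.get(doc_id)
      for change in kv.get("pending:%s" % doc_id) or []:
          doc.update(change)
      return doc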


There's a solution for when data doesn't fit in memory on one box: it's called sharding. Anyway, not trying to make the case that one round trip is worse, but certainly hundreds or thousands can be.

Sharding does not allow arbitrary joins to work efficiently. This is a really big issue with the relational model for data and flocks of cheap hardware for storage.

Update: people are offering suggestions of how to work around the problems I brought up. My point isn't that the problems are insurmountable, only that it adds (often unnecessary) complexity and opportunities for inconsistency to the system, and the suggested solutions certainly help to illustrate that.

It's also easy to look at this from an individualist perspective and think "well, I won't screw that up", and maybe you're right. But what about every single developer in your organization? Are they all going to consider the complexities involved in this system? Are they even guaranteed to be aware of them?


Sets up a straw-man:

  Article table like:
  authorID, authorName, publisher, title
Where an article only ever has one author. Then proceeds to knock it down (summarized and slightly satirized):

>That's not even in second normal form. /makes an Author table. Voilà! Future name changes are free.

>But wait! We don't change author names on published articles! * gasp * we don't need joins!

No, you've just demonstrated that blind optimizations are sometimes incorrect. Duh. They are correct in this respect, however:

>The dogma of normalization [too] often leads to over-engineering.

But all dogmatic adherence to X results in sometimes-incorrect use of X. This is true everywhere, not just for RDBMS normalization.


This seems to overlook the really hard part where joins become important, which is for things that need to be transactional, such as how much money I have in my bank account. I might not care if it takes 30 seconds for my change of name to propagate to 300 different papers I have written, but it is crucially important that I have only one total for my bank account and that it is never allowed to fall below zero (simplistically).

We need database joins so that we can derive new, interesting propositions from the data already in the database.

A database is a set of facts about something. Joins are the logical reasoning that lets you deduce new things.

If you never need to find out new things, then you don't need joins. So you can get away without joins when you have a simple system that is doing repetitive work. You can't avoid them when you are doing complex analysis or presenting data in a variety of different ways.

