This discusses the "how" but not the "why." I'm not a SQL expert, so I find it far easier to maintain my code in a language I am familiar with than to maintain (seemingly cryptic) SQL queries, expressions, and so on. Is there a quantifiable performance gain from moving more operations into SQL rather than implementing them in my language of choice?
The conclusion states:
> Exploit the power of SQL to write less code because “the best code is the code that was never written”. If it is not written there is no need to maintain it.
But they're wrong, since you still need to maintain all of the SQL implementations...
Edit: I previously said "I'm not a DBA", but some folks seem to be getting hung up on that rather than the rest of my comments/questions.
"we always seem to try to re-create SQL in those languages (e.g. Hive, Presto, KSQL, etc)."
This is largely because of the number of non-programmers who know SQL. Add an SQL layer on top of your non-SQL database and you instantly open up a wide variety of reporting & analytics functionality to PMs, data scientists, business analysts, finance people, librarians (seriously! I have a couple librarian-as-in-dead-trees friends who know SQL), scientists, etc.
In some ways this is too bad because SQL sucks as a language for many reasons (very clumsily compositional; verbose; duplicates math expressions & string manipulation of the host language, poorly; poor support for trees & graphs; easy to write insecure code; doesn't express the full relational algebra), but if it didn't suck in those ways it probably wouldn't have proven learnable by all those other professions that make it so popular.
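To make the "clumsily compositional" point concrete: standard SQL won't let you reference a select-list alias in the WHERE clause, so even a trivial derived expression has to be repeated or pushed into a subquery or CTE. A minimal illustration with a hypothetical order_items table:

```sql
-- The alias line_total cannot be referenced in WHERE,
-- so the expression must be written twice.
SELECT price * quantity AS line_total
FROM   order_items
WHERE  price * quantity > 100;
```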
I'd argue that adding SQL into the mix makes a codebase more difficult to maintain: mixed-language codebases are almost by definition complex, and you get significant chafing when mixing a declarative language like SQL with OOP.
Since this is a no-update, no-live-insert scenario we're talking about, it's fairly easy to produce code that is an order of magnitude faster than a DBMS: databases are primarily optimized for reading efficiently off disk (an in-memory hash table beats a B-tree every day of the week), they have really unfortunate CPU cache characteristics, and they still need to acquire read locks.
"Use our tool's language instead of the other tool's language because you don't understand the other tool's language well enough to use it effectively and efficiently."
Uh, what?
SQL allows you to do many things more effectively and faster than application code. It's more than just a persistent store; it uses set-based math to do things faster. As an example, aggregation can be faster in the database if your data is the right shape (via GROUP BY).
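A minimal sketch of that, assuming a hypothetical orders table with customer_id, amount, and placed_at columns: the database aggregates in a single pass, instead of the application fetching every row and summing in a loop.

```sql
-- One round trip, one result row per customer; no per-row
-- transfer to the application just to add numbers up.
SELECT customer_id,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_spent
FROM   orders
WHERE  placed_at >= DATE '2024-01-01'
GROUP  BY customer_id;
```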
The fact is that, line for line, SQL is vastly more efficient for reporting than most programming languages. But since few programmers treat it as a real language, it tends to be written without formatting, which makes it hard to read.
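To illustrate, here is the same made-up report written the way it too often lands in a codebase, and then formatted as if SQL were a real language:

```sql
-- Unformatted, as too often committed:
select c.name, sum(o.amount) total from customers c join orders o on o.customer_id=c.id where o.status='shipped' group by c.name order by total desc;

-- The same query, formatted:
SELECT   c.name,
         SUM(o.amount) AS total
FROM     customers c
JOIN     orders o ON o.customer_id = c.id
WHERE    o.status = 'shipped'
GROUP BY c.name
ORDER BY total DESC;
```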
Unfamiliarity will also make SQL feel inefficient to non-SQL programmers. The experience of rewriting SQL teaches some people this lesson. But others find that the familiarity of the language of the rewrite makes it seem better to them, even though it is objectively worse, in a way that only becomes visible when someone needs to optimize it for performance reasons. And it is seldom the programmer who did the inefficient rewrite who has the skills to optimize it back to what the SQL did in the first place!
The real problem is that you're almost always shifting work from a language that you and/or your team know and understand well to one that is less well understood, or even poorly understood, and you end up incurring the cost of novice programmers, which can be a real problem for both security and performance.
If you have good knowledge of and experience with the language your preferred flavor of SQL implements, that's good. If you just have people who understand how to optimize schemas and queries, you might find that you run into some of the same problems as if you shelled out to somewhat large and complex bash scripts. The value of doing so over using your core app language is debatable.
You can write code in any language without defining functions or modularizing in any way. You can write bad code in any language.
For some reason, folks assume SQL must be written badly since they have only written it badly or seen it written so.
It is absolutely possible and preferable to write maintainable SQL logic into user defined functions, stored procedures, views, materialized views, CTEs, temporary tables, etc. If you're looking at one huge pile of monolithic, untestable SQL, the problem isn't the SQL.
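As a rough sketch of that kind of decomposition (table and column names are made up): a view captures one reusable, testable piece of logic, and a CTE names an intermediate step instead of burying it in nested subqueries.

```sql
-- A view holds one reusable, independently testable definition...
CREATE VIEW active_customers AS
SELECT id, name
FROM   customers
WHERE  cancelled_at IS NULL;

-- ...and a CTE names an intermediate step in a larger query.
WITH recent_orders AS (
    SELECT customer_id, SUM(amount) AS total
    FROM   orders
    WHERE  placed_at >= CURRENT_DATE - INTERVAL '30' DAY
    GROUP  BY customer_id
)
SELECT ac.name, ro.total
FROM   active_customers ac
JOIN   recent_orders    ro ON ro.customer_id = ac.id;
```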
One doesn't write O(n^3) algorithms in C++ and then blame C++ for being slow. Yet for some reason, folks seem pleased to do exactly that with SQL, every day and twice on Sunday.
Got subselects in each of the fifteen outer joins with NULLs all over your schema, and now you're upset performance is horrible and inconsistent? PEBKAC.
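For what it's worth, here is a hedged sketch of the usual fix, with hypothetical customers and orders tables. Many planners can decorrelate the first form on their own, but not reliably; the set-based join gives the optimizer something it can always work with.

```sql
-- Anti-pattern: a correlated subselect that may run once per outer row.
SELECT c.id,
       (SELECT MAX(o.placed_at)
        FROM   orders o
        WHERE  o.customer_id = c.id) AS last_order
FROM   customers c;

-- Same result (including NULL for customers with no orders),
-- expressed as a single set-based join.
SELECT    c.id, MAX(o.placed_at) AS last_order
FROM      customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY  c.id;
```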
Normally the reason you write anything beyond trivial SQL is that you have only a small amount of code to run and lots of data to run it over. Pushing the code to the data is more efficient than pulling the data to the code.
The latter might be conceptually cleaner (though it's debatable, relational is a fairly nice programming model and a lot more consistent and well-founded than object orientation, for one), but it's seldom optimal.
Speedups of three orders of magnitude or more are not unexpected when you push the code to the data.
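A minimal illustration, assuming a hypothetical events table: the first query ships a single result row over the wire, while the pull-the-data alternative ships every matching row so the application can filter and average them itself.

```sql
-- Pushing the code to the data: one small query, one row returned.
SELECT COUNT(*), AVG(duration_ms)
FROM   events
WHERE  event_type = 'checkout';

-- Pulling the data to the code would instead be something like
--   SELECT event_type, duration_ms FROM events;
-- followed by filtering and averaging millions of rows in application memory.
```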
I use it a lot because I'm too lazy to learn new syntax every time I work with a new framework or language. SQL stays the same, so it's more transferable. Besides, once you become an advanced SQL user, you find queries that are harder to build in code than in SQL. That happened to me a lot when writing Apache Spark code.
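Window functions are a typical example. This sketch (hypothetical orders table) computes a per-customer running total in one clause, and the same syntax works in Spark SQL; the imperative equivalent needs sorting, grouping, and mutable accumulators.

```sql
-- Running total per customer, ordered by purchase time.
SELECT customer_id,
       placed_at,
       amount,
       SUM(amount) OVER (PARTITION BY customer_id
                         ORDER BY placed_at) AS running_total
FROM   orders;
```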
Dropping down to SQL to write a really complex query is, in my professional experience, always a poor use of time. It's far simpler to just write the dumb for-loops over your data, if you can access it.
Of course not all engineers can operate with SQL as efficiently as code -- that's the whole point. Otherwise why would we be writing code? Learning SQL intimately doesn't change that fact.
I used to write a lot of PL/SQL and your assertion is simplistic at best, and misleading at worst.
First off, it depends on what processing you are doing and at what data sizes. If you are doing relatively simple things, sure. But remember that PL/SQL isn't particularly fast if you need to do anything more complicated. So don't do anything complicated (we'll get to that in point three).
Second, someone has to write that code. Writing PL/SQL is laborious because the tooling is bad. Pick any half-decent language today and the tooling is going to be fantastic. You have language servers, you have optimizing compilers, you have profilers, you have debuggers, you have lots of ways to instrument, inspect, and verify code. Where do you plan to recruit PL/SQL developers and how much are you going to pay them? Because in my experience: there really aren't that many usefully competent PL/SQL developers available on the market. There are lots of people willing to "give it a go", but competence is rare.
Third, it complicates your database. Instead of just doing relational math and executing queries, it now also runs code whose load is harder to model, on the same cores, eating into your database's IO and memory budget and causing lots of variability. Sure, you can hound your PL/SQL developers to adopt various tricks to lower that variability, but that gets expensive. And the code gets ugly.
Fourth: it makes no sense to apply extra load to your most expensive component, which also happens to be the component class that most often ends up being the bottleneck of your entire system. (Cloud provider pricing is a somewhat useful proxy for the cost of different systems; have a look at the cost of RDS capacity compared to EC2.)
Fifth: did you measure, or did you assume? In the cases where you measured a clear gain, was it still worth the cost of all of the above? Did you measure the right thing? Could you argue your case with numbers?
There are plenty of devs out there perfectly capable of writing SQL (and plenty of DBAs who are actually just developers). It's not rocket science, and it's entirely grounded in CS.
It's very interesting how many application developers shun SQL. I believe a lot of it is due to a pervading sentiment that SQL is unruly or inelegant. There are certainly quirks to the language, and it is a difficult transition to think in a declarative rather than imperative manner, but once you make the jump, it's an invaluable skill to have. Fortunately, I have the luxury of being able to manipulate our codebase from any level of the stack. What this means is that rather than jump through hoops on the API or frontend side to accomplish some task that would require reams of code, I can just get my output via a simple query, and it will perform faster in almost all cases.
From an analytics point of view, I can't imagine not using SQL. I've seen people pull reports from multiple websites, text files, etc., spend an entire day manipulating them in Excel, and still not get their data model working as expected, not to mention that it is very slow. A couple of queries with some temp tables and voila, magic happens. It really does make you look like a superhero when you can deliver more accurate results in a fraction of the time it originally took. I'm surprised there isn't more of a market for this skill; surely there are a lot of programmers from the '80s and '90s who have this skillset in abundance.
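A sketch of that temp-table workflow with made-up names (the CREATE TEMPORARY TABLE ... AS form below is PostgreSQL/MySQL syntax; other databases spell it differently):

```sql
-- Stage the messy extract once...
CREATE TEMPORARY TABLE staged_sales AS
SELECT region, product, sale_date, revenue
FROM   raw_import
WHERE  revenue IS NOT NULL;

-- ...then each report is a short query over the staged data.
SELECT   region, SUM(revenue) AS total
FROM     staged_sales
WHERE    sale_date >= DATE '2024-01-01'
GROUP BY region
ORDER BY total DESC;
```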
I agree that SQL is a brilliant data-processing language, while C#, Java, or C++ are terrible at it. And there are obvious benefits to learning SQL and being able to use it efficiently; IMHO that makes much more sense than throwing away an RDBMS because it's 'slow'.
But if you write applications in an OO language, objects are quite a natural way of representing both data and logic. Sometimes you do application-level transaction management or use the mapping metadata to implement higher-level features; sometimes you have to integrate application events with data operations, or you need your data objects to live outside of a database transaction. An ORM saves you lots of coding in such situations, and it adds services like caching, prefetching, and intercept/triggers so you don't have to invent them.
The argument is not that you should use application code where SQL is more efficient. The argument is that the SQL shouldn't live in the database itself as stored procedures separate from your application code.
There's also the old/current method of writing SQL by hand. And you're going to have a hard time convincing me that adding another layer of code to anything usually removes bugs. More code = more bugs. Not always, but just about.