1) It's rare to have enough insight into the internals of a particular datastore to accurately predict how it will perform on a particular workload. Whenever possible, early testing on production-scale workloads is essential for planning and proofs of concept.
2) Database capabilities are a moving target. E.g., the performance improvements to pgsql's LISTEN/NOTIFY are essential to its ability to handle this particular workload. In previous jobs, I've had coworkers cite failed experiences with 15-20 year-old databases as reasons for not considering them for new projects. Database tech has come a long way in that time.
3) Carefully-tuned RDBMSs are more capable than many tend to admit.
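For readers who haven't used the LISTEN/NOTIFY feature mentioned in point 2, a minimal sketch of the mechanism (the table and channel names here are invented, not from the post):

```sql
-- A trigger that notifies any listening clients whenever a row changes.
CREATE OR REPLACE FUNCTION notify_task_change() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('task_events', NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER task_change_notify
  AFTER INSERT OR UPDATE ON tasks
  FOR EACH ROW EXECUTE PROCEDURE notify_task_change();

-- A worker subscribes with:
LISTEN task_events;
```

Clients then receive asynchronous notifications over their existing connection instead of polling the table.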
I think you're right about number 3. I have a friend who likes to do as much as possible with pgsql and I'm often surprised at how well it performs for the things he uses it for.
I'm not sure if it's fair to read your parent as putting business logic in the db: it may just use Postgres whenever a datastore is required and Postgres is not a bad fit.
Deciding where business/application logic should reside is an important decision, and treating PostgreSQL as an application server in addition to a database does require a different way of approaching the tool. Of course, this is no different than any of the other application architecture choices that need to be made.
I am not sure I understand your comment either :)
My point was that people (read web devs) treat databases (postgres) as only data stores, put in data, get data out, and they are so much more. Sure not anything goes into the db, but if it's a bit of logic that relates only to the data (who can access it, who can change it and in what way), most likely it belongs in the database.
PS: but let's not derail this thread with debates on where business logic should go :)
I think I agree with you that there are some things that belong in the database. For example, anything I can do in the database to ensure data integrity (without unduly affecting performance), I'll put there (data types, constraints, triggers, functions which encapsulate transactions). Similarly with access privileges.
My initial comment was in response to these statements:
I have a friend who likes to do as much as possible with pgsql and I'm often surprised at how well it performs for the things he uses it for.
I understood you to read this as putting business logic in the database (indicated by the "And yet") in your comment:
And yet i hear day after day, don't put business logic in the db
I read the initial comment by 'fapjacks as meaning using Postgres whenever appropriate, as opposed to doing everything one could with Postgres in Postgres.
Not a big deal at all. I saw a potential misunderstanding and attempted to clarify.
PS: Of course, business logic should go where I put it, right? ;)
You are right in both cases: yes, I read that comment the way you described, and yes, that is probably not what 'fapjacks meant.
As for the beginning of your comment, I am 100% with you. The problem is that the cases you describe (integrity, access, reports on data) are most of the time what the entire application does (we are all just doing CRUD and all that...). Instead of seeing that for what it is (let's call it "data logic"), people all too often call it business logic, and the moment they put that label on it, their brains (after years of conditioning) don't even begin to consider the database as the appropriate place for that type of logic.
Isn't it harder to create well designed abstractions for business logic in SQL compared to a full-fledged backend application server?
I know SQL has stored procedures and the like, but is it correct that SQL is limited in how it can express domain logic, compared to languages that backend application servers are generally written in?
The point I was making was more about logic in the database, not specifically in pure SQL; it can be PL/pgSQL, Python... it does not matter. For the tasks that database-side code should be used for (data consistency, access rights, reports on data), I would say it can express the domain logic quite nicely, even better than other languages can.
For example, I can say "grant select on invoices to accountant", or I can add a constraint on a table like "CHECK (price > discounted_price)" or "CHECK (price > 0)". I do that in one place and I no longer have to worry about checking this (simple example of) domain logic anywhere else.
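Spelled out as runnable DDL (using the role and table names from the examples above; a sketch, not production DDL):

```sql
CREATE ROLE accountant;

CREATE TABLE invoices (
  id               serial PRIMARY KEY,
  price            numeric NOT NULL CHECK (price > 0),
  discounted_price numeric NOT NULL,
  CHECK (price > discounted_price)
);

-- Declare the access rule once; every client goes through it.
GRANT SELECT ON invoices TO accountant;
```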
Most of the time this is what people call domain logic (consistency/access/reports), although I think a better name for it would be "data logic" (I heard that somewhere), and in a lot of cases this is almost everything the application does.
> I know SQL has stored procedures and the like, but is it correct that SQL is limited in how it can express domain logic, compared to languages that backend application servers are generally written in?
Procedural extensions aren't, really. I mean you can literally just use Python (PL/Python) or JavaScript (PLV8) to write Postgres functions.
At one dev position I had, the DBA wrote stored procedures for everything I could possibly want to do with the database, and that was the only way I interacted with it period. I wrote a small wrapper around these stored procedures and the records they returned, and used it exclusively.
It was a much more pleasant experience than having no stored procedures and a full-blown ORM, which IMO inevitably leads to silliness like doing inner joins at the UI level.
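As a sketch of that pattern (the function and table names here are invented), the database exposes a function per operation and the application calls only those:

```sql
-- Hypothetical API surface: callers never touch the tables directly.
CREATE FUNCTION get_orders_for_customer(p_customer_id int)
RETURNS TABLE (order_id int, placed_at timestamptz, total numeric) AS $$
  SELECT id, placed_at, total
  FROM orders
  WHERE customer_id = p_customer_id
  ORDER BY placed_at DESC;
$$ LANGUAGE sql STABLE;

-- Application code reduces to:
--   SELECT * FROM get_orders_for_customer(42);
```

The thin wrapper in the application then only has to map function calls and returned records, not reimplement query logic.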
We store settings based on a hierarchy in the database using JSONB. Basically:
global -> country -> account -> user
PostgreSQL doesn't support deep merging of JSON docs, so I just wrote an aggregate function using PLV8 to merge them. (The query always returns ~10 records, which are then grouped and merged, so perf is always fast.)
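The commenter's actual function isn't shown; as an illustrative sketch (assuming the plv8 extension is installed), such an aggregate might look like:

```sql
-- Recursively merge two JSON objects; values from the second win on conflict.
CREATE FUNCTION jsonb_deep_merge_state(a jsonb, b jsonb) RETURNS jsonb AS $$
  function merge(x, y) {
    if (x === null || typeof x !== 'object' || Array.isArray(x)) return y;
    var out = {};
    for (var k in x) out[k] = x[k];
    for (var k in y) out[k] = merge(out[k] !== undefined ? out[k] : null, y[k]);
    return out;
  }
  return merge(a || {}, b || {});
$$ LANGUAGE plv8 IMMUTABLE;

CREATE AGGREGATE jsonb_deep_merge(jsonb) (
  SFUNC = jsonb_deep_merge_state,
  STYPE = jsonb,
  INITCOND = '{}'
);

-- Usage (ordering controls precedence, e.g. global -> country -> account -> user):
--   SELECT jsonb_deep_merge(options ORDER BY precedence) FROM settings_rows;
```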
I also had to write JSONB deep merging as part of this RethinkDB --> PostgreSQL rewrite. Here's the code I wrote
https://github.com/sagemathinc/smc/blob/ce594ff0574ce781bf78...
That code constructs a single complicated query; I wonder whether I should write it as a stored function in the database instead. (I'm the author of the blog post.)
I've deliberately avoided deep arrays for settings, so my jsonb settings merge is simply:
SELECT COALESCE(companies.options, '{}'::jsonb) || COALESCE(groups.options, '{}'::jsonb) || COALESCE(users.options, '{}'::jsonb)
FROM users
JOIN companies ON users.company_id = companies.id
LEFT JOIN groups ON users.group_id = groups.id
WHERE users.id = ?
Where I need a hierarchy for key names I'm just prefixing.
I sit right on the knife's edge of tossing all the jsonb options and replacing them with an old-school normalized EAV model.
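For reference, the EAV alternative would look something like this (a sketch with invented names; one row per setting instead of one jsonb document per owner):

```sql
-- Entity-attribute-value: every option becomes its own row.
CREATE TABLE options (
  owner_type text NOT NULL CHECK (owner_type IN ('company', 'group', 'user')),
  owner_id   int  NOT NULL,
  key        text NOT NULL,
  value      text NOT NULL,
  PRIMARY KEY (owner_type, owner_id, key)
);

-- Precedence is then resolved per key (e.g. with DISTINCT ON over a
-- priority ordering) instead of with the jsonb || operator.
```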
> Carefully-tuned RDBMSs are more capable than many tend to admit.
I'd generalize that further: RDBMSs are more capable than many tend to admit.
When it comes to a persistent data store, you've got to go out of your way to justify using something that isn't statically typed with firm ACID guarantees. I'm not saying those use cases don't exist, I'm saying most people don't have them.
"Saving" 15 minutes of dev time because you don't know what your schema is going to look like is going to cost you orders of magnitude more time down the road asking yourself that same question.
Posts like this are annoying and completely stupid.
The NoSQL label encompasses a wide variety of data stores, including key-value (Riak), BigTable-style (Cassandra), document (MongoDB), time series (InfluxDB), in-memory grid (Ignite), and dozens and dozens of others that blur the already grey lines. Many of them are strongly consistent (negating your consistency point). So your experience with one does not at all translate to other data stores.
The equivalent of what you're saying is "PHP sucks therefore all programming languages suck".
I think the point of the post is valid: it can be hard to work with a schemaless datastore. I think Cassandra is the only one you listed that supports type enforcement.
That mantra hasn't been true for a number of years now. Especially since the big trend has been exposing SQL layers in front of NoSQL databases. For example Phoenix in front of HBase.
And SQL, as we all know, requires data types to be usable.
Just to explain - I was going to write Mongo originally but then decided to go more general. I evidently went too general with my terminology. I've used Mongo since the early days, and I've also worked on projects that have used everything from Neo4j and Redis to Oracle as the primary data store.
There are a bunch of things that made me nervous about mongo as the primary db. Data consistency was the main one and, like the author of the post - migrating away from mongo was hard almost entirely because of data inconsistency that had developed over the course of a couple of years.
Querying was also harder on an ad-hoc basis. With a relational db there are generally just a couple of recommended ways of modelling your data. But once you do that, it's easy to query after the fact. Or to add new stuff. With mongo I felt like I was constantly making a trade off that I would have to accept at querying time.
RDBMSs may not scale as well as, say, Cassandra, but as long as you don't need Cassandra's scaling properties they make your life so much easier.
I have a feeling that premature scaling is a version of premature optimization. You probably don't have big data. And you probably don't need a massively scalable, distributed system. So, don't try to do it. Just don't.
The simple fact that I know the data I read from the datastore is consistent with my rules, is enough for me to stay on RDBMS. I've encountered enough bugs and failures to not trust a schema enforced in my data layer code.
> "Saving" 15 minutes of dev time because you don't know what your schema is going to look like is going to cost you orders of magnitude more time.
While I agree with the spirit of this, sometimes you'll never know what your schema is, because it's not under your control to begin with. That's why there are triple-stores and "document" databases and key-value stores now, and why there were deductive databases 30 years ago.
The world is wide, and there's room for more than one tool in the chest.
This assumes that JSON is what you want to store and query. In my experience, about 1 year ago, I couldn't get basic things out from the Puppet database.
Even those use cases can sometimes be met by an RDBMS. Postgres can outperform Mongo as a key value, or json doc store. But Mongo is what many people think of for something like that.
And for those who may not know, but appreciate irony, Mongo now uses ... wait for it ... a somewhat stripped down, key-value-ish, embedded ... RDBMS as its underlying data store. Yes friends, "it's a dessert topping AND a floor wax"
Should also point out that Mongo is more than a jsonb store, and Postgres is, at least currently, less than a distributed, "scale-out" database. Mongo supports problems that Postgres cannot currently address. Period. (And I'm a huge Postgres fanboi.)
Horizontal scalability. Outgrowing a single master database (writes being bottlenecked, not reads) is really painful. You can engineer around it, but spinning up another master in a multimaster DB is a really quick and easy fix (once you have tooling built out it's a matter of minutes). Sharding your Postgres DB is not a quick and easy fix. It will likely take about a month to engineer around and perform the data migration, if not longer.
I've pretty much only heard about it in the context of OLAP. If your data model requires transactions, you're gonna have a hard time when you outgrow a database.
Because as someone who has tried to update JSON documents (1M/s) in both PostgreSQL and MongoDB, I would strongly disagree. MongoDB with WiredTiger was at least 10x faster per node, and in terms of pure updates one of the fastest databases I've ever seen. And before the trolls arrive: yes, I was fsyncing to disk.
Postgres doesn't fancy many small and frequent updates (like rapid counter increments).
It is workable, though, via a combination of batching, ensuring indexes on hot fields have an appropriate fillfactor, and making sure Postgres can use heap-only tuples.
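A sketch of the fillfactor tuning mentioned (table name invented): leaving free space on each heap page lets updated row versions stay on the same page, which is what enables heap-only-tuple (HOT) updates.

```sql
-- Leave 30% of each heap page free so updated rows can stay on-page.
CREATE TABLE counters (
  id   bigint PRIMARY KEY,
  hits bigint NOT NULL DEFAULT 0
) WITH (fillfactor = 70);

-- HOT updates only apply when no indexed column changes,
-- so avoid indexing the frequently-updated column itself.
UPDATE counters SET hits = hits + 1 WHERE id = 1;
```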
Here's my old SO question on this, with a comprehensive answer that's still relevant and could be helpful in some cases: http://stackoverflow.com/a/1663434/116546
Not worth much. MongoDB is a completely different beast after version 3.0 with WiredTiger being the default.
But I am sure that there are use cases for which PostgreSQL is going to be much faster in particular single field lookups on unindexed fields across all records.
My point is that it's never black & white to say Database A is faster than Database B. It's always use case dependent.
I think it depends on the workload. PostgreSQL does not support updating documents in place, but it is really fast at writing and reading. I've never done any benchmarking myself; this is just based on my general knowledge of what data structures and algorithms the two databases implement.
This is pretty amazing. Now, would you mind showing me a situation when the raw insert performance is the single dimension when we are making a technical decision?
Hint: you probably can't. The reason you store data is so that you can read it (query it) later. MongoDB could be 10000x faster at inserts and it still wouldn't be a solution we'd consider, because it falls short on many of the other aspects of providing a great data access layer.
Raw insert performance is the main driver for tick databases persisting high frequency market data for HFT systems. The TimesTen DB, later bought by Oracle, was built for this scenario. Arthur Whitney's KDB too.
Not the single dimension, but an important one. Any real-time application processing event data (IoT, heartbeats, price logs) would have a similar requirement. That's one reason people use queues: to try to work around this. As for reads, it comes down to your query needs. I have built systems where the read requirement was retrieval by key and the writes came in at thousands per second.
Absolutely. I have run data ingestion pipelines for event data processing before, perfectly suited to the use cases you mentioned too. I just don't see MongoDB as a good fit here.
> "Saving" 15 minutes of dev time because you don't know what your schema is going to look like is going to cost you orders of magnitude more time down the road asking yourself that same question.
Yeah. In my experience, "schemaless" doesn't mean you have no schema. You still have a schema, it's just implicit and you don't have any tools available to actually operate on that schema.
This is pretty similar to the argument for static typing. Better to let the type system (help) prove correctness ahead of time, rather than wait until runtime to debug.
Yep. Everyone winds up with some war story about cleansing some multi-petabyte data store – but the better data engineers I know try very hard to avoid having two of them.
Yeah. I'm a huge fan of dynamically (yet strongly) typed languages like Ruby. They absolutely have their place. But weakly-typed data persistence or data interchange are absolutely terrible.
It's not just about type safety, though. A good RDBMS enforces things like referential integrity (FOREIGN KEY constraints) and lets you express further constraints on your data (e.g. order.amount must be a positive number, a certain combination of columns must be unique across the table, etc.).
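Those examples translate directly into DDL (a sketch with invented table names):

```sql
CREATE TABLE customers (
  id serial PRIMARY KEY
);

CREATE TABLE orders (
  id           serial PRIMARY KEY,
  customer_id  int NOT NULL REFERENCES customers(id),  -- referential integrity
  amount       numeric NOT NULL CHECK (amount > 0),    -- domain constraint
  external_ref text NOT NULL,
  UNIQUE (customer_id, external_ref)                   -- composite uniqueness
);
```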
And most (all?) RDBMSs that have been around for a while have been tuned and optimized to support this efficiently. I remember reading that at some point (1980s-1990s-ish) DBMS vendors were hiring compiler developers like crazy to help with query optimization and the like.
In almost all cases I've seen the schema is simply in the application layer as opposed to the database layer.
The argument for schemas within databases came from when you had dozens of users and applications all connected to one database. In the last few years we've seen microservices, Kafka queues, and REST APIs replace the database as the primary integration point.
Your schema might sit in code, but you've got to reinvent the wheel on all sorts of validations and guarantees your database would provide. Whether you have queues and APIs is irrelevant, your persistent store should be a strong one.
Schemaless design lets organizations collect data now and decide what to do with it later. Though analysts who work on these datasets should be able to write code to parse the records when needed.
> Though analysts who work on these datasets should be able to write code to parse the records when needed.
Right, so the data conforms to a known schema that they can program against. Schemaless design doesn't somehow make this possible, it actually makes it difficult. If you already know what data you're capturing, then you have a schema.
Schemaless imo is just kicking the can down the road.
You always know what kind of data it is anyway. You don't just dump random data, at least not the majority of the time. When you grab data you're grabbing a specific data group.
It's important to keep in mind that the "collect now and analyze later" is exactly that. You don't have to worry about constraints and integrity, you just capture what you think is of value, which is likely to be all the things unless you're working at really large scales. For all intents and purposes, storing everything as text files is just as good for this task.
> You don't have to worry about constraints and integrity
Sure, you don't have to specify or enforce constraints with a schema either. A schema just means the data has a well-defined structure, it doesn't mean you have to enforce constraints of any kind, though that often helps a lot. Data with no structure at all is very likely useless.
This seems to be a growing issue in the industry as we go more and more data driven.
I don't think most devs seem to realise that just throwing data into a data store, without constraints, integrity checks, or any of that other boring stuff, actually makes the data useless.
Try doing analytics against data that can be anything. You can't. Mixing numbers with strings, duplicating fields or having them under varying naming strategies, all just mean you have lots of data that you cannot use.
One of the pet peeves I see data scientists having is lots of data, with no organisation, as they can't use it.
If you want to collect data now, that's fine. Collect it in its original format and label what that format is and where it came from. Then make a proper data store with a proper schema, and work on importing the data from those sources once you have this foundation sorted out.
It's not always straightforward. A lot of protocols use delta encoding, for example "insert BUY GBPUSD, price=1.1702, volume=100000". But yes, it's worth coming up with a schema on write if at all possible.
But then you can't query schemaless data without sophisticated parsing or absurd wildcard/regex matching of some sort, which obviously is highly error-prone, thus making schemaless data at best only probabilistically useful.
For delta-encoded models, it still seems straightforward to define a general schema via (operation, objectId, value, transactionId) tuples. 'value' can be a string if you want maximum flexibility, but at least this way you can reliably query the operations, entities involved and which changes happened together as a group.
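That general schema might look like the following (a sketch; the column types are assumptions):

```sql
CREATE TABLE deltas (
  id             bigserial PRIMARY KEY,
  transaction_id bigint NOT NULL,  -- groups changes applied together
  operation      text   NOT NULL,  -- e.g. 'insert', 'update', 'delete'
  object_id      text   NOT NULL,  -- the entity the delta applies to
  value          text              -- flexible payload, parsed on read
);

-- "Which changes happened together" is now a plain, reliable query:
-- SELECT * FROM deltas WHERE transaction_id = 42 ORDER BY id;
```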
Absolutely. I agree with you, I'm just saying it's not trivial to turn a delta encoded format into a nice schema.
I mean, I would advise against storing deltas as tuples and considering that the schema. If at all possible, try to rectangularise your format. If you're using a column store, a lot of the redundant data from the denormalisation can be compressed away.
If the rectangularisation is lossy then you may need to archive the raw data. This is the hurdle where a lot of people stop and choose schema-on-read since they can't afford the storage requirements for an archive of raw data and a nice analytic format for querying.
Deferring database layout decisions doesn't give you more flexibility; it just means you have less of a clue about your database layout, unfortunately.
Collecting data in some form and re/de-normalizing it into some useful fashion isn't exactly a huge problem if you're not Netflix or Google (and isn't a huge problem even then).
The problem with standard-ish SQL+RDBMS and their schemas is that types are not enforced well in SQL and in most libraries and language integrations, so we end up with a quagmire of static schemas that cannot be reliably type-checked with standard tooling.
Also lacking in many DB systems is integrated support for tables with heterogeneous schemas, backed by page/row-level versioning and/or online schema alteration. Having to rewrite a huge table only because you want to narrow a couple of fields for future data gets old quickly.
> The problem with standard-ish SQL+RDBMS and their schemas is that types are not enforced well in SQL and in most libraries and language integrations, so we end up with a quagmire of static schemas that cannot be reliably type-checked with standard tooling.
This used to be an issue with MySQL (and not really any other major RDBMS implementation), but with STRICT_MODE and the default settings on other implementations, SQL is fantastic at representing and enforcing static types. Furthermore, ORMs and tools like Apache Spark have really upped the integration between types in the database and types in languages. Practically speaking, RDBMS is really the only way you can have sane static typing in most applications (esp. ones that are built on dynamic languages).
> Also lacking in many DB systems is integrated support for tables with heterogeneous schemas, backed by page/row-level versioning and/or online schema alteration. Having to rewrite a huge table only because you want to narrow a couple of fields for future data gets old quickly.
This is just not true. All of the major commercial RDBMS systems support both online schema alteration AND flexible/semi-structured types (JSON, XML). Again, MySQL and Postgres are behind the curve on this (although they both now support JSON). Commercial systems like SQL Server, Oracle, and MemSQL all have both.
I'm not sure if I read your comment properly. But you said SQL Server has both? Both JSON and XML?
Yes, SQL Server has JSON, but it's not real JSON support. It's an nvarchar column with some JSON functions... Support for XML is great, but the SQL Server team refuses to support JSON properly because "we invested in XML and no one uses it".
PostgreSQL, on the other hand, has had a real JSON type for a long time, with query and index support, as well as XML. In this regard, PostgreSQL is actually ahead of all RDBMSs...
PostgreSQL may be behind in some other areas, but not here. It has some of the best support for online schema changes (transactional DDL for almost everything, lots of work put into reducing the lock level of schema changes) and was the first to implement support for JSON.
Please try to alter the schema of a live multibillion-row table before claiming that it works. Unless they changed how it's implemented recently, such an operation can take hours.
I love Postgres for many reasons, so I would like to be proven wrong.
Depends on the schema change. If you have a multibillion-row table there are plenty of other things you need to pay close attention to as well, as you're probably all too aware.
Some schema changes can take prohibitively long on large tables, so you do have to be careful. But there are ways of achieving the same results that are safe for a production database. Braintree did a decent writeup on it. https://www.braintreepayments.com/blog/safe-operations-for-h...
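The general shape of those safe operations (a sketch with an invented table; exact behavior depends on your Postgres version):

```sql
-- Safe: adding a nullable column without a default is a metadata-only change.
ALTER TABLE events ADD COLUMN source text;

-- Risky: ADD COLUMN ... DEFAULT 'x' rewrites the whole table on the Postgres
-- versions current when this thread was written. Backfill in batches instead,
-- keeping each transaction (and its locks) short:
UPDATE events SET source = 'legacy'
 WHERE source IS NULL AND id BETWEEN 1 AND 100000;
-- ...repeat for each id range, then enforce the constraint:
ALTER TABLE events ALTER COLUMN source SET NOT NULL;
```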
The bigger the database, the more complex it will be to implement a schema change while keeping track of transactions that happen while you are transitioning. However, I see no reason why it shouldn't be feasible given a proper amount of preparation. And if short downtime is a goal, I don't see why spending a proper amount of time developing a solution wouldn't achieve it either.
And on top of that, the PostgreSQL community is implementing new ways to solve these problems with each release. I'm not in a huge live-data use case, so I might be wrong, but the upcoming logical replication looks promising for such use cases: https://www.depesz.com/2017/02/07/waiting-for-postgresql-10-....
I have never worked with a database of that size, but I have done major refactoring of tables with about half a billion rows. And while it can take hours for the schema changes to finish, those are hours without any downtime (or just a couple of seconds of downtime while the exclusive locks are acquired). It is tedious to have to wait that long (especially with 10 times the rows we had), but PostgreSQL can do almost any change online just fine.
What kind of schema changes do you think would require that much downtime? Even complex cases can be handled with triggers or even rules.
When doing the rewrite, I was extremely surprised by how good PostgreSQL's JSON support is; there are a large number of functions for querying nested structures, and the indexing is very powerful. I mapped my data from RethinkDB documents to mostly relational data, but kept a couple of columns as JSONB for now, which saved time and massively simplified the rewrite. I remember trying to fully model similar data back in 2007, and it got overly complicated for no real benefit...
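For readers who haven't seen it, a sketch of the kind of querying and indexing meant here (the table and document shape are invented):

```sql
CREATE TABLE projects (
  id  serial PRIMARY KEY,
  doc jsonb NOT NULL
);

-- A GIN index accelerates containment queries over the whole document.
CREATE INDEX projects_doc_idx ON projects USING gin (doc);

-- Find documents containing a nested structure...
SELECT id FROM projects WHERE doc @> '{"settings": {"public": true}}';

-- ...or pull a nested field out directly by path.
SELECT doc #>> '{owner,email}' AS owner_email FROM projects;
```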
> online schema alteration AND flexible/semi-structured types (JSON, XML).
> Again, MySQL and Postgres are behind the curve on this
You may want to remove PostgreSQL from that list, especially in 2017.
PostgreSQL has supported XML for years (long before it supported JSON). And online schema changes are not only possible, but even transaction safe, which is especially nice for migrations.
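Transactional DDL means a migration either fully applies or fully rolls back, e.g. (table name invented):

```sql
BEGIN;
ALTER TABLE users ADD COLUMN locale text;
CREATE INDEX users_locale_idx ON users (locale);
-- Any failure before COMMIT rolls back BOTH statements;
-- there is no half-migrated state to clean up by hand.
COMMIT;
```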
This problem definitely existed for certain SQL servers, where inserting incorrect values caused the server to silently cast the data to the field's declared type. MySQL had something like this 10 years ago. I think in 2017 most SQL servers handle types properly. Adding support for more advanced types like UUID, or custom types, is a great thing, and I see it used in production every day.
I don't really see the case for heterogeneous schemas; I usually create a view of two very similar tables by creating a union with null fields. This works well enough.
I'd probably change this a bit by saying that it's not a problem to have the schema in the code (it's got to be somewhere, preferably version controlled), it's a problem when your schema reflects (or even worse, is) your business logic.
I'd agree entirely. Unless you have an exceptional reason, in which case you probably know what you're doing already, build a version of your system which uses either a relational database (with a sensible schema, not a single-table EAV thing) or a filesystem as a backing store first.
(When it comes to a primary data store, its first job is integrity, so denormalization is premature optimization.)
Similarly; don't distribute any computation you don't have to distribute. Just getting a bigger machine is remarkably likely to be cheaper, more operable and more reliable.
File systems are far harder to use correctly than pretty much any database. Suggestion: only use them for storing large blobs, and use collision-free names (e.g. random UUIDv4). Many applications don't need that, though.
Sometimes you just have large blobs, though! (I'm thinking, for instance, of machine-learned models; big matrices are exactly the kind of thing we have HDF5 and friends for.)
I think it's important to distinguish two cases. If mainly one app is the storage client, then you usually have the schema in the app, and there's no problem with schemaless (most early-phase startups).
If you have many different server apps accessing the database, then everyone has a different view of the schema, which leads to high coordination overhead.
I will say this can go the other way in the hands of someone who doesn't understand what they are doing.
I recently started work at a startup where the codebase runs PHP/MySQL, which in itself is OK, but the database was horribly designed.
To deal with tracking the "events" they constantly receive from devices, which vary a lot from device to device, it uses a linker table for every field that can be stored aside from the ID. Considering the data never changes once inserted, it really seems like overkill.
Any meaningful query that goes beyond getting a single data point involves dozens of joins, which just kills the database. If someone had just made him use Mongo for the event storage, we would probably be in a better position to fix the issues we are dealing with.
Though I agree with you, there is the problem that RDBMSs can't handle hierarchical data very well. This comes up in enterprise master data quite often, e.g. in employee relationships or organizational structures. I'm no SQL wiz, but I have yet to see readable and well-performing SQL that can select the managerial line of an employee.
> I have yet to see readable and well-performing SQL that can select the managerial line of an employee.
This is not a simple problem; I'm not even sure how to map it mathematically to set theory. SQL is based on set theory, and handles that very well. What you describe is a graph problem, which simply falls outside what SQL was designed for. Either an RDBMS needs to be extended to handle data in graph form and operate on it according to graph theory, or you should use a DB suited to this problem.
Point is, ideally you want both. Master data especially is closely entangled with data you have in tables (and rightly so). In the example above, consider an order header table with a reference to an employee (the one working on the order, or the one who created it). The schema for orders is rightly relational, but I'd need a join to the employee, which is therefore stored in a relational table as well.
I don't see how that scenario is anything but relational, even for queries. Or are you talking, in extension of the previous post, about the managerial line? If so I agree, but instead of "polluting" the RDBMS with graph-related constructs, I think the option here would be to save the data in the RDBMS and either build a graph on demand or maintain an in-memory graph for queries. "Everything but the kitchen sink" is rarely a good idea for tools; it tends to make the cost of switching out a module unacceptable.
A PostgreSQL (recursive) WITH query [1] can deal with that:
    CREATE TABLE employee (
      id int primary key,
      name text,
      manager int references employee(id)
    );

    INSERT INTO employee(id, name, manager)
    VALUES (1, 'jane', null),
           (2, 'john', 1),
           (3, 'jake', 1),
           (4, 'jeff', null),
           (5, 'jessica', 3);

    WITH RECURSIVE t(manager, managed) AS (
      -- Direct managers
      SELECT manager, id
      FROM employee
      WHERE manager IS NOT NULL
    UNION ALL
      -- Indirect managers
      SELECT employee.manager, t.managed
      FROM t, employee
      WHERE t.manager = employee.id
        AND employee.manager IS NOT NULL)
    SELECT
      (SELECT name FROM employee WHERE id = t.managed),
      array_agg(employee.name) AS chain_of_command
    FROM t, employee
    WHERE t.manager = employee.id
    GROUP BY managed;
Results of the query:
name | chain_of_command
---------+------------------
jessica | {jake,jane}
jake | {jane}
john | {jane}
(3 rows)
You can argue about the readability (syntax highlighting would help), but to my eyes it isn't that bad (if you know how WITH queries work). It's also more concise than the corresponding query would be in many a procedural language.
The efficiency should be pretty acceptable too: instead of following a direct pointer to the manager in O(1), you follow it indirectly through an index in O(log N).
Recursive CTEs can be used for this kind of query. Or you can use nested sets / intervals, depending on your workload and the tradeoffs you're ready to make.
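The nested-set alternative trades write cost for read speed. As a minimal sketch of the same hierarchy (the table, column names, and interval values here are illustrative, not from the thread), each node stores a left/right interval, and ancestor/descendant queries become plain range scans with no recursion:

```sql
-- Hypothetical nested-set encoding: a descendant's (lft, rgt)
-- interval is strictly contained in its ancestor's interval.
CREATE TABLE employee_ns (
    id   int PRIMARY KEY,
    name text,
    lft  int NOT NULL,
    rgt  int NOT NULL
);

INSERT INTO employee_ns (id, name, lft, rgt)
VALUES (1, 'jane',    1, 8),   -- root of her tree
       (2, 'john',    2, 3),
       (3, 'jake',    4, 7),
       (5, 'jessica', 5, 6);

-- Jessica's whole chain of command, no recursion needed:
-- every row whose interval contains hers.
SELECT a.name
FROM employee_ns a, employee_ns d
WHERE d.name = 'jessica'
  AND a.lft < d.lft
  AND a.rgt > d.rgt;
```

Reads are cheap, but inserting or moving a node means renumbering intervals, which is exactly the kind of tradeoff mentioned above.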
> 3) Carefully-tuned RDBMSs are more capable than many tend to admit.
So important. I have, in the past, worked with many people who simply thought NoSQL was new and RDBMSs were old, therefore ALWAYS use new. Trying to get someone to pick an RDBMS solution in said environment was impossible. The times I've had to take a document store and make it store relational data are more than I'd like to admit.
I've never worked with Postgres before my current project and I'm so impressed with how easy it is to get stuff done. I used Cassandra for a prototype (bad choice for my workload, not Cassandra's fault), and the documentation, ease of tuning, tooling, and maintenance all seem to be so much easier. Of course, Cassandra has a host of problems to solve that Postgres doesn't -- being meant to run on clusters and all. But you know what, I don't care. I also was heavily leaning toward new stuff before. But what can I say: if an RDBMS is what I need, I probably should use one.
I have a pet theory that any given system has a maximum cleverness budget. Make sure to spend it on the things which actually matter in your problem domain.
Interesting data products often have to spend a lot of that cleverness budget in the algorithmic layer – machine learning and statistics. So when I work on that kind of thing, I've learned to have a real appreciation for boring data storage. I'll even happily use MySQL; the ways in which MySQL can suck are very well-understood and that's a hugely underappreciated virtue.
On the other hand having the right kind of data storage can greatly benefit some algorithms. For example a graph database might work better with some graph based algorithms.
What are some resources one could use to learn 3)? I know the very basics of Postgres since I am a web dev, but I'd like to learn more about fine-tuning an RDBMS for better performance.
The PostgreSQL manual [1] is an incredible resource. The first 100 or so pages give you a lay of the land. If you read those you will be lightyears ahead of the average web dev. After that you can dig into other chapters/topics that interest you.
Even though I've never sat down and read it, I can say with confidence that the Postgres manual is one of the most impressive manuals I've ever seen. They could have published the manual as 2 books, and I would have paid money to read them.
I always appreciated that Postgresql documents their differences from standard SQL. When I want to know the standard way I often revisit their documents, in addition to those of the database I'm using.
And yet when I see data engineering positions, particularly around here, it's all about Hadoop/HDFS, Spark, Cassandra, Mongo. I'm not saying there aren't use cases for those, just that it seems like they've become the default basket of options rather than starting with an RDBMS and figuring out why you shouldn't use that.
I feel like a dinosaur working with Teradata all these years, but it hasn't failed me. Sure we spend some extra time modeling, and then some extra when we need changes, but other than the cost, it's really been solid.
That cost that you glossed over is massive. Teradata is very, very expensive. It's great. It works. But it's expensive. Especially if you want to store large amounts of data e.g. networks or web traffic.
Fair enough. I've been in large companies for too long, it seems. We'll pinch pennies over things like coffee and continuing education, but we're still paying those Teradata and SAS bills.
RDBMSs are more capable in multiple respects, be it speed or functionality.
I often see developers doing all kinds of things in the application layer: enforcing constraints, manipulating data, even sorting and filtering. It's like some folks are afraid of writing SQL or don't trust their database.
Ideally RDBMS would be an integrated part of the development environment so that you could refactor your code and your data/schema simultaneously. Unfortunately our IDEs aren't mature enough. Perhaps RDBMSs will gradually evolve into IDEs that combine application logic and data management into a seamless whole.
> > I often see developers doing all kinds of things in the application layer; enforcing constraints, manipulating data, even sorting and filtering. It's like some folks are feared of writing SQL or don't trust their database.
> Maybe they got spoiled by ORM mappers that anything beyond standard CRUD may feel cumbersome to write.
A decent ORM[0] will use the constraints expressed in the application to reflect those as constraints in the DB schema as well.
Why the duplication, you ask? For the same reason that many of the same constraints are imposed in the front-end code: Responsiveness.
Validation that happens in the front-end avoids a request/response to the server just to give the user feedback that their password isn't long enough.
Relationships and constraints expressed in the application can avoid querying the database just to get the same kind of feedback.
This duplication may seem useless, pointless, repetitive, and redundant, but not all accesses to the web server will be via a JS-enabled client, for example anyone accessing the application via an API.
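The database-side half of that duplication can be as small as a couple of declarative constraints. A hedged sketch (the table and rules below are made up for illustration): the application checks these for fast feedback, while the schema enforces them as the backstop that also covers API clients with no front-end validation.

```sql
-- Illustrative only: rules the front-end already validates,
-- restated in the schema so no access path can bypass them.
CREATE TABLE account (
    id       serial PRIMARY KEY,
    username text NOT NULL CHECK (char_length(username) >= 3),
    email    text NOT NULL UNIQUE CHECK (position('@' in email) > 1)
);
```

A violating INSERT fails with a constraint error, which the application can map back to the same user-facing message its own validation would have produced.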
But it's not like he wrote RethinkDB and then wrote Postgres with the benefit of hindsight. When he says that the big change was the database and not his own code, I believe him.
I believe him, too. I'm not trying to participate in any database war. I've only said that any rewrite is going to have the benefit of hindsight, and because of that, his results are going to reflect that to a degree.
He outlines several places where he had to do something because PostgreSQL doesn't do it, but he suspected beforehand that RethinkDB didn't do that well.
This gives me hope that maybe we could see something similar with Datomic. Perhaps it is possible to implement the same append-only (this is the best way I understand it), immutable audit trail on top of PostgreSQL and still walk away with SQL (datalog is neat-o but has a learning curve).
Technically PostgreSQL already has this in terms of the WAL log. The problem is that "rolling back" to a given point of time requires recreating the database from the WAL log so it's not useful in the same way.
There are other "remember everything" schemes you can play with clever triggers, but it always comes back to how you end up using the stored data and how easy it is to bring it back to a queryable state.
PostgreSQL also has it in the form of its MVCC table structure. If you never delete 'old' rows, you could query old states of the databases by ignoring rows with newer transaction IDs.
However, nothing about PostgreSQL is designed for this approach, so I imagine the performance would be terrible.
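The trigger-based "remember everything" scheme mentioned a couple of comments up can be sketched roughly like this (all names here are illustrative): every outgoing row version is filed away into a history table before it disappears.

```sql
-- Hypothetical audited table plus its history side-table.
CREATE TABLE document (
    id   int PRIMARY KEY,
    body text
);

CREATE TABLE document_history (
    id         int,
    body       text,
    changed_at timestamptz NOT NULL DEFAULT now()
);

-- Copy the pre-change row on every UPDATE or DELETE.
CREATE FUNCTION document_audit() RETURNS trigger AS $$
BEGIN
    INSERT INTO document_history (id, body)
    VALUES (OLD.id, OLD.body);
    RETURN NULL;  -- AFTER trigger; return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER document_audit_t
    AFTER UPDATE OR DELETE ON document
    FOR EACH ROW EXECUTE PROCEDURE document_audit();
```

Answering "what did this row look like at time T" then means querying document_history, which is exactly the "bring it back to a queryable state" cost described above.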
Our proposed approach is to treat the log as normal data managed by the DBMS which will simplify the recovery code and simultaneously provide support for access to the historical data.
...
3.3. Time Varying Data
POSTQUEL allows users to save and query historical data and versions [KATZ85, WOOD83]. By default, data in a relation is never deleted or updated. Conventional retrievals always access the current tuples in the relation. Historical data can be accessed by indicating the desired time when defining a tuple variable.
...
Finally, POSTGRES provides support for versions. A version can be created from a relation or a snapshot. Updates to a version do not modify the underlying relation and updates to the underlying relation will be visible through the version unless the value has been modified in the version.
One of the purposes of the much-maligned VACUUM command was to push the historical data to archival (optical) media.
The archival store holds historical records, and the vacuum demon can ensure that ALL archival records are valid.
> POSTQUEL allows users to save and query historical data and versions [KATZ85, WOOD83]. By default, data in a relation is never deleted or updated. Conventional retrievals always access the current tuples in the relation. Historical data can be accessed by indicating the desired time when defining a tuple variable.
Or, to put it another way [1]:
"Since one can't change the past, this implies that the database accumulates facts, rather than updates places, and that while the past may be forgotten, it is immutable."
One of Postgres' proudest features actually used to be this -- "time travel", which used MVCC + row time stamps to allow the client to query historical data. This was eventually removed, but the functionality can be emulated using a contrib extension called "timetravel" [1].
Maybe you're looking for event sourcing? It's great in theory but for all the reading I've done I haven't been able to find any case studies on it working well in practice
I think event sourcing is mostly used by large enterprises that don't really blog much about these things.
Let me just say: it's a very interesting approach, but it's also very complicated and has a large overhead in development time and infrastructure complexity.
For most problems, it's A LOT easier to do classic CRUD + some distributed task queue.
You can combine those two. It's pretty easy to just emit an additional event for each write into some event store (Apache Kafka, Postgres, whatever), so you can get a 'best of both worlds' state.
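When the event store is Postgres itself, that "emit an event for each write" combination might look roughly like this (the tables here are hypothetical): the state update and the event append share one transaction, so both commit or neither does.

```sql
-- Hypothetical append-only event table alongside a normal CRUD table.
CREATE TABLE account_events (
    id         bigserial PRIMARY KEY,
    account_id int NOT NULL,
    event_type text NOT NULL,
    payload    jsonb,
    created_at timestamptz NOT NULL DEFAULT now()
);

BEGIN;

-- The classic CRUD write...
UPDATE account SET email = 'new@example.com' WHERE id = 42;

-- ...plus the corresponding event, atomically.
INSERT INTO account_events (account_id, event_type, payload)
VALUES (42, 'email_changed', '{"email": "new@example.com"}');

COMMIT;
```

The event rows can then be replayed, audited, or shipped downstream (to Kafka or similar) without the main tables ever having to give up their plain CRUD shape.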
An append only audit trail would also have to carry some type information, since tables can change shape. It makes the relationship to ordinary relational querying interesting, to say the least.
Postgres's logical replication slots do provide a way to implement your own change streaming.
> I didn’t seriously consider MySQL since it doesn’t have LISTEN/NOTIFY, and is also GPL licensed, whereas PostgreSQL has a very liberal license.
The GPL/AGPL doesn't require any sort of special disclosure if you're just running a vanilla distribution of a server without modifying its code...
In the actual article he mentioned a big potential customer who had license limitations... no AGPL! I can understand someone not wanting to limit the sales audience.
That particular quote was in relation to RethinkDB though, not MySQL:
> Then I was in a very long and intense meeting with a potentially major customer for an on-premises install, and one of their basic requirements was “no AGPL in the stack”. With the RethinkDB company gone, there was no way to satisfy that requirement, and my requests went nowhere at the time.
The point being that the AGPL licensing of RethinkDB caused problems selling SMC to customers. So why invite those same problems again by replacing it with other GPL'ed software?
I don't know. But if you read further down the article:
> All the code I wrote related to this blog post is – ironically – AGPL. Basically it is everything that starts with postgres- here.
All the code they just wrote for the Postgres rewrite is AGPL. Obviously they can relicense since they own the copyright, but it's weird to mention MySQL's GPL license in passing as a reason they didn't go that way, yet put their own code up as AGPL.
One of the motivations to use Postgres was that it was not AGPL, while the associated code is AGPL; this is what drives the inclusion of the word ironically, does it not? In my reading, the author is very aware of the inconsistency.
It's not much different from the GPL. The AGPL closes one loophole in the GPL and that's pretty much it.
The GPL license essentially triggers on distribution, but if you have some SaaS app it's not clear that this falls under distributing your code since you're accessing what the code produces over a network. This is usually known as the "Application Service Provider" hole in the GPL.
With the AGPL that hole is closed and you have to be able to receive the source code for that app too.
While the author didn't mention the GPL specifically, he did mention that licensing was a consideration:
Then I was in a very long and intense meeting with a potentially major customer for an on-premises install, and one of their basic requirements was “no AGPL in the stack”.
Perhaps GPL was close enough to AGPL that it made MySQL's license problematic.
I think the (A)GPL related issues around the RethinkDB code might've left some bad taste in the mouth of certain people. There also are people that for various reasons, well informed or not, don't want (A)GPL code in their stack which might limit the ability to sell this.
However, I don't see anything related to the use cases in the article that would make GPL be an issue. Unless they want to build proprietary extensions to the database. Considering they've released the code related to the Postgres rewrite as AGPL that seems even stranger.
A really nice thing about code written in the 90s and earlier is that it was designed from the get-go to be performant, because it had to be. No one in conventional software writes code like that anymore, not really. E.g., when was the last time you used a profiler?
I recently updated from Fedora 24 to 25. I noticed a big performance drop until I shoved more ram into my desktop, and now it's fine again. I can't be certain but I'd wager that this might be because F25 is the first Fedora to use Wayland (over X) by default. X might be old and fugly but it was certainly written in an era where it had to achieve a certain baseline level of performance.
To be fair, Wayland improves upon X in many areas other than performance, and while this might be true for that machine/workload, I did not notice any slowdown running it in a VM. (I also don't use that much RAM; while an editor/IDE might be hungry, they only eat a few gigabytes. I rarely run things that really eat RAM for lunch.)
It's true that Wayland uses more RAM and if you don't have enough swapping will kill you. But in this case (and others) RAM use is inversely correlated with performance. For example Wayland uses more RAM as full backing buffers for every window, and this actually makes moving windows around faster because it doesn't have to ask the app to re-draw exposed areas. Unless you spill into swap, then you are indeed screwed.
Xorg with a compositor does the exact same thing with full backing buffers for every window. The only way X is using less RAM is if you run without compositing.
If you're not using a compositor, you must be either living in the 90s or using a device without a supported GPU. Compositing is not new, people were excited about Compiz desktop cubes and wobbly windows back in the mid-2000s!
A Wayland compositor running directly on EGL should always be faster than Xorg with a compositor. Xorg acts as an awkward middleman between your app and the compositor, wasting time on buffer copies.
I use Xcode profiler on completed new features every time before I submit a pull request. If you do iOS/OSX development, it is always a good idea to use Xcode profiler to catch memory leaks/abandoned memory in order to minimize app crashes.
> A really nice thing about code written from the 90s and earlier is that it was designed from the get go to be perfomant, because it had to be.
I don't think that's really the case in PG. Sure there's some places like that, but there were a lot of fairly fundamental performance issues only fixed in the last year. And there's still a lot of things to be done.
Additionally, a lot of the things that you had to do 20+ years ago aren't the ones that you have to do today to get good performance. Being careful about cache usage and pipeline stalls became a lot more important on recent-ish CPUs than earlier ones.
> No one in conventional sw writes code like that anymore, not really. eg. When was the last time you used a profiler?
I think that's more a difference between application and infrastructure pieces of code. It's only worth spending time (and have skilled enough people) doing detailed optimization work if $project is going to be in the bottleneck for a lot of people. A couple of days on postgres can result in a lot bigger savings, if you multiply the saved optimization time over all its users.
Good read, and yes, Postgres is king, no argument there, but I do think the comparison is a bit unfair to RethinkDB. Work on Postgres started in 1986; you can't compare the performance of tools when one of them has had 30 years to work on performance and the other is about 5 years old and has had the benefit of only a handful of brains working on it. I would say it's remarkable what RethinkDB did in this timeframe.
In relation to Postgres and real time messages, another approach is to use a real messaging server instead of using only the simple listen/notify interface pg provides. It's possible to connect them using this https://github.com/gmr/pgsql-listen-exchange
I am just wrapping up the integration here (http://graphqlapi.com) and so far it looks good. Postgres provides the power and features we all know and rabbitmq gives you all the realtime capabilities you need, and you can route messages in complex ways and have them delivered to a whole bunch of clients.
You are right, that was not a good choice of words. What I was getting at is that the article reads as "I was constantly fighting RethinkDB and Postgres solved all my problems", and even if people don't make judgements about the developers, they do about the tool, and that kind of thing remains in your brain: you attach a stigma to the tool and never consider it again, even though it will get a lot better if you just give it time. For example, I tried Postgres back in 2000-something, did just a stupid benchmark on how fast it does a count, compared it to MySQL, and never looked back. So for 10 years I ignored Postgres because of that stupid test and because in my brain there was this idea that it's slow. Thank god I somehow got over that...
Might be wrong, but I feel like you may have a vested interest and hence read the article as more of an attack than it really was.
To me it reads more like "I understood the problem space better during the rewrite, so I could solve things in more appropriate ways with the tools at hand."
Yes, you are wrong :) I do not have a vested interest in RethinkDB; I actually have a vested interest in Postgres :). So I was kind of defending the other guy, although I am PG all the way (read my other comments, you'll see I quite like PG, so much so that I built a tool/business around it).
Of course you can. Unless the newcomer gives me the $800/month the OP saves now with pgsql, performance is one of the key metrics. Nobody cares if you can ride your bicycle hands-free if you come in last in the race.
Do I understand correctly that the author went from a distributed database to a single-master scenario? That's a valid tradeoff, but I'd clearly describe it as such.
My experiences with RethinkDB have been rather positive, but my load is nowhere near that of what the article describes. I agree that ReQL could be improved, I found that there are too many limitations in chaining once you start using it for more complex things.
But the two most important advantages remain for me:
* changefeeds (they work for me),
* a distributed database that I can run on multiple nodes.
I do agree that PostgreSQL is fantastic and that SQL is a fine tool. In my case the above points were the only reasons why I did not use PostgreSQL.
EDIT: after thinking about this for a while, I wonder if the RethinkDB changefeed scenario is doable with the tools in PostgreSQL: get initial user data, then get all subsequent changes to that data, with no race conditions. Many workloads seem to concentrate on twitter-like functionality, where there is no clear concept of a change stream beginning and races do not matter.
You say that you did not have such loads, so why did you need multiple nodes then? Why the complexity when databases like Postgres can do 1.5M queries per second on a single box?
StackOverflow is running just fine on basically one big MSSQL box (the other is just a standby replica, if I remember correctly).
* you are solving problems you don't have
* you can have replicas with postgres also
Even if you do end up getting the load you are hoping for (though I doubt you will have 1.5 million QPS), reads are easily scalable with replicas, and writes are also possible with sharding or tools from Citus Data.
I don't have other examples off the top of my head, but I know PG is used by yandex.ru, which is like the Google of Russia; they migrated from Oracle and are very happy. So if it works for them, I think it will work for you :)
It's much easier to write in a new database in the future, than it is to deal with greater complexity initially.
I find that in the first phases, I often don't know what kind of load is going to be where, so I simply use a relational database, since they give excellent data security and okayish performance for most things under low load.
I think the inverse is actually the case here. If you need changefeeds, use Rethink up front rather than write your own implementation. When you hit a problem with its performance, then think real hard about your domain and see if it's possible to replicate the safety guarantees it gives you on another DB.
Unless you have a workload that won't fit on a single machine, a distributed database isn't really an advantage over a single master with one or more appropriate replicas that are available for failover.
I strongly suspect that the author has ignored race conditions in his changefeeds implementation.
Atomic changefeeds was the thing that made me start using RethinkDB. I'd been excited about it before then but only started using it for production workloads once that feature shipped.
They'd be possible to implement in Postgres with some kind of monotonic counter that increases on every update of every row in the table. That would be pretty expensive though.
Atomic changefeeds were a big addition, but resumable changefeeds are still a big gap. Something like what Kafka has. You can emulate this somewhat with atomic changefeeds and updating documents to mark them as "processed".
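The monotonic-counter idea mentioned above could be sketched like this (illustrative names; this is a sketch of the mechanism, not a complete changefeed implementation): one global sequence stamps every row version, so a consumer can resume from the last revision it processed.

```sql
-- One global sequence shared by all tracked tables.
CREATE SEQUENCE changefeed_rev;

-- Hypothetical tracked table; rev records the write ordering.
CREATE TABLE document (
    id   int PRIMARY KEY,
    body text,
    rev  bigint
);
CREATE INDEX ON document (rev);

-- Stamp every insert/update with the next revision number.
CREATE FUNCTION stamp_rev() RETURNS trigger AS $$
BEGIN
    NEW.rev := nextval('changefeed_rev');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER document_rev
    BEFORE INSERT OR UPDATE ON document
    FOR EACH ROW EXECUTE PROCEDURE stamp_rev();

-- A consumer resumes with a plain indexed query:
-- SELECT * FROM document WHERE rev > $last_seen ORDER BY rev;
```

One caveat that hints at why this is expensive to get fully right: concurrent transactions can commit out of rev order, so rows may become visible to a reader slightly out of sequence, which a careful consumer has to account for.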
> EDIT: after thinking about this for a while, I wonder if the RethinkDB changefeed scenario is doable with the tools in PostgreSQL: get initial user data, then get all subsequent changes to that data, with no race conditions.
It's easier if you do things in the other order, and can make assumptions about the structure of your data. In my implementation, I have to first start listening for changes, then do the initial query (relevant code:
https://github.com/sagemathinc/smc/blob/ce594ff0574ce781bf78...).
As you say, race conditions aren't an issue for some applications. For me, (slightly simplifying) the main table that involves changefeeds is a table of (timestamp, patch) pairs, and for it, race conditions aren't an issue -- you get the data in whatever order you want and merge it on the client. I'm definitely making no claims to have implemented changefeeds for general PostgreSQL queries or general data.
I am concerned that there is an edge case with one of the tables where there is a race condition that causes trouble, in which case I'll have to explicitly change the schema in order to account for this. Somebody below writes "I strongly suspect that the author has ignored race conditions in his changefeeds implementation."; that's not exactly true, since I'm worried about them and try to structure my data to account for them.
How far has postgres come w.r.t. setting up a cluster with automatic fail-over and recovery? I didn't see the author address this aspect of Rethink that has a lot going for it.
Thanks. Indeed, I'm just using one PostgreSQL master, with no HA, but plan to set something up later, and have put repmgr on my todo list. (I'm the post author.)
Yes. It's currently using 80GB of disk, so regarding disk space I can scale for years. I have a tiering system where older data gets stored in Google Cloud Storage and grabbed on demand. So regarding space, there isn't an issue scaling up. It's also trivial to increase disk space on live VMs on GCE.
> So you scale the database by switching to a stronger machine?
That's my plan. We're at about 50% overall load as I write this, and here's the usage on the database server: "load average: 0.09, 0.16, 0.17". We're collecting historical Prometheus and other monitoring about load and queries, etc. By the time we need 32 cpus (the GCE limit), either the GCE limit will be raised, the company will be dead (no way in hell), or the company will be wildly successful. So in the only case in which I need to scale out, it will be an exciting new problem that I can hire a team to help with, and we'll benefit from understanding exactly what the problem is we need to solve.
I have wasted an enormous amount of time until now due to premature worry about scalability, fueled by my naive assumptions about how quickly SageMathCloud would grow.
Clustering and automatic fail-over and recovery is not always a great idea. It sounds like you want it implemented so you "don't have to worry about it", but fail-over and recovery systems are a lot more complicated and troublesome than an alert, some read-only degraded operation, and a manual promotion.
Aphyr's "call me maybe" series shows just how hard it is to get automatic failover and recovery working just right. If you're not ready to become a full-time expert on the topic, and pay for more servers for the same load, then simple replication and manual promotion can result in less downtime and less data loss. I've seen multiple small startups have Cassandra clusters fail because they did not maintain them properly. I've seen them mysteriously lose data in Elasticsearch clusters for the same reason.
That said, GitLab has recently shown that even simple replication can be screwed up. I'd say, the lesson is, the most effective thing is to keep it simple (not "easy"), and avoid disaster caused by mundane mistakes.
Yeah, if you can wait another year or two, something like CockroachDB [1] might fit the bill. That is a database that is distributed from the start. So you (theoretically) won't have to worry as much about individual datacenter failures, and scaling.
Couchbase appears to require failover as it has authoritative servers (three, it looks like?). CockroachDB should scale linearly with cluster size, meaning failover isn't a thing--it just struggles until you give it more CPU/disk/mem/what have you.
We had serious operational issues with Couchbase, and some funny developers who were always ready to answer our tickets with "works on my laptop". Once we had 1.3 billion documents in the cluster, it became unstable and we had to delete all of the data to bring the service back online. Luckily we used S3 as the source of truth, based on the operational characteristics of CB. I would not recommend it to anybody who is not ready for massive data loss and continuous operational issues. Some of these might have been fixed in the meanwhile, though.
Yes, I've liked the design from the very start, and they've got a nice group of people working on it. It's supposed to go into a 1.0 release this year.
I'll probably try it out on a service where I'm writing simultaneously to two different databases, one local and one distributed. And then log any issues where I see an inconsistency. I won't be relying on it for a while for anything critical.
My opinion is evolving on this matter. I've seen more than a couple of developer-introduced bugs in our application that cause data loss (improper error handling, swallowing errors instead of retrying, etc.). We use RDS with automatic failover and a Redis Labs cluster with automatic failover. I'm 99% positive we have lost data before, and it makes me very uncomfortable to think about. But since we lose data from buggy code, I can't really justify a huge amount of engineering effort to avoid the chance of loss in DB failover.
IMO the main advantage for RethinkDB is its HA story. Last time I had to manage a PostgreSQL cluster (2012-2013), its HA story was pretty bad. It was limited to a master-slave(s) setup with manual failover and manual cluster rebuilding, all dependent on incomplete 3rd-party tools. Has PostgreSQL improved on this? A quick googling leads me to believe it hasn't, and I'd only even consider it again if it were managed by a 3rd party (e.g. AWS RDS).
If you want a managed solution, Redshift supports PostgreSQL syntax and connectors now, I think. That's a fully managed turnkey DB, but of course it locks you into AWS.
Redshift isn't highly available (if a node goes down you have to wait for it to restore a new one, which takes hours) and only has a small subset of PostgreSQL features. It is really only intended for OLAP (analytics) processing.
Not meaning to be an ass, but this is really bad advice.
Redshift is not supposed to be used like Postgres. Redshift is a data warehousing solution, with completely different tradeoffs, and accommodating completely different work loads than your average Postgres database. For example, you can't create indices on Redshift tables or relationships between them, the consistency story is completely different from Postgres and Redshift is optimized for bulk loads, not millions of discrete inserts/updates a second.
Amazon RDS does provide hosted Postgres, and that is what you want to use if you want managed Postgres. Or you can use Heroku's hosted Postgres (it does get expensive with size, though).
Postgres and Redshift should not be considered equivalent. It's true that Redshift originated as a fork of Postgres 8, but Postgres has come a _long_ way since then and is quite different.
On top of that, Redshift is a columnar database, which is an entirely different animal than vanilla Postgres (or any other row-oriented DBMS).
It surely has improved (replication and failover are much easier now than they used to be), but it's still not anywhere close to "just give it the other node's address and we're good". And that BDR thingy for master-master stuff is still an unofficial fork (although, as I understand it, it's from one of the main Postgres contributors, so it's really close).
It has improved quite a bit since then (in 9.2 or 9.3 when you last used it), but it is still not all the way there. Especially rebuilding the cluster is much easier now.
Excellent point! My counter-argument to that would be: How many startups/SMEs embracing Rethink really need full four or five nines availability?
I'd expect the costs of running that HA stuff to be an order of magnitude higher than the costs of even a few days of downtime per year. And migrating from Postgres to something with HA once it's needed is probably easier than migrating to Postgres if costs are killing you.
What I didn't see mentioned here is clustering. One of the things that sold us on RethinkDB was how easy it was to cluster compared to PostgreSQL. The latter has poor documentation and it's very hard to know you've done things right... and if you don't the results can be catastrophic failures or mysterious replication problems with cryptic error messages.
Edit: also I was led to believe by PG documentation that LISTEN/NOTIFY is impossible across a cluster, which means that code depending on LISTEN/NOTIFY is impossible to cluster. If that's the case you're stuck with master/slave and manual or (scary) automatic failover now.
We wanted a system that is masterless (or all-master) in the sense that any node can fail at any time and the system doesn't care. RethinkDB delivers that, at least within the bounds of sane failure scenarios, and it delivers it without requiring a full time DBA to set up and maintain. That's worth a certain amount of CPU, disk, and RAM in exchange for stability and personnel costs, especially when a bare metal 32GB RAM SSD Xeon on OVH is <$200/month fully loaded with monitoring and SLA. So far we've been unable to throw a real world work load at those things that makes them do anything but yawn, and OVH has three data centers in France with private fiber between them allowing for a multi-DC Raft failover cluster. It's pretty sweet.
The only thing that would make me reconsider is if the use patterns of our data were really aggressively relational. In that case PGSQL would be a clear winner in terms of the performance of advanced relational operations and the expressivity of SQL for those operations. ReQL gives you some relational features on top of a document DB but it has limitations and is really designed for simpler relational use cases like basic joins.
"We wanted a system that is masterless (or all-master) in the sense that any node can fail at any time and the system doesn't care."
Neither PostgreSQL nor MySQL can do much here, and this is actually why we have the NoSQL movement. These problems are fundamentally unsolvable with the trade-offs those RDBMSs made.
And that's why we have the "NewSQL" movement, though a bit slow. The benefit of NoSQL in some cases (Cassandra and HBase for me) is that it restricts developers from using non-scalable data structuring and querying. RDBMS's have a bad reputation primarily because how they were used via unrestricted queries that came back to haunt the product owners.
And even then, some guarantees of a local DB are fundamentally impossible to provide in a distributed way (CAP, exactly-once delivery, etc.) without trading something away, even with CockroachDB.
> MySQL Group Replication is a MySQL Server plugin that provides distributed state machine replication with strong coordination between servers. Servers coordinate themselves automatically, when they are part of the same replication group.
Great writeup! One of the issues I've run into with LISTEN/NOTIFY is the fact that it's not transaction safe. ie if you call NOTIFY and then encounter an error causing a rollback, you can't undo the NOTIFY.
I ended up building a system on top of PgQ (https://wiki.postgresql.org/wiki/SkyTools#PgQ) called Mikkoo (https://github.com/gmr/mikkoo#mikkoo) that uses RabbitMQ to talk with the distributed apps that needed to know about the transaction log. Might be helpful if you end up running into transactional issues with your use of LISTEN/NOTIFY.
NOTIFY is transaction-safe: if the NOTIFY-ing transaction rolls back, the NOTIFY will not be delivered. See the discussion here ("NOTIFY interacts with SQL transactions in some important ways..."):
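As a quick sketch of that transactional behavior (hypothetical channel name, in a psql session), a NOTIFY issued inside a rolled-back transaction is never delivered:

```sql
-- Session A: subscribe to a channel
LISTEN jobs;

-- Session B: a NOTIFY is queued with its enclosing transaction
BEGIN;
NOTIFY jobs, 'new job';
ROLLBACK;   -- the notification is discarded; session A receives nothing

BEGIN;
NOTIFY jobs, 'new job';
COMMIT;     -- only now is the notification delivered to session A
```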
Yup, my bad, thanks. I conflated two problems I worked through in my use case. The transaction-safety one was not the NOTIFY version of the project. That had different issues, which I need to go back and look at my notes for. My memory is a bit hazy, but IIRC we were seeing notifications dropped at high velocities and needed delivery guarantees.
I seem to remember that when it was first implemented there was a fairly small buffer for queued messages. That might have contributed to your problem. The docs say it's 8GB by default now, which is probably sufficient.
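If you're on 9.6 or later, you can also check how full that queue is with a built-in function (a sketch; not applicable to the older versions being discussed):

```sql
-- Returns the fraction (0 to 1) of the async-notification
-- queue currently occupied by pending notifications
SELECT pg_notification_queue_usage();
```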
Mainly because I've had way too much on my plate and have not had time to work on any of my open source projects if they're not directly related to something I'm doing at work.
That's good to hear: the solution itself doesn't have a problem. I got it compiling for 3.6.x and working.
I will try to look into stability (not crashing everything on disconnect) so any advice on how to go about that would be appreciated (email in profile).
The whole point of LISTEN/NOTIFY is that it's transaction safe! Not sure where you got that idea.
Not only is NOTIFY transactional, so is LISTEN.
The only serious limitation I can think of: There was some talk, long ago, about propagating notifications via the WAL, so that LISTEN could work on a read-only slave. I don't know if this was implemented.
My use case: I run a couple of stateful multiplayer games where I have a Node.js server writing to MongoDB on each player update (e.g. "update player.x to 123"). I only read from the DB when a player logs in and is added to the game or when the server restarts (items etc.). As long as a player is online, the state is just kept in memory (a big array of all the players) but is written to the DB every time it's updated.
This means the game could theoretically work without a DB at all, the time it takes to write to the DB etc doesn't matter as long as it happens in the correct order. Read speed is also not really relevant since it happens so seldom.
The MongoDB document is the same as the player object in Node.js.
I've been thinking of migrating to RethinkDB but I've also been looking at PostgreSQL. Would the JSON support cover this sufficiently and would it make sense? I don't need any schemas or anything like that. I just want to be able to add and update JSON objects.
The biggest benefit of using an RDBMS is that it forces you to organise your data in a generalised way, and make it more available for queries that haven't been written yet. In your particular use case that doesn't sound very important, as you are basically just using it to store the game state.
Postgres's JSON types are really powerful, but a document store is probably all you need, so I wouldn't recommend switching unless you've identified a tangible benefit.
One scenario that I can imagine is if you want to generate statistics across your game data. If you find yourself pulling lots of JSON from the database, and then doing loops, aggregations, or manual joins, within your application code, then switching out to a RDBMS could be a good idea. In that case, you could use postgres to store JSON much as you are doing with Mongo, but then have some in-database ETL to transform that data into a relational schema, which you could query in more powerful ways.
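For what it's worth, the Mongo-style "save the player object" pattern maps fairly directly onto a single JSONB column. A minimal sketch, assuming Postgres 9.5+ and made-up table and field names:

```sql
-- One JSONB document per player, keyed by id
CREATE TABLE players (
    id    text PRIMARY KEY,
    state jsonb NOT NULL
);

-- Insert-or-replace the whole document (roughly a Mongo upsert)
INSERT INTO players (id, state)
VALUES ('player42', '{"x": 123, "y": 7, "items": ["sword"]}')
ON CONFLICT (id) DO UPDATE SET state = EXCLUDED.state;

-- Or patch a single field in place instead of rewriting the document
UPDATE players
SET state = jsonb_set(state, '{x}', '124')
WHERE id = 'player42';
```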
If you are already using Node.JS and need to merge/update JSON objects rather than them being overwritten, check out these couple of games made with a database designed for what you are talking about:
Thanks! How does it work? Is it just storing and updating a JSON file without any real DB? My game is running on Heroku (connecting to mLab) so I don't think that would work.
However, you are right, that will just run "in-memory" (and I think that URL is using an old version of the database).
But! No fear, there are adapters that connect to Amazon S3 (or others, you can connect to other storage engines, like Level - which in turn can connect to pretty much anything else). The S3 route is awesome because it means...
1. Multiple Heroku instances will still have persistent data across many reboots.
2. You don't have to worry about any Heroku plugin configuration / buildpack stuff.
3. S3 is ridiculously cheap, and you can even set it to auto-delete stuff after a month (in case you don't want to pay for long term storage).
Plus, there is no vendor lock in, so you can easily switch away later.
I appreciate the detailed analysis. A few comments:
> This post is probably going to make some people involved with RethinkDB very angry at me.
Actually, our community has always felt the opposite. Performance and scalability issues are considered bugs worth solving. That may have been the reaction of one or two community members, but that doesn't represent our values at all.
> A RethinkDB employee told me he thought I was their biggest user in terms of how hard I was pushing RethinkDB.
This may have been true (at the time) in terms of how SMC was using changefeeds, but RethinkDB is used in far more aggressive contexts. Here's a talk from Fidelity about how they used RethinkDB (for 25M customers across 25 nodes): https://www.youtube.com/watch?v=rm2zerSz6aE
SMC did seem to uncover a number of surprising bugs along the way: I would describe it as one of the more forward-thinking use cases that pushed the envelope of some of RethinkDB's newest features. This definitely came with lots of performance issues to solve. I appreciate William’s tenacity and patience in helping us track down and fix these.
> In particular, he pointed out this 2015 blog post, in which RethinkDB is consistently 5x-10x slower than MongoDB.
It’s worth pointing out that this particular blog post raised serious questions about its methodology, and recent versions of RethinkDB included very significant performance improvements: https://github.com/rethinkdb/rethinkdb/issues/4282
> Even then, the proxy nodes would often run at relatively high cpu usage. I never understood why.
I'd have to double-check with those who are far more familiar with RethinkDB's proxy mode, but it's because the nodes are parsing and processing queries as well, which can be CPU-intensive. They don't store any data, but if you use ReQL queries in a complex fashion (especially paired with changefeeds) it's going to require more CPU usage. We generally recommend that you run nodes with a lot of cores to take advantage of the parallelized architecture that RethinkDB has. This can get expensive if you aren't running dedicated hardware.
> The total disk space usage was an order of magnitude less (800GB versus 80GB).
> I imagine databases are similar. Using 10x more disk space means 10x more reading and writing to disk, and disk is (way more than) 10x slower than RAM…
This isn't necessarily true, especially with SSDs. RethinkDB's storage engine neatly divides its storage into extents that can be logically accessed in an efficient fashion. This is particularly valuable when running on SSDs, which are fundamentally parallelized devices. RethinkDB also caches data in memory as much as possible to avoid going to disk, but using more disk space doesn't immediately translate to lower performance.
One other interesting detail: since RethinkDB doesn’t have schemas, it stores the field names of each document individually. This is one of the trade-offs of not having a schema: even with compression, RethinkDB would use more space than Postgres for this reason. (This also impacts performance, since schemaless data is more complicated to parse and process.)
> Not listening to users is perhaps not the best approach to building quality software. [referring to microbenchmarks]
I think William may have misinterpreted the quote he describes from Slava’s post-mortem. Slava was referring to benchmarks that don’t affect the core performance of the database or production quality of the system, but may look better when you run micro-benchmarks: https://rethinkdb.com/blog/the-benchmark-youre-reading-is-pr...
We have always had an open development process on GitHub to collaboratively decide what features to build, and what their implementation should look like. I’m not certain what design choices William is suggesting we rejected. One only has to look at the proposal for dates and times in RethinkDB to see how this process and open conversation unfolds with our users: https://github.com/rethinkdb/rethinkdb/issues/977
> Really, what I love is the problems that RethinkDB solved, and where I believed RethinkDB could be 2-3 years from now if brilliant engineers like Daniel Mewes continued to work fulltime on the project.
RethinkDB development is proceeding after joining The Linux Foundation, despite the company shutdown. We believe that with a few years of work, RethinkDB will continue to mature as a database to reach Postgres’ level of stability and performance. We’re exploring options for funding dedicated developers long-term as an open-source project.
My thoughts: whatever technology you end up picking is going to have tradeoffs depending on your use case (and the maturity of the technology) and it's going to come with baggage. That's true of Postgres, MongoDB, RethinkDB, any programming language you choose, any tools you pick. If you're willing to carry that baggage it can be worth it: especially if it gives you developer velocity or if the problem you're solving is particularly well-suited to the tool.
Pick the technology that will have the least baggage for your problem. I often recommend Postgres to people, despite being one of the RethinkDB founders. Pragmatism wins over idealism, every time.
>It’s worth pointing out that this particular blog post raised serious questions about its methodology, and recent versions of RethinkDB included very significant performance improvements: https://github.com/rethinkdb/rethinkdb/issues/4282
I wouldn't even seriously consider that point - the article didn't even mention what version of MongoDB was being used. Safe mode writes could have been off, and he may have been just testing the latency between the client and database nodes. It's a pretty poor benchmark.
You can intentionally do calculations to 'break' indexes and take them out of consideration. (E.g., pg will tend to prefer scanning a primary key index if you're ordering by that in the query. Sometimes this is a very poor choice if there's a more selective index.)
There are also statistics that can be tweaked on vacuum analyze to improve the accuracy of the planner.
Potentially, there was an index which the author knew would yield good performance for a particular query, but the query planner over-estimated the number of rows, and so decided that a sequential scan would be a better choice. By executing the command, "SET enable_seqscan = OFF;" inside a session, you can force postgres to use an index if it possibly can.
These sorts of problems can occur (for example) when very complicated joins are performed, or when unstructured data is used as part of a join, as these tend to lead to poor estimates of the number of rows that the different parts of the query will yield.
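A sketch of what that planner override looks like in practice (the table and query are hypothetical):

```sql
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- ...planner chooses a Seq Scan...

SET enable_seqscan = OFF;   -- session-local: strongly discourage seq scans
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- ...planner now uses the index if one is usable...

RESET enable_seqscan;       -- restore the default for this session
```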
For additional anecdotes, I've had to hint MySQL's planner twice, both times to exclude an index. It'd switch because the cardinality of one index was much better, but there were enough pathological cases (and they were the most-often used) that the planner's choices were proving fatal.
It's not too abnormal. Sometimes you know the structure of your data better than the database's internal summary does, so you just give it a nudge in the right direction.
Another technique that gets used a fair amount to force certain planner behaviors are CTEs. I've used these to great effect in some rather complex queries where performance was critical.
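The reason this works (in Postgres versions current as of this thread) is that a CTE acts as an optimization fence: the planner materializes it and won't push outer predicates into it, so you can pin down one stage of the plan. A hypothetical sketch:

```sql
-- The CTE is materialized first, so the planner cannot rewrite
-- the query in a way that bypasses this pre-filtering step
WITH recent AS (
    SELECT * FROM events
    WHERE created_at > now() - interval '1 day'
)
SELECT r.*, u.name
FROM recent r
JOIN users u ON u.id = r.user_id;
```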
There are different ways you can write queries that result in equivalent results. For instance you might try using inner joins, sub selects, lateral joins, etc.
We're preparing for an open source release of exactly that -- a document-oriented data store on top of Postgres and Elasticsearch, with transactions, joins, optional schemas, a query language, fine-grained document patches and change feeds. Email me if you want a ping when it's available.
Why Elasticsearch when Postgres has full-text search built in? It's one of my favorite Postgres features, I basically never have to bother with Elasticsearch/Solr/etc.
Good question. I hope to create a backend that actually stores data in Postgres, too.
We don't really use Elasticsearch because of the fulltext support, although we do use that, too.
Our document store is split into two parts: (1) A highly transactional data store (which uses Postgres) which stores master data and where everything is strict; and (2) an eventually-consistent search index (which uses Elasticsearch) where everything is expendable and queries are less strict.
This has the benefit of allowing extreme horizontal scalability on the read path, without impacting the performance of the write path, but at the cost of less consistency. By dividing the two, we can control the flow of data into the write path; for example, some clients do batch imports that are indexed more slowly than real-time updates.
The challenge with layering the search index on top of Postgres is how to represent the data. The document store manages the schemas for you, so we know to some extent what the data is, but also supports either partially or completely open schemas (where the allowed fields are either partially validated or completely schemaless).
We could build tables dynamically from schemas, or we could denormalize the data into a less efficient [id, field, type, value] table, or we could index JSONB documents (GIN, not as efficient as B-trees on normal columns afaik, and limited in some ways). There are a bunch of options.
It doesn't make sense to me to architect these kinds of applications at the database level.
What's wrong with using something like Redis pub/sub? I don't get the obsession with evented databases, or with implementing this kind of thing at the database level. I suppose it's attractive to "listen" to a table for changes, but the pattern can be implemented elsewhere and with better tools.
Databases should be used for persistence, organization and schema of data, have flexible querying, and not much else.
I am explicitly not taking any side here, but that is certainly a matter of debate. Relational database management systems in particular have all this functionality, like views and triggers and stored procedures, so that you can move a substantial part of your business logic into the database management system and have only very thin clients, in order to minimize duplication of business logic across different clients. Treating a database management system as a dumb data store is only one possible philosophy, admittedly the one that seems to be dominant currently, at least as far as I can tell.
You can flip this argument and make a similar point:
"What's wrong with using something like Postgres LISTEN? I don't get the obsession with Redis pub/sub, or with implementing this kind of thing outside the database. I suppose it's attractive to "subscribe" to a collection for changes, but the pattern can be implemented inside a database, which has better tools."
In more constructive terms, databases are familiar to a lot of folks and have reliable guarantees (persistence, ACID) that are generally useful properties. If you're able to achieve the performance you need within the database, then you get a lot of operational benefits from keeping your workload running inside of it. If your workload for some reason is slow inside of a database, then it certainly makes sense to consider specialized alternatives (like Redis). In my experience, however, you can tune a database to perform better than these tools in pretty much every real-world case (i.e. moderate concurrency with realistic load).
Postgres LISTEN/NOTIFY is actually very similar to Redis pubsub. You can manually add triggers to send notification on changes, sure, but the real "in the database" stuff is in RethinkDB.
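A minimal sketch of that manual-trigger approach (table and channel names are made up):

```sql
-- Broadcast each changed row as JSON on a channel named after the table
CREATE OR REPLACE FUNCTION notify_change() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify(TG_TABLE_NAME, row_to_json(NEW)::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tasks_notify
AFTER INSERT OR UPDATE ON tasks
FOR EACH ROW EXECUTE PROCEDURE notify_change();
```

Clients then LISTEN on the table's channel name. One caveat: NOTIFY payloads are limited to 8000 bytes by default, so for large rows it's common to send only the primary key and re-read the row.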
Aren't they both just the observer pattern? Seems like it falls into the same category as every other "should I implement this in the DB or app land" code.
Also, moving your callbacks to the db means you can call into the db from multiple code bases without fear of replicating the callbacks inconsistently.
That said, I'm not aware of many people who rely on that functionality.
The great thing with LISTEN/NOTIFY is that they work inside transaction boundaries.
This way you can get correct cache invalidation across changes spanning multiple tables really easily. But if you can't use transactions, that doesn't seem helpful.
You can have both. Postgres has foreign data wrappers, with Redis as a first-class citizen. So you can push your data to Postgres and have a trigger publish the updates. Databases have a pretty big toolset to make your life easier when you need to manipulate your data.
Code-level change data capture doesn't work well for e.g. many-row updates, one-off queries... a ton of different stuff. That's not to say it doesn't have uses, but wanting global DB-update monitoring is not one of those things where it fits cleanly.
The author mentioned they're running in Google Cloud Engine. I'm curious; why not use RDS, which has Postgres support? (Especially considering that they are moving from a 3x-redundant setup to one with no redundancy)
>>> Weird. OK, I tried it with some other parameters, and it suddenly took 15 seconds at 100% CPU, with PostgreSQL doing some linear scan through data. Using EXPLAIN I found that with full production data the query planner was doing something idiotic in some cases. I learned how to impact the query planner, and then this query went back to taking only a few milliseconds for any input. With this one change to influence the query planner (to actually always use an index I had properly made), things became dramatically faster. Basically the load on the database server went from 100% to well under 5%.
I am actually interested in this part. Figuring out issues with EXPLAIN is one of my favorite things.
I'm replying due to this same part/paragraph. I've been dealing with some manifestation of EXPLAIN since Oracle 6. In the last few years I've become handy with PostgreSQL's EXPLAIN as well, and this reminds me of my biggest hang-up regarding PostgreSQL; their hostility toward optimizer hints.
Like Mr. Stein, I too have found myself in bad places with PostgreSQL's optimizer. This is commonplace with relational systems; every such system I've ever dealt with, including all versions of Oracle since the mid-90s, Informix, MS-SQL, DB/2 (on AS/400, Windows, and Linux), and PostgreSQL, eventually gets handed a query and a schema that produce the wrong plan and intolerably bad performance. No exception. None of these attempts to create flawless optimizers that anticipate every use case has ever succeeded, PostgreSQL included.
With other systems there are hints that, as a last resort, you can apply to get efficient results. Not so much with PostgreSQL. Not implementing the sort of hints that solve these problems (as opposed to the often ineffectual enable_* planner configuration, unacceptable global configuration and other workarounds needed with PostgreSQL) is policy:
"We are not interested in implementing hints in the exact ways they are commonly implemented on other databases. Proposals based on 'because they've got them' will not be welcomed."
How about proposals based on "because your hint-free optimizer gets it wrong and I require a working solution without too many backflips and somersaults or database design lectures." No? Then sorry; I can't risk getting painted into a corner by your narrow-minded and naive policy. PostgreSQL goes no further than non-critical, ancillary systems when I have say in it. And I do.
You're absolutely correct. Optimizer hints are a source of contention in Postgres. I'm sure you're already aware, but just for completeness, here's some more information on optimizer hints and Postgres from Simon Riggs back in 2011. I don't expect this to sway opinion one way or the other. It's useful to read more on Postgres' policy.
- Introducing hints is a common source of later problems, because fixing a query plan in one special case isn’t a very robust approach. As your data set grows, and possibly changes distribution as well, the idea you hinted toward when it was small can become an increasingly bad idea.
- Adding a useful hint interface would complicate the optimizer code, which is difficult enough to maintain as it is. Part of the reason PostgreSQL works as well as it does running queries is because feel-good code (“we can check off hinting on our vendor comparison feature list!”) that doesn’t actually pay for itself, in terms of making the database better enough to justify its continued maintenance, is rejected by policy. If it doesn’t work, it won’t get added. And when evaluated objectively, hints are on average a problem rather than a solution.
- The sorts of problems that hints work around can be optimizer bugs. The PostgreSQL community responds to true bugs in the optimizer faster than anyone else in the industry. Ask around and you don’t have to meet many PostgreSQL users before finding one who has reported a bug and watched it get fixed by the next day.
Now, the main completely valid response to finding out hints are missing, normally from DBAs who are used to them, is “well how do I handle an optimizer bug when I do run into it?” Like all tech work nowadays, there’s usually huge pressure to get the quickest possible fix when a bad query problem pops up....
Having never used postgres, this seems extremely scary. I've only ever needed index hints a couple times in other databases, but when you need them there isn't usually an alternative. Messing with global knobs is a good way to cause more problems than you are solving.
I've even entertained the idea that every query should be hinted. For OLTP workloads, you practically always know exactly how you want the DB to execute your query anyway. And oftentimes you find out very late that the query planner made the wrong choice and now your query is taking orders of magnitude longer than it should (worse, sometimes this changes at runtime). I've never actually gone through with this religiously though...
Just to be clear, those settings are global per query, not global for the entire server. Still that makes them almost useless for large queries, but I would not exactly call them scary.
>> Having never used postgres, this seems extremely scary. I've only ever needed index hints a couple times in other databases, but when you need them there isn't usually an alternative. Messing with global knobs is a good way to cause more problems than you are solving.
You've got the plot exactly. The last such battle I was involved with ended in creating a materialized view to substitute for several tables in a larger join; without the view there was no way[1] to get an acceptable plan. Creating this view was effectively just a form of programming our own planner. And yes, the need to update the view to get the desired result is an ongoing problem; one that's scheduled to get solved with a migration to another DB.
Like you I've never been all that quick to employ hints. I tend to use them while experimenting during development or troubleshooting and avoid them in production code. But there have been production uses, and you know what? The world did not end. No one laughed at or fired me. No regulatory agency fined me. It did not get posted on Daily WTF. No subsequent maintenance programmer has ever shown up at my home in the dead of night. It just solved the problem, quickly and effectively.
Sure would be nice if people purporting to offer a fit-for-purpose relational systems understood the value of a little pragmatism.
[1] given the finite amount of time we could sacrifice to deal with it
I added code to my clients to do "SET enable_nestloop TO off" anytime they connect to the database. This sets a global flag for that session, which disables nestloop query planning. It could indeed impact other unrelated queries, which is deeply disturbing, but I don't know any way to disable nestloop query planning only for a specific query (aside from constantly setting and unsetting that flag?). Incidentally, here is an example of the query that causes all the trouble: "SELECT * FROM file_use WHERE project_id = any(select project_id from projects where users ? '25e2cae4-05c7-4c28-ae22-1e6d3d2e8bb3') ORDER BY last_edited DESC limit 100;" Search for nestloop in https://github.com/sagemathinc/smc/blob/master/src/smc-hub/p... to see my relevant client code.
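One way to scope such a flag to a single statement instead of the whole session is SET LOCAL, which only lasts until the enclosing transaction ends (a sketch, reusing the flag and query from the comment above):

```sql
BEGIN;
SET LOCAL enable_nestloop = off;  -- in effect only inside this transaction
SELECT * FROM file_use
WHERE project_id = ANY (SELECT project_id FROM projects
                        WHERE users ? '25e2cae4-05c7-4c28-ae22-1e6d3d2e8bb3')
ORDER BY last_edited DESC LIMIT 100;
COMMIT;                           -- enable_nestloop reverts to its default
```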
It's a bit difficult to tell what exactly is going on. For future reference, EXPLAIN ANALYZE provides a lot more information about the execution of the query. It tells you how much time was spent in each part of the plan, as well as how many rows were produced by each part.
From what I can tell, this query is getting the 100 last-edited files of projects a user is part of. The way it is currently executing is by iterating through the most recently edited files, seeing if the user belongs to the project of each file, and repeating until it finds 100 files of projects the user belongs to. Since the query returns no results, I'm guessing you are running the query for a user that is not part of any project, or only of empty projects. This means the query is looking up the projects of every single file only to find that none of them belong to a project the user was a part of. You can check this by running EXPLAIN ANALYZE.
I'm not sure, since you didn't post the EXPLAIN of the query with enable_nestloop = off, but here's what I think is happening. You are getting a merge join between the projects table and the file_use table with the file_use_project_id_idx. If this is correct, this means Postgres first scans through all of the projects and finds the ones the user belongs to. Then it looks up all of the files that are part of one of those projects. Then it sorts those files by the time they were last edited and takes the top 100. I'm not sure if that is what's exactly happening, but I'm sure something similar to it is. You can check how accurate my guess is by running EXPLAIN/EXPLAIN ANALYZE.
The first thing I would try is creating a GIN index on the users field which can be done with the following:
CREATE INDEX ON projects USING GIN (users jsonb_ops);
What I would expect to see is a nested loop join between projects and file_use. The query should use the GIN index to find all projects the user belongs to, then use the file_use_project_id_idx to get the files for each of those projects, then sort the files by the last time they were edited and take the top 100.
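For completeness, the same query can also be phrased with EXISTS rather than = ANY(...), which sometimes gives the planner a better shot at the nested-loop plan described above (a sketch; assumes the GIN index from the parent comment):

```sql
SELECT f.*
FROM file_use f
WHERE EXISTS (
    SELECT 1
    FROM projects p
    WHERE p.project_id = f.project_id
      AND p.users ? '25e2cae4-05c7-4c28-ae22-1e6d3d2e8bb3'
)
ORDER BY f.last_edited DESC
LIMIT 100;
```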
I'm confused, the situation clearly called for a relational database. Almost any RDBMS would have been better than any "schemaless" db in this case. Was it just to get the reactive architecture features? I'm confused why you would architect an application from the database up.
Actually I could do it with SQL, key-value store, and even with document stores, not to mention Erlang's Mnesia (whichever category it falls into). I think without understanding the complete picture, write patterns, read patterns, number of open connections etc. it is hard to say it is definitely SQL.
Nothing was clear to me when I started writing the application. However, after years of running it in production, things did become much clearer, and indeed, a relational database is ideal.
I don't understand the approach, really. They manage a 25+ node cluster of RethinkDB, but are reluctant to introduce a message broker?
In my experience, a message broker isn't such a big operational burden, even more so if it doesn't have to persist messages. And a message broker with pub/sub doesn't make the architecture necessarily more complicated.
Somehow that seems like optimizing along the wrong axis.
ReQL never looked intuitive to me in comparisons like this one, so I stayed away, and I guess that wasn't a wrong assumption.
"Definitely, the act of writing queries in SQL was much faster for me than writing ReQL, despite me having used ReQL seriously for over a year. There’s something really natural and powerful about SQL."
>Everything is an order of magnitude more efficient using PostgreSQL than it was with RethinkDB.
A large part of the sales pitch of "NoSQL" was that traditional RDBMSs couldn't handle "webscale" loads, whatever that meant.
Yet somehow, we continue to see PostgreSQL beating Mongo, Rethink, and other trendy "NoSQL" upstarts at performance, one of the primary advantages they're supposed to have over it.
Let's be frank. The only reason "NoSQL" exists at all is 20-something hipster programmers being too lazy to learn SQL (let alone relational theory), and ageism--not just against older programmers, but against older technology itself, no matter how fast, powerful, stable, and well-tested it may be.
After all, PostgreSQL is "old," having its roots in the Berkeley Ingres project three decades ago. Clearly, something hacked together by a cadre of OSX-using, JSON-slinging hipster programmers MUST be better, right? Never mind that "NoSQL" itself is technically even older, with "NoSQL" systems like IBM's IMS dating back to the 1960s: https://en.wikipedia.org/wiki/IBM_Information_Management_Sys...
My biggest problem with SQL for systems with web interfaces is that it forces you into OLTP, when your workflow is just "select data, show it to the user" -> "user saves changes, save them in the db". Googling for "asp.net deadlock" gives 160k results.
I probably have not worked enough with ASP.NET but I do not get that reference at all. In my experience deadlocks in databases are rare and do not happen on simple databases.
Most asp.net shops rely on SQL Server, which up to the 2012 version did not support MVCC and relied on locking for both writes and reads. The problem is, it struggled with web-style load. If something was writing into a table X, and there was a parallel query doing a join on table X... (especially one involving index seeks and not just lookups), you'd get into heavy-locking territory. Add an ORM, which wouldn't let you influence the order of the joined tables, and you've got yourself a deadlock. This happened on a few occasions in a few different projects I've worked on. The real problem is: there are a lot of technicalities you need to know and understand to use it. Some NoSQL solutions (including MongoDB) try to abstract away those technicalities, and that was the point of my comment.
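The lock-ordering failure described here isn't specific to ASP.NET; the shape of the deadlock can be sketched with two plain locks standing in for locks on two tables, with an acquire timeout playing the role of the database's deadlock detector picking a victim (all names are illustrative):

```python
import threading
import time

# Toy version of the deadlock shape described above: two "transactions"
# take the same two locks in opposite order. The acquire timeout stands
# in for the database killing one side as the deadlock victim.
lock_x = threading.Lock()  # lock on "table X"
lock_y = threading.Lock()  # lock on "table Y"
victims = []

def writer():
    with lock_x:                          # UPDATE X ... (locks X)
        time.sleep(0.2)
        if lock_y.acquire(timeout=0.5):   # ... then needs Y
            lock_y.release()
        else:
            victims.append("writer")      # chosen as the deadlock victim

def reader():
    with lock_y:                          # SELECT ... FROM Y (locks Y)
        time.sleep(0.2)
        if lock_x.acquire(timeout=2.0):   # ... JOIN X (needs X)
            lock_x.release()
        else:
            victims.append("reader")

t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader)
t1.start(); t2.start(); t1.join(); t2.join()
print(victims)  # ['writer'] -- the shorter timeout loses; the other proceeds
```

Neither side can make progress until one gives up its first lock, which is exactly why consistent lock ordering (or MVCC, which avoids reader/writer blocking entirely) makes the problem disappear.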
You might have a point about Postgres being faster than whatever NoSQL database, but there's really no need for all the name calling. Especially the "lazy" is very uncalled for in my opinion - there's just too much to learn, and too little time. If you're a 20-something (or whatever age, for that matter) and have to maintain the entire system of a single company, you're simply not going to be able to be a master of every part of the stack, or even know where you're making the wrong choices.
>Especially the "lazy" is very uncalled for in my opinion - there's just too much to learn, and too little time.
How is that a valid justification for reinventing the wheel, and poorly at that? In the end they're having to come around and learn SQL anyway, judging by all the Mongo hate and "why we switched to Postgres" posts I see here, so they gained nothing by avoiding it.
>or even know where you're making the wrong choices.
They could try actually listening to older developers for once and trusting established solutions like PostgreSQL instead of chasing after everything new and shiny that catches their eye.
> How is that a valid justification for reinventing the wheel, and poorly at that?
I'm not talking about the people reinventing the wheel, I'm talking about the people using the reinvented wheels and not realising that it's a reinvented wheel.
> try actually listening to older developers
This is harder than you make it sound. On the internet, it's hard to know who to listen to, and in real life it can be hard to find older developers working on something relevant to you, and even then their skill levels differ. And even then, they have to be willing to hold your hand while you're setting up your first database, making sure you don't botch the performance by skipping things that come naturally to someone who's been working with PostgreSQL for years and years.
If there's too much to learn and little time, you learn the stuff that computer science got right. Some abstractions are so good that you should prioritize learning them above any technology of the day:
- Relational algebra
- The network stack model (OSI or TCP/IP; which one is irrelevant)
so if there isn't even enough time to learn practical but sub-optimal solutions, a dev should instead absorb books about IT, mathematics and programming theory?
great advice...unless any part of your equation includes receiving a paycheck.
Don't make it sound harder than it is. Each of these is 30-40 hours of work for undergraduate students. You don't have to quit your job and join a monastery to study compilers. It's interesting, challenging and fun for those of us into computing; and it pays back the time spent learning.
So now I have a list of things I should learn according to a random person on the internet, that no one in real life ever gave me before. How am I to know that this list is the one? Why not any of the similar lists that have been given to me before?
And how am I going to develop the front-end of our new website after learning everything you named? There'll still be a lot left to learn before I can even hope to have produced something.
> The only reason "NoSQL" exists at all is 20-something hipster programmers being too lazy to learn SQL
For me, a 20-something SWE who claims to be somewhat proficient in both SQL and NoSQL, it's way more difficult to get the NoSQL data model right. With traditional SQL, you can basically do whatever you want with a model, however it's shaped. But if I have, e.g., a Cassandra model and a new requirement pops up, I can't just do some joining to satisfy it.
A lot of that is true, but to be sure there are use cases for NoSQL--they are simply too often misunderstood or abused, being used when a traditional RDBMS would have worked fine. Elastic/horizontal scaling for certain types of data that are amenable to it (e.g. event logs, feeds, append-only type stuff): it's going to be hard to keep up with something like Cassandra there, where you can pretty much add servers on the fly to pick up load. In-memory stuff for sure, for a distributed volatile in-memory key store. Simple flexibility you get going schema-less: if you have an embedded DB/on-premise application, you can't just get into your customer's environment and perform a migration; having a flexible schema can make patch updates easier.
"NoSQL" has its use cases just like traditional DBs do.
The problem is that most "NoSQL" providers try to sell you their DB for everything when a good old SQL DB might be better.
For example, I worked on an IoT project for an R&D company, and Riak was definitely a better choice than a SQL database for our use case: easily distributed, and it handles a very large number of small writes and a small number of large reads well. It was also a necessity because we could generate several gigabytes per minute, and being able to add Riak nodes dynamically for more storage space and bandwidth was very useful.
But yeah, SQL DBs are like a good toolbox: they can be used in most situations and will work fine. "NoSQL" DBs are usually more of a dedicated tool for a specific use: better for that specific case, but they might not do well for other tasks.
Thanks... I came here to say this. I actually said something similar yesterday and my comment got downvoted (not that I care), but to me the conclusion is: check your requirements and use the proper tool.
While there is some truth in this, I think you are missing the context.
NoSQL first became popular in 2008/09, when web traffic was exploding. In those days 500GB disks were the largest available and were very expensive in server form. CPUs were less powerful, and RAM was much smaller.
All this meant that many sites really were running into the limits of a single-server database. The most common solution back then was master-slave MySQL replication, which had a whole set of problems of its own. Don't forget Postgres replication was pretty rudimentary, and back then MySQL really had a performance edge if you could compromise things like transactional integrity(!) - which many did.
Things like Hadoop solved the reliability issues with distributed MySQL, MongoDB (and CouchDB) tackled the horrible developer ergonomics and Redis tackled performance.
In that context they all made a lot of sense.
Now of course buying two big servers with a heap of RAM and storage and putting Postgres on them with replication is pretty easy.
But RAID had its own set of problems. Because RAID relies on multiple physical disks (often SAS for performance), there were serious limits on how much storage you could fit in a server. This was prior to SSDs being widespread, of course.
Here's a review of a typical 1U server from 2008:
The X4150 can handle up to eight 2.5-inch SAS drives (mine has four 10K drives, 72GB each), 64GB of RAM (mine has 16GB), four Gigabit Ethernet interfaces, and three PCIe slots. This puts the X4150 ahead of the mainstream server pack, as the main contenders in this space generally offer a maximum of 32GB of RAM, and between four and six local disks.[1]
Assuming RAID, this had 144GB of storage, and even back then that was problematic. RAM was very small too, so keeping your DB in memory (or even a working set, or even the indexes) was difficult with a single server.
And of course, yes you could go to 2U or bigger. But back then server space was expensive.
Everything is a trade off of course. But my point is that in the context of 2008 hardware looking to do things differently made a lot of sense.
>Now of course buying two big servers with a heap of RAM and storage and putting Postgres on them with replication is pretty easy.
Is it really? I really have no idea, since I haven't played with PG in a couple of years, but setting up clustering used to be a PITA with it, so if it's gotten better that sounds great. One of the impressive things I saw from RethinkDB is that you can set up sharding and replication in a few steps.
Oh, I don't disagree at all btw. I've been doing MS SQL Server and frontend work professionally recently, so I haven't been in the loop on OSS DBs; glad to hear things improved.
lol "let's be frank"... have you ever worked on a large-scale project yourself? how did running that on a singly homed postgres work out for you?
your comment seriously sounds like "I haven't ever seen anything with non-trivial scale yet, so I'm sure everybody working on it must be lazy/stupid/whatever."
and everybody is upvoting them. gotta love hacker news cargo cult.
I think the NoSQL popularity and appeal was largely due to the fit with the startup/rapid-iteration mentality. The lack of rigid schema promised faster product iteration with less friction compared to traditional RDBMS with strict schemas and typing.
Oh indeed, most RDBMSs offer great flexibility. They just tend to require a lot of configuration work. My impression of the NoSQL wave was "we don't know and we don't care; just throw it all in there and let search sort it out."
In fact, that mentality originated with Google (and it's entirely a bad way of doing things). Managing, categorizing, classifying, etc. in non-trivial cases may be an effort of relative futility, constantly chasing a moving target.
A simple key/value or document store just lets you start storing before you may actually know what you're storing or what the patterns and structures are (or even what you might want to do with the data later). With RDBMS, your designs can change quite a lot based on these requirements or intended uses.
It's fair to point out that young people will lack experience and understanding that older people have but there is such a thing as ageism in the opposite direction -- against younger people -- and your post comes across as such.
I am closer to 30 than 20 but I was closer to 20 than 30 when I was first introduced to NoSQL.
I enjoyed using PostgreSQL prior to NoSQL and I continue to enjoy PostgreSQL.
What got me excited about NoSQL when I first heard about it had nothing to do with SQL being "old". If I thought that "oldness" was a disadvantage -- which I don't, because it's not -- I wouldn't be using an operating system that has its roots in the 70's, and my workflow would not be command-line-centric.
What got me excited about NoSQL was:
- The prospect of serialisation and deserialisation of data without having to rely on an Object-Relational Mapper. The relational model is great for the right kind of data, but the reason we have ORMs is that we are trying to cram data into a model that doesn't quite fit it. This has worked to varying degrees of success or failure. When NoSQL was introduced, PostgreSQL did not yet have native support for JSON, and having fields hold stringified JSON meant that you wouldn't be able to query that data using SQL, so having stringified JSON in a relational DB is not optimal. PostgreSQL has since gotten native support for JSON but here I am talking about when NoSQL was introduced.
- Map-Reduce. If it's a good fit for Google, I should investigate it to understand how to use it and to see if it had a use for me. Even if it would turn out to not bring any advantages, I would have learned it and I'd know to re-investigate it in the future if I needed something like it in order to scale a project that got a lot of traffic.
- Document-oriented instead of tables makes it possible to model data without having a strict schema.
Certainly someone with more experience might have been able to tell why one or more of these were actually not such good reasons, but MY POINT IS that I had actual reasons for being excited about NoSQL, and none of them had anything to do with hipsterism, ageism or anti-oldness. This is why I think your post is ageist: rather than consider that we might have well-thought-out reasons to want to learn about NoSQL, you automatically jumped to the simple, overly broad, and frankly outright offensive explanation that you did.
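On the first point, now that relational databases have native JSON support, the stringified-JSON objection largely goes away: you can query into the document with plain SQL. A sketch using SQLite's JSON1 functions (assuming a build that includes them, which has been standard for years; Postgres's jsonb operators `->`, `->>`, `@>` serve the same role, and the table and fields here are made up for illustration):

```python
import sqlite3

# Sketch of querying into JSON with plain SQL. The JSON column is
# queryable directly -- no ORM round-trip or string parsing needed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "INSERT INTO events (body) VALUES (?), (?)",
    ('{"user": "alice", "action": "login"}',
     '{"user": "bob", "action": "logout"}'),
)

# Filter and project on fields inside the JSON document.
rows = conn.execute(
    "SELECT json_extract(body, '$.user') FROM events "
    "WHERE json_extract(body, '$.action') = 'login'"
).fetchall()
print(rows)  # [('alice',)]
```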
One important exception about NoSQL's raison d'être. If all you need is a key-value store without any fancy requirements then NoSQL^W pre-relational databases are perfectly good choices.
(Well, *dbm and memcachedb et al sorta pre-date that NoSQL-is-webscale database boom you're writing about.)
Not every web app is webscale. From all the evidence I saw SMC is tiny in webscale terms - 80GB of data or 600 open connections are not what NoSQL databases were meant to solve.
Using this anecdote to say that Postgres can handle webscale is ludicrous.
> Yet somehow, we continue to see PostgreSQL beating Mongo, Rethink, and other trendy "NoSQL" upstarts at performance, one of the primary advantages they're supposed to have over it.
Yeah, pg will beat the pants off of pretty much any NOSQL DB on a single node, but what happens when you need master-master replication of sharded data in each datacenter as well as between datacenters?
Sure, denormalize and enforce data integrity at the application layer, shard the data, use queues for data updates to ensure every data-center gets all the data (eventually), and so on.
Now you're essentially treating the RDBMS as an eventually consistent NOSQL data store, horizontal scaling within a DC is still going to be annoying once you can't scale the shards vertically anymore, interruptions of network connectivity (or just increased latency) between DCs create huge headaches, and the (right) NOSQL DBs will beat the pants off of it for an equivalent hardware or hosting budget.
That said, NOSQL datastores solve problems that your startup will only encounter once you achieve product-market fit and enter a hypergrowth phase. Your MVP does not need an eventually-consistent distributed NOSQL backend, dammit.
Heck, your MVP may not even need Postgres or MySQL either. Just use SQLite, back up your server, and redeploy on pg only when (if) you start getting some traction.
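The "just use SQLite" suggestion really can be this small -- a sketch (the schema and names are made up for illustration):

```python
import sqlite3

# Minimal "just use SQLite" MVP setup: one file on disk, no server
# process, and backup is literally copying the file.
conn = sqlite3.connect(":memory:")  # use a path like "app.db" in practice
conn.execute("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    )
""")

with conn:  # implicit transaction, committed on success
    conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)  # 1
```

Because it's ordinary SQL behind a DB-API driver, a later move to Postgres is mostly a driver and DDL change rather than a rewrite.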
So are there any actual success stories that the NoSQL movement can point at? It's bizarre to me how NoSQL can still be all the rage when time after time all I read is post-mortems detailing painful experience after painful experience.
I'm at a stage where I haven't built enough of my current project to make moving back to an RDBMS painful yet, so all this stuff scares me.
I've built out several large scale features on DynamoDB and I can say that with a proper design it is really nice to only worry about budget when scaling. Just crank the provisioned read/write capacity up as needed.
I built https://github.com/llambda/agilegps on RethinkDB. The atomic changefeeds made it easy to have real-time updates of tracked vehicles' locations. Though it might be possible to do that with listen/notify in PostgreSQL too.
I am a bit puzzled about the scalability of listen/notify in Postgres and its use in the article. Each "listener" in the code requires its own database connection, so it's not a good design to have one LISTEN "query" for each user. You'll probably need a dedicated connection in a thread (or a limited number of connections) for the database listening functionality. You could use an async PG driver, but even then I'm not sure how efficient and scalable this solution will be on the database end.
I can assume that this is a good solution if you don't have (need) a high rate of "notify" statements and a high number of subscribers waiting on "listen". Any comments on these limits of PostgreSQL?
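For what it's worth, one common way around the connection-per-listener problem is to hold a single dedicated LISTEN connection and fan notifications out to in-process subscribers. A minimal sketch of the fan-out side (all names are illustrative; the `incoming` queue stands in for a psycopg2 connection that has issued LISTEN and is polling for notifications):

```python
import queue
import threading
from collections import defaultdict

# Fan-out dispatcher: one thread stands in for the single dedicated
# LISTEN connection; per-user queues stand in for websocket sessions.
incoming = queue.Queue()          # fed by the one DB connection in real use
subscribers = defaultdict(list)   # channel -> list of per-user queues
lock = threading.Lock()

def subscribe(channel):
    """Register an in-process subscriber; no extra DB connection needed."""
    q = queue.Queue()
    with lock:
        subscribers[channel].append(q)
    return q

def dispatcher():
    while True:
        channel, payload = incoming.get()
        if channel is None:       # shutdown sentinel
            return
        with lock:
            targets = list(subscribers[channel])
        for q in targets:         # one notification reaches every subscriber
            q.put(payload)

t = threading.Thread(target=dispatcher)
t.start()

alice = subscribe("projects")
bob = subscribe("projects")
incoming.put(("projects", '{"id": 42, "op": "update"}'))
incoming.put((None, None))
t.join()

msg_a, msg_b = alice.get(), bob.get()
print(msg_a, msg_b)  # both subscribers receive the same payload
```

With this shape, the number of Postgres connections used for notifications stays constant no matter how many users are subscribed.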
Good grief. Well this is a genuinely fascinating post containing a couple of absolutely terrifying insights:
"A RethinkDB employee told me he thought I was their biggest user in terms of how hard I was pushing RethinkDB."
Erm. If I were working for/a founder of a relatively small company using a product or service, especially one that's so critical to my own business, that is not the sort of thing I'd want to hear from the provider.
"Everything was a battle; even trying to do backups was really painful, and eventually we gave up on making proper full consistent backups (instead, backing up only the really important tables via complete JSON dumps)."
Holy crap.
Well, the story has a happy ending, and I think the point about the fundamental expressiveness of SQL is something that a lot of people miss in the mad dash to adopt "simpler" NoSQL solutions. I personally find SQL verbose and a bit ugly, but I still sort of love it because it's hugely powerful and expressive. I was perhaps 6 or 7 years into my career before I became comfortable with it, but I wish I'd thrown myself into learning it properly sooner because it is so incredibly useful.
Thanks! NOTIFY is much nicer to use today, since you can also include a payload, and there is a clean way to format data to include in the payload using all the new json functions. I understand that NOTIFY has also been sped up over the years...
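One practical detail when including data in the payload: NOTIFY payloads must be shorter than 8000 bytes by default, so a common pattern is to serialize the row as JSON and fall back to sending only the key, letting listeners re-query for big rows. A sketch (the helper and field names are hypothetical):

```python
import json

MAX_NOTIFY_PAYLOAD = 7999  # Postgres default: payload must be < 8000 bytes

def make_payload(row):
    """Serialize a row for pg_notify; fall back to key-only if too large."""
    full = json.dumps(row)
    if len(full.encode("utf-8")) <= MAX_NOTIFY_PAYLOAD:
        return full
    # Too big: send just the primary key and let the listener re-query.
    return json.dumps({"id": row["id"], "truncated": True})

small = make_payload({"id": 1, "title": "hello"})
big = make_payload({"id": 2, "blob": "x" * 10000})
print(small)
print(big)  # key-only fallback with a "truncated" flag
```

On the server side the equivalent would be something like `pg_notify('channel', row_to_json(NEW)::text)` in a trigger, with the same size caveat.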
1) It's rare to have enough insight into the internals of a particular datastore to accurately predict how it will perform on a particular workload. Whenever possible, early testing on production-scale workloads is essential for planning and proofs of concept.
2) Database capabilities are a moving target. E.g., the performance improvements to pgsql's LISTEN/NOTIFY are essential to its ability to handle this particular workload. In previous jobs, I've had coworkers cite failed experiences with 15-20 year-old databases as reasons for not considering them for new projects. Database tech has come a long way in that time.
3) Carefully-tuned RDBMSs are more capable than many tend to admit.
reply