Scaling (horizontal + vertical) and HA are the biggest weaknesses of Postgres. It's modern in features and flexibility but still struggles with core infrastructure issues and performance.
It doesnt scale vertically either. Postgres is single-threaded meaning it can't make use of multiple CPU cores on the same machine. There have been some slow improvements to this and 9.6 seems to hint at some parallel aggregation changes but overall Postgres is strong in features but weak in scaling (in any direction).
Horizontal scaling is still challenging in Postgres. That’s where “NoSQL” systems still have their place. Or one of the SQL systems layered on to a NoSQL architecture, with the trade offs that entails.
Postgres neither scales vertically or horizontally.
v9.6 finally brought some parallel scans but is still limited in scope and far behind the commercial databases. Same with replication and failover, although there are decent 3rd party extensions to get it working.
That postgres is behind in scaling (multi-core, replication, sharding, multi-master) against all modern relational databases is verifiable truth. Your judgement is clouded by your proximity to the project. Here's just 1 prior thread between us: https://news.ycombinator.com/item?id=13841496
-> Enable 1-line command to setup basic replication (whole node or per db level) without having to edit anything else. Look at redis or memsql for inspiration.
-> Enable 1-line command to easily failover, remove hotstandby requirements.
-> Add DDL support to logical replication because I guarantee this will be the #1 source of problems for anyone that tries to use this.
-> Fix the poor defaults in all the configs.
-> Document the existence of pg_conftool and make it a first-class tool that can also edit pg_hba.conf automatically so I no longer have to deal with any config files. Store any other settings in the database cluster itself. Inspiration from sql server on linux or cockroachdb here.
IT/admin UX at this point is more important than the diminishing core upgrades. Postgres is good enough internally but has sorely fallen behind on usability with obsolete defaults, convoluted configs and way too many 3rd party tools needed for basic scaling. Saying use X tool is effectively the same as just use Y database that already does it. MySQL, MariaDB, even SQL Server are not standing still.
Solid in-cluster and cross-dc replication is the reason we stuck with memsql (+ built-in columnstore), not because it was possible but because it actually works easily and reliably with their fantastic memsql ops program. I've shared all of this feedback online over the years (along with many others) so the lack of understanding is really more a reflection of the postgres team and poor priority planning.
In my experience more often than not, Postgres performance problems aren't really caused by the database, but either badly designed schemas or queries. For a lot of developers, the thinking goes that 10s of millions of rows sounds like a lot like big data, so they must start building microservices, distributed systems, use K/V stores and horizontally scale a la Google, whereas their entire dataset could actually fit in RAM of a single server and they could avoid the majority of the complexity if they just scaled vertically, which is usually much, much easier.
How far can one get these days with vertical scaling of Postgres?
I dont know how well the engine could make use of the extra resources.
If it scales well vertically, I expect it to cover 80% - 95% of
use cases for people using Postgres.
Probably with less complexity, and less overhead than scaling horziontally
I'm curious what you mean by Postgres not having a "scale-up" strategy. Are you saying that eg MS SQL works better with a few TBs of ram and 128 cores than Postgres does?
But you don't need to horizontally scale if you use a beefy Postgres instance. Horizontal scaling is a problem created by using an inefficient DB in the first place.
I am not sure I agree with the general idea that Postgres can't or even--albeit a bit less strongly--that it is hard to scale. Even in 2008 people were running petabyte-scale warehouses using Postgres:
Since 2008 improvements in parallel query execution (and numerous other improvements) in the core project plus the availability of forks/extensions which abstract and/or modify various bits for improving scalability (see Citus and Timescale) it's never been easier to scale Postgres to some truly staggering heights.
While I wouldn't want to speak in absolutes, there are very few applications where I think Postgres wouldn't be a viable choice as a data warehouse.
Emphasis on warehouse as I wouldn't want to suggest Postgres as an ideal candidate to be a data lake. The difference between them for me being whether or not the data is structured/processed. Similar in definition to this article:
Personally, I have experience scaling core PostgreSQL (9.4) to handle ingestion of monitoring data for web servers to the tune of 2-3 terabytes a day. Not the grandest of scales, but enough to have seen a few bumps along the way...and, for what it's worth, I think it is surprisingly easy to scale.
I wouldn't want to sign up to scale Postgres to handle exabyte data loads, but single digit petabytes? Sure.
Have you used postgres for a work queue before? I have, and it did not scale even for smaller workloads. It's possible there are new primitives that make it more efficient, but back in 2014/2015 it didn't work.
Postgres isn't designed from the ground up for high availability or scalability. Most of the in-built functionality for setting up HA wasn't introduced until 9.0, and the process of setting up HA postgres with automatic failover requires third-party tools. As for scalability, postgres can only scale vertically.
If you need a solution that can scale writes to more than than one node, or a solution that has first party support for HA with automatic failover, you shouldn't be using postgres.
As an aside, I find the dogmatic "just use postgres for everything" choir just as bad as the marketing BS associated with noSQL databases.
It's really interesting, that Postgres is extremely efficient, but same time can't scale vertically well enough, when there is synchronous replication. Unlike its distributed counterparts.
reply