
> More knowledge and options are always better.

That's not even true in theory. Given a simple problem A, as you keep adding options, at some point choosing among the options requires more effort than just solving the simple problem, if only by brute force. There's a reason an RDBMS sometimes skips indexes.

And in practice, GP is right: with humans, too many options eventually only distract from the problem.





> And in practice, there's usually only two or three really big tables involved in performance-sensitive queries, tables which you need to pay attention that you don't end up doing scans on.

The problem is that those few big tables are often critical to most of the queries and so each query has to carefully use the right hints or query order if the planner isn't doing much.

It almost makes me wonder if indexes themselves should get hints or at least priorities to help the planner order operations.


> Honestly, SQL has this problem too, but it presents itself not in the way you store, but in the way you query. There are simple schemas and complex ones, and irrespective of that there are obvious query sets and unplanned ones (i.e. written at runtime as part of the data analysis process). SQL and its autoplanning is required only for complex+unplanned cases, in my opinion. In all other cases I know my data and I’d better walk through the indexes myself rather than writing 4-story queries to satisfy the planner. At the end of the day, nested loops through the indexes is what RDBMS does. There is no declarative magic at the fetch-and-iterate level.

The thing is that what worked at a specific time can change. For example, take a simple join between two tables, say A and B: you search by a column in table A to get a value from a column in table B. If both tables are large, it makes sense to look up the row in A via an index, then follow the foreign key and its index to find the matching row in table B.

But if A and B only have a few rows, then even with an index on both of them, it's actually faster to just scan one or both tables.
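
A minimal sketch of that situation, with made-up table names (the exact plan depends on row counts and statistics):

    CREATE TABLE b (id int PRIMARY KEY, val text);
    CREATE TABLE a (
        id   int PRIMARY KEY,
        b_id int REFERENCES b (id),
        key  text
    );
    CREATE INDEX ON a (key);
    CREATE INDEX ON a (b_id);

    -- With big tables the planner tends to walk the index on a.key and
    -- then nested-loop into b via its primary key index; with only a
    -- handful of rows it will usually just seq-scan one or both tables.
    EXPLAIN ANALYZE
    SELECT b.val
    FROM a
    JOIN b ON b.id = a.b_id
    WHERE a.key = 'something';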

It might actually be more beneficial to ensure that the tables are properly analyzed, have the right indices, and that the query planner's preferences are tuned.

If you need to override the query planner, you don't have to write convoluted queries; you can just use this[1] extension. Though if things aren't working right, it is usually either a lack of (statistics) data, a misconfiguration, or a bug.

[1] http://pghintplan.osdn.jp/pg_hint_plan.html
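
For what it's worth, the hints in pg_hint_plan are just special comments placed in front of the statement; a rough sketch using the made-up tables from above:

    -- Force an index scan on a and a nested loop join between a and b,
    -- regardless of what the cost model would otherwise pick.
    /*+ IndexScan(a) NestLoop(a b) */
    SELECT b.val
    FROM a
    JOIN b ON b.id = a.b_id
    WHERE a.key = 'something';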


> Indexes are such an easy thing to add.

Indexes can have catastrophic consequences depending on the access patterns of your table. I use indexes very sparingly and only when I know for sure the benefit will outweigh the cost.
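
One way to sanity-check that trade-off in Postgres, if it helps, is to look for indexes that take writes but never serve reads (a sketch; adjust to taste):

    -- Every INSERT/UPDATE also has to maintain each of these b-trees,
    -- so an index that is never scanned is pure write overhead.
    SELECT schemaname, relname, indexrelname, idx_scan
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0
    ORDER BY relname, indexrelname;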


>> Yeah. But the whole discussion here was about the dilemma the optimizer faces if it only has the two indexes on (created_at) and (organization_id), and why the assumption of independence/uniformity does not work for the skewed case.

> Was it? The whole discussion started when zac23or said at the top of the thread that their biggest problem with PostgreSQL is its optimizer, that this is their worst example, that it needs an index on (organization_id, created_at) for it to be solved, and that SQLite and MS SQL Server do not need that index on (organization_id, created_at). ...

I don't know, but that's how I understood the discussion - I tried to explain why the Postgres optimizer makes this choice, and why planning this type of query (LIMIT + WHERE) is a hard problem in principle.

>> Of course, adding a composite or partial index may help, but people are confused why the optimizer is not smart enough to just switch to the other index, which on my machine does this:

> I'm curious how you forced the optimizer to switch to that query plan if it's not smart enough to do it on its own.

I simply dropped the index on created_at, forcing the planner to use the other index.
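
Roughly like this, with placeholder table/index names; wrapping it in a transaction keeps the index around (note that DROP INDEX takes an exclusive lock on the table, so only do this on a test copy):

    BEGIN;
    DROP INDEX events_created_at_idx;   -- hide the created_at index from the planner
    EXPLAIN ANALYZE
    SELECT *
    FROM events
    WHERE organization_id = 10
    ORDER BY created_at
    LIMIT 10;
    ROLLBACK;                            -- DDL is transactional, the index is back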

> I think a third question, which is implicit in your comments is, "Well then why doesn't the optimizer use one plan for organization_id=10 and the other plan for the other values of organization_id?" That's a good question. Is this possible? Does any other database do this? I don't know. I'll try to find out but if somebody already has the answer, I'd love to hear it.

The answer is that this does not happen because the optimizer does not have any concept of dependence between the two columns, or (perhaps more importantly) between specific values in those two columns. So it can't say that for most organizations the columns aren't correlated but for ID 10 they are (and that it should use a different plan there).

Maybe it would be possible to implement some new multi-column statistics. I don't think any of the existing optional extended statistics could provide this, but maybe some form of multi-dimensional histogram could? Or something like that? But ultimately, this can never be a complete fix - there'll always be some type of cross-column dependency the statistics can't detect.
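
For reference, the extended statistics that exist today look roughly like this (table/column names follow the example above, so treat them as placeholders); as said, they capture functional dependencies and common value combinations rather than the per-value correlation this case would need:

    -- PostgreSQL 10+ supports "dependencies" and "ndistinct",
    -- PostgreSQL 12+ adds "mcv" (most common value combinations).
    CREATE STATISTICS org_created_stats (dependencies, ndistinct, mcv)
        ON organization_id, created_at FROM events;
    ANALYZE events;   -- the new statistics are only built on the next ANALYZE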

> A fourth question is, "How do other databases perform on this and similar queries?"

I'm not sure what all the other databases do, but I think handling this better will require some measure of "risk" or "plan robustness" - in other words, some ability to say how sensitive the plan is to the estimates not being perfectly accurate, or to some other assumption being violated. But that's going to be a "better worst case, worse average" kind of solution.


> adding naive indexes

You get 90% of the improvement just from that.

Indexes are the whole reason anyone even uses databases. And yet some backend guys think they are optional.


> Like you write a simple looking query and it turns out to have O^2 complexity

Or you have a simple fast query with a lovely plan until the database engine decides that because you now have 75 records in the middle table instead of 74, the indexes are suddenly made of lava and now you're table-scanning the big tables and your plan looks like an eldritch horror.

[Not looking at MySQL in particular, oh no.]


> There's nothing you can do in an index for a table with 350 rows

I meant if you have 100s of indices -- it might be slow even on small tables.

I actually had production issues from using a lot of indices, but it's not apples-to-apples with the current discussion, because the table sizes, DBMS, and update rates were much larger; it was fixed by removing indices and splitting the table into multiple tables.


> the argument would be that you don't bear the cost for that for most queries

Your INSERT becomes even more expensive because you now have to update 2 indexes, and your lookup doesn't change.

I get the benefit if you use the id as a FK or if the database must have data on disk in PK order like MySQL or MSSQL (?). On PostgreSQL if you just insert & read on that key only it seems objectively worse.


> but the worry here is with how sparse the id's might be, and the fact it might have to occasionally retry 20 times -- unpredictable performance characteristics like that aren't ideal.

Is this really a big deal? Surely for a SQL database, looking up the existence of a few IDs, just about the simplest possible set operation, is an operation measured in a fraction of a millisecond. Wikipedia is not that sparse, and a lookup of a few IDs in a call would easily bound it to a worst case of <10 calls and still just a millisecond or two.
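
Concretely, checking a small batch of candidate IDs is one primary-key probe per ID and fits in a single round trip; a sketch with a made-up table name:

    -- Returns only the candidates that actually exist; each probe is an
    -- index lookup, so a handful of IDs is well under a millisecond on a
    -- warm cache.
    SELECT id
    FROM articles
    WHERE id = ANY (ARRAY[1234, 5678, 9012, 3456]);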


> adding hints makes your queries dependent on current performance

I don't understand this, can you explain?

> Assuming you're a good DBA, you can be better than the optimizer because you understand the whole system

That's inaccurate. As a DBA I can understand the system better at a higher level, but the database has statistics which typically give it a better understanding of the data distribution at a lower level. I could indeed pull that information out and feed it into the query plan via hints, but that would take an enormous amount of my time, and I would have to do it every time the query is run, whereas the database can keep an eye on the statistics as they vary and rebuild the plan over time.

Equally, if an index is added, you can expect the database to start using it immediately, without anyone adding hints. Etc.
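
As a concrete example of the statistics the database keeps current on its own (Postgres shown; the table name is made up):

    -- Per-column statistics maintained by autovacuum / ANALYZE:
    SELECT attname, n_distinct, most_common_vals, correlation
    FROM pg_stats
    WHERE tablename = 'orders';

    -- Refresh them by hand after a big load, if needed:
    ANALYZE orders;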

I am a fairly(?) skilled DBA who has a reasonable idea of what goes on under the hood; I do have some idea of what I'm talking about.


> Sorry, no. The original schema was correct, and the new one is a mistake.

Well, you also save space by doing this (though presumably you only need to index the emails, as IPs are already 128 bits).

But other than that, I'm also not sure why the original schema was bad.

If you were to build individual indexes on all four columns, you would essentially build the four id tables implicitly.

You can calculate the intersection of hits on all four indexes that match your query to get your result. This is linear in the number of hits across all four indexes in the worst case, but if you are careful about which index you look at first, you will probably be a lot more efficient, e.g. (from, ip, to, helo).
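
In Postgres that intersection is roughly what a bitmap scan gives you for free; a sketch with guessed table/column names:

    CREATE INDEX ON greylist (mail_from);
    CREATE INDEX ON greylist (ip);
    CREATE INDEX ON greylist (rcpt_to);
    CREATE INDEX ON greylist (helo);

    -- The planner can combine the four single-column indexes with a
    -- BitmapAnd, starting from whichever index looks most selective.
    EXPLAIN
    SELECT *
    FROM greylist
    WHERE mail_from = 'a@example.org'
      AND ip        = '192.0.2.1'
      AND rcpt_to   = 'b@example.net'
      AND helo      = 'mail.example.org';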

Even with a multi index on the new schema, how do you search faster than this?


> aimed at data engineers

I disagree. Professional developers should know their database of choice inside and out, and use-the-index-luke helps with that. You can skip the details about databases that aren't relevant to you.


>It takes a single query and suggests indexes for it. A good one would take a mix of queries and suggest a set of indexes, also considering the impact on write speed of additional indexes (table updates often need to update indexes, too)

This is my pet peeve with SQL Server: SSMS will give you a missing-index suggestion and its cost... the problem is that inexperienced people take the suggestion as-is and create way too many highly specialized indexes over time.


> how does the query engine know

It doesn't, really.

When a query planner is deciding the best path for answering a query, it will generate many different plans with different costs. Estimations are sometimes wrong, so planners may try several different plans for the same query over time and eventually settle on the one that actually did the best.

Databases are also constantly collecting statistics on the values in each column, such as min/max, distribution, cardinality, rows per page/block, values at different percentiles. It's possible for these statistics to change over time, either suddenly or slowly, and this can mean the cached plan is sub-optimal. Or maybe you're adding/removing indexes. Plans become stale, and new plans will need to be tested.

So let's say you create an index on `createdAt`. When you ask for rows `WHERE createdAt >= '2020-01-01' AND createdAt < '2020-04-01'`, it's going to generate plans considering that index and compare it against a plan without that index.

Because the index is sorted and we have a range query (min/max), the planner can very quickly find the inner b-tree nodes covering all rows that match the query, and estimate how many pages may need to be read from disk and how many rows that will include. It then knows exactly which data pages need to be scanned to find the rows we're interested in.

Without that index, it has no idea about the distribution of values of `createdAt`. It's very possible that 99% of all rows are sorted, but for whatever reason 1 row is completely out of place (should be row #100, but is row #10,000). Every row will need to be scanned to be sure.

Even with the index, the database statistics include min/max values. Let's say we change our query to `WHERE createdAt >= '1900-01-01' AND createdAt < '2100-01-01'`, and it includes EVERY row in the table. The query planner will be able to figure this out, and will generate a less costly plan that just does a full table scan instead and skips the index.
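
A concrete version of the above (table name made up):

    CREATE INDEX events_created_at_idx ON events ("createdAt");

    -- Narrow range: the statistics say few rows match, so an index scan wins.
    EXPLAIN
    SELECT * FROM events
    WHERE "createdAt" >= '2020-01-01' AND "createdAt" < '2020-04-01';

    -- Range wider than the column's observed min/max: the estimate is
    -- "every row", so the planner skips the index and does a seq scan.
    EXPLAIN
    SELECT * FROM events
    WHERE "createdAt" >= '1900-01-01' AND "createdAt" < '2100-01-01';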


>It's annoying for sure but a smart editor could simply offer you all possible columns in the current context

I'm using databases with >500 tables.

I guess it's not that trivial


>Database optimization needs to generally take into account YOUR data and usage patterns.

I'd say this is quite true.

>I've found that an unfortunate number of people aren't aware of "EXPLAIN" and its use to help figure out query issues. Learn and use it.

Yes, we discovered this some years ago.

Thanks for your input.


> A database with no indexes is slow.

No it’s not… if all you do is write to it. In fact, it’s the fastest possible database for that case.

Indexes are pure redundancy - they duplicate data that is already in the base table, and they must be maintained during writes.

But they can make reads so much faster, if the data access pattern can utilize them. The key is to identify access patterns which justify the price of the index.


> Without benchmarking it, my first instinct would still be to design the schema in a way to avoid recursive queries, if possible.

In my experience that's a good idea. The query planner has a really hard time with recursive queries.
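
For anyone not sure what that looks like, the usual culprit is a parent/child tree walked with a recursive CTE (names made up):

    -- Walk a category tree stored as an adjacency list. The planner has
    -- no good way to estimate how many levels or rows the recursion will
    -- touch, which is part of why these plans go wrong so easily.
    WITH RECURSIVE subtree AS (
        SELECT id, parent_id, name
        FROM categories
        WHERE id = 42                    -- root of the subtree
        UNION ALL
        SELECT c.id, c.parent_id, c.name
        FROM categories c
        JOIN subtree s ON c.parent_id = s.id
    )
    SELECT * FROM subtree;

Flattening the hierarchy into something like a materialized-path column or a closure table is the usual way to avoid the recursion, at the cost of extra writes.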

