By Pavel Tiunov, Statsbot
A single query optimization tip can boost your database performance by 100x. At one point, we advised one of our customers that had a 10TB database to use a date-based multi-column index. As a result, their date range query sped up by 112x. In this post, we share five simple but still powerful tips for PostgreSQL query optimization.
Although we usually advise our customers to use these tips to optimize analytic queries (such as aggregation ones), this post is still very helpful for any other type of query.
To keep it easy, we ran examples for this article on a test dataset. Although it doesn’t show the actual performance improvement, you will see that our tips solve the significant set of optimization problems and work well in real-world case scenarios.
Postgres has a cool extension to the well-known
EXPLAIN command, which is called
EXPLAIN ANALYZE. The difference is that
EXPLAIN shows you query cost based on collected statistics about your database, and
EXPLAIN ANALYZE actually runs it to show the processed time for every stage.
We highly recommend you use
EXPLAIN ANALYZE because there are a lot of cases when
EXPLAIN shows a higher query cost, while the time to execute is actually less and vice versa. The most important thing is that the EXPLAIN command will help you to understand if a specific index is used and how.
The ability to see indexes is the first step to learning PostgreSQL query optimization.
One index per query
EXPLAIN ANALYZE SELECT line_items.product_id, SUM(line_items.price) FROM line_items WHERE product_id > 80 GROUP BY 1
An explain analyze returns:
This query scans all of the line items to find a product with an id that is greater than 80, and then sums up all the values grouped by that product id.
Now we’ll add the index to this table:
We created a B-tree index, which contains only one column:
product_id. After reading lots of articles about the benefits of using index, one can expect a query boost from such an operation. Sorry, bad news.
As we need to sum up the price column in the query above, we still need to scan the original table. Depending on the table statistics, Postgres will choose to scan the original table instead of the index. The thing is, index lacks a
We can tweak this index by adding a price column as follows:
If we rerun the explain plan, we’ll see our index is the forth line:
But how would putting the price column first affect the PostgreSQL query optimization?
Column order in a multicolumn index
Well, we figured out that a multicolumn index is used in the previous query because we included both columns. The interesting thing is that we can use another order for these columns while defining the index:
If we rerun explain analyze, we’ll see that
items_product_id_price_reversed is not used. That’s because this index is sorted firstly on
price and then on
product_id. Using this index will lead to a full index scan, which is nearly equivalent to scanning the whole table. That’s why Postgres opts to use scan for an original table.
It’s a good practice to put in the first place columns, which you use in filters with the biggest number of unique values.
Filters + joins
It’s time to figure out what the best set of indexes is for a specific join query, which also has some filter conditions. Usually, you can achieve optimal results by trial and error.
As in the case of simple filtering, choose the most restrictive filtering condition and add an index for it.
Let’s consider an example:
Here we have join on
order_id and filter on
created_at. This way, we can create a multicolumn index that will contain
created_at in the first place,
order_id in the second and
price in the third:
We’ll get the following explain plan:
As you can see,
line_items_created_at_order_id_price is used to reduce scan by date condition. After that, it’s joined with orders using the
orders_pkey index scan.
Date columns are usually one of the best candidates for the first column in a multicolumn index as it reduces scanning throughput in a predictable manner.
Our tips for PostgreSQL query optimization will help you to speed up queries 10-100x for multi-GB databases. They can solve most of your performance bottlenecks in an 80/20 manner. It doesn’t mean you shouldn’t double check your queries with
EXPLAIN for real-world case scenarios.
Original. Reposted with permission.
- Scalable Select of Random Rows in SQL
- SQL Window Functions Tutorial for Business Analysis
- Calculating Customer Lifetime Value: SQL Example