About the usefulness of indexes on expressions

In both basic and advanced PostgreSQL training sessions, I often find that students know practically nothing about how powerful indexes on expressions can be (if they know of their existence at all). So let me give you a short overview.

Suppose we have a table with a range of timestamps (yes, the generate_series function can generate timestamps, not just numbers):

CREATE TABLE t AS
SELECT d, repeat(md5(d::text), 10) AS padding
FROM generate_series(timestamp '1900-01-01',
timestamp '2100-01-01',
interval '1 day') s(d);
VACUUM ANALYZE t;

The table also contains a padding column, just to make it a bit larger. Now, let's run a simple range query, returning only one month out of the roughly 200 years available in the table. Running the query with EXPLAIN, we get approximately the following plan:

EXPLAIN SELECT * FROM t WHERE d BETWEEN '2001-01-01' AND '2001-02-01';

QUERY PLAN
------------------------------------------------------------------------
Seq Scan on t (cost=0.00..4416.75 rows=32 width=332)
Filter: ((d >= '2001-01-01 00:00:00'::timestamp without time zone)
AND (d <= '2001-02-01 00:00:00'::timestamp without time zone))
(2 rows)

On my machine, the query runs in about 20 milliseconds. Not bad, considering it has to scan the entire table of roughly 75 thousand rows.

But let's create an index on the timestamp column (all indexes in this article are of the default type, i.e. btree, unless explicitly stated otherwise):

CREATE INDEX idx_t_d ON t (d);

And now let's try to run the query again:

QUERY PLAN
------------------------------------------------------------------------
Index Scan using idx_t_d on t (cost=0.29..9.97 rows=34 width=332)
Index Cond: ((d >= '2001-01-01 00:00:00'::timestamp without time zone)
AND (d <= '2001-02-01 00:00:00'::timestamp without time zone))
(2 rows)

Now it runs in about 0.5 milliseconds, roughly 40 times faster. But that was, of course, a plain index created directly on the column, not an index on an expression. So suppose that instead of the previous query, we want the data for the first day of each month, using the following query:

SELECT * FROM t WHERE EXTRACT(day FROM d) = 1;

which, however, cannot use the index, because it needs to evaluate an expression on the column, while the index is built on the column itself, as EXPLAIN ANALYZE shows:

QUERY PLAN
------------------------------------------------------------------------
Seq Scan on t (cost=0.00..4416.75 rows=365 width=332)
(actual time=0.045..40.601 rows=2401 loops=1)
Filter: (date_part('day'::text, d) = '1'::double precision)
Rows Removed by Filter: 70649
Planning time: 0.209 ms
Execution time: 43.018 ms
(5 rows)

So the query not only has to perform a sequential scan, it also has to evaluate the expression for every row, increasing the query time to 43 milliseconds.

The database cannot use the index here for several reasons. Indexes (btree indexes, at least) rely on querying sorted data via the tree structure; the first query (a range condition on d) benefits from that, but the second (a call to extract) cannot.

Note: Another problem is that the set of operators supported by the index (i.e. those that can be evaluated directly on the index) is very limited. The extract function is not supported, so the query cannot work around the ordering problem by using a Bitmap Index Scan.

In theory, the database could try to transform the condition into a set of range conditions, but that is extremely difficult and specific to each expression. In this case it would have to generate an effectively unbounded number of such "per-day" ranges, because the planner does not actually know the minimum/maximum timestamps in the table. So the database does not even try.
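To get a feeling for why this is hopeless, here is a sketch of what such a rewrite would have to look like (the date literals are purely illustrative):

```
-- a hypothetical rewrite of EXTRACT(day FROM d) = 1 as explicit ranges:
-- one OR branch per month, for every month that might exist in the table
SELECT * FROM t
 WHERE (d >= '1900-01-01' AND d < '1900-01-02')
    OR (d >= '1900-02-01' AND d < '1900-02-02')
    OR (d >= '1900-03-01' AND d < '1900-03-02');
    -- ... and so on: thousands of branches over the 200-year range,
    -- and the planner cannot even know where the range ends
```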

But while the database does not know how to convert the condition, developers usually know. For example the condition:

(column + 1) >= 1000

is not hard to rewrite like this:

column >= 999

and that version works fine with the index.
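As a quick sanity check, here is a sketch of how the difference shows up in EXPLAIN (the table s and column c are hypothetical, and the planner will only prefer the index on a reasonably large table):

```
-- assuming: CREATE TABLE s (c int); filled with enough rows
CREATE INDEX idx_s_c ON s (c);

EXPLAIN SELECT * FROM s WHERE (c + 1) >= 1000;  -- Seq Scan: the index
                                                -- on c cannot be used
EXPLAIN SELECT * FROM s WHERE c >= 999;         -- Index/Bitmap Scan
                                                -- using idx_s_c
```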

But what if such a transformation is impossible, as is the case for our query:

SELECT * FROM t WHERE EXTRACT(day FROM d) = 1;

In this case, the developer faces the same problem of unknown min/max values for column d, and even then would have to generate a huge number of ranges.

Well, this article is about indexes on expressions, yet so far we have only used regular indexes built directly on the column. So let's create our first index on an expression:

CREATE INDEX idx_t_expr ON t ((extract(day FROM d)));
ANALYZE t;

which gives us the following plan:

QUERY PLAN
------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=47.35..3305.25 rows=2459 width=332)
(actual time=2.400..12.539 rows=2401 loops=1)
Recheck Cond: (date_part('day'::text, d) = '1'::double precision)
Heap Blocks: exact=2401
-> Bitmap Index Scan on idx_t_expr (cost=0.00..46.73 rows=2459 width=0)
(actual time=1.243..1.243 rows=2401 loops=1)
Index Cond: (date_part('day'::text, d) = '1'::double precision)
Planning time: 0.374 ms
Execution time: 17.136 ms
(7 rows)

It does not give the same 40x speedup as the index in the first example, but that is to be expected, since this query returns far more tuples (2401 vs. 32). Moreover, they are spread across the whole table and not as localized as in the first example. So a roughly 2x speedup is quite good, and in many real-life situations you will see a much larger improvement.

The ability to use indexes for conditions with complex expressions is, of course, the reason why people create indexes on expressions in the first place. But it is not their only advantage.

If you compare the two query plans below (without and with the expression index), you will notice the following:

QUERY PLAN
------------------------------------------------------------------------
Seq Scan on t (cost=0.00..4416.75 rows=365 width=332)
(actual time=0.045..40.601 rows=2401 loops=1)
...

QUERY PLAN
------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=47.35..3305.25 rows=2459 width=332)
(actual time=2.400..12.539 rows=2401 loops=1)
...

Right: creating the index on the expression significantly improved the row estimate. Without the index, we only have statistics (MCV lists + histograms) on the raw table columns, and the database does not know how to estimate the expression

EXTRACT(day FROM d) = 1

So it applies the default estimate for equality comparisons, which is about 0.5% of all rows; since the table has 73050 rows, we get an estimate of just 365 rows. In real applications you will often see much larger estimation errors.

With the index, however, the database also collects statistics on the index columns, and in this case the column contains the result of the expression. During planning, the optimizer uses those statistics and produces a much better estimate.
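You can inspect these statistics yourself: for an expression index, PostgreSQL exposes them in the pg_stats view under the name of the index. A small sketch (the columns selected here are just a few of those the view provides):

```
-- statistics gathered for the expression index created above;
-- attname will be the generated name of the expression column
SELECT tablename, attname, n_distinct, most_common_vals
  FROM pg_stats
 WHERE tablename = 'idx_t_expr';
```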

This is a great advantage that can help solve problems with poor query plans caused by inaccurate estimates. Yet most people are unaware of this handy tool.

And the usefulness of this tool only increased with the introduction of the JSONB data type in version 9.4, because it is practically the only way to gather statistics about the contents of JSONB documents.

When working with JSONB documents, there are two basic indexing strategies. You can create a GIN/GiST index on the whole document, like this:

CREATE INDEX ON t USING GIN (jsonb_column);

which allows you to query arbitrary parts of the JSONB column, use the containment operator to compare documents, and so on. That is great, but you still only have the basic per-column statistics, which are not very useful, since the documents are treated as scalar values (and nobody matches whole documents or queries ranges of documents).

An index on an expression, created for example like this:

CREATE INDEX ON t ((jsonb_column ->> 'id'));

will only be useful for that particular expression, i.e. for queries like:

SELECT * FROM t WHERE jsonb_column ->> 'id' = '123';

but not for queries on other JSON keys, for example on the value key:

SELECT * FROM t WHERE jsonb_column ->> 'value' = 'xxxx';

This does not mean that GIN/GiST indexes on the whole document are useless, but you have to choose. Either you create a targeted expression index, useful when querying a particular key and with the added benefit of statistics on the expression; or you create a GIN/GiST index on the whole document, able to handle queries on arbitrary keys, but without the statistics.

However, you can kill both birds with one stone here, because you can create both indexes at the same time, and the database will choose which one to use for each query. And you will have accurate statistics thanks to the expression index.
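Continuing the examples above, the combination could look like this (the index names are arbitrary, and jsonb_column stands for whatever JSONB column the table actually has):

```
-- GIN index on the whole document: handles queries on arbitrary keys
CREATE INDEX idx_t_jsonb_gin ON t USING GIN (jsonb_column);

-- btree index on one specific expression: speeds up equality on that key
-- and, as a side effect, gives the planner statistics on the expression
CREATE INDEX idx_t_jsonb_id ON t ((jsonb_column ->> 'id'));

ANALYZE t;  -- collect statistics, including those for the expression
```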

Unfortunately, the expression index and the GIN/GiST index expect different kinds of conditions:

-- expression (btree)
SELECT * FROM t WHERE jsonb_column ->> 'id' = '123';

-- GIN/GiST
SELECT * FROM t WHERE jsonb_column @> '{"id" : 123}';

so the planner cannot combine their strengths, i.e. use the statistics from the expression index for the estimate while executing the query with the GIN/GiST index.
Article based on information from habrahabr.ru
