DEV Community

Cover image for Designing the Right PostgreSQL Index Using Query Plans and Statistics
bertrand HARTWIG
bertrand HARTWIG

Posted on

Designing the Right PostgreSQL Index Using Query Plans and Statistics

PostgreSQL index design is often misunderstood.

Many developers think that creating a good index simply means:

“Create an index containing the columns from the WHERE clause.”

In reality, efficient index design is far more nuanced.

The order of columns inside a composite index matters enormously, and the best choice depends on:

  • Predicate types (=, >=, BETWEEN, LIKE, etc.)
  • Column selectivity
  • Table size
  • PostgreSQL planner statistics
  • Actual execution plans

This article explains the core principles behind efficient PostgreSQL index design before showing how pgAssistant automates this process using execution plans and database statistics.


Why Index Design Is Difficult

Consider the following query:

SELECT
    order_id,
    customer_id,
    employee_id,
    order_date,
    ship_country
FROM public.orders
WHERE customer_id = $1
  AND employee_id = $2
  AND order_date >= DATE $3;
Enter fullscreen mode Exit fullscreen mode

At first glance, several index definitions may appear reasonable:

(customer_id, employee_id, order_date)
Enter fullscreen mode Exit fullscreen mode
(order_date, customer_id, employee_id)
Enter fullscreen mode Exit fullscreen mode
(employee_id, customer_id, order_date)
Enter fullscreen mode Exit fullscreen mode

But these indexes do not behave the same way.

Choosing the correct ordering requires understanding how PostgreSQL traverses B-Tree indexes.


The Fundamental Rule of B-Tree Indexes

For PostgreSQL B-Tree indexes:

  1. Equality predicates should come first
  2. Range predicates should come last

This is the single most important rule in multi-column index design.


Equality Predicates Are Highly Selective

Predicates such as:

=
IN (...)
IS NULL
Enter fullscreen mode Exit fullscreen mode

allow PostgreSQL to navigate directly to a very precise section of the index tree.

In our query:

customer_id = $1
employee_id = $2
Enter fullscreen mode Exit fullscreen mode

are equality predicates.

If the index begins with these columns:

(customer_id, employee_id, ...)
Enter fullscreen mode Exit fullscreen mode

PostgreSQL can rapidly narrow the search space.

Conceptually:

customer_id = exact branch
employee_id = exact sub-branch
Enter fullscreen mode Exit fullscreen mode

The planner can jump almost directly to the matching rows.


Why Range Predicates Should Come Last

Now consider:

order_date >= DATE $3
Enter fullscreen mode Exit fullscreen mode

This is a range predicate.

Once PostgreSQL enters a range scan inside a B-Tree index, the remaining columns become far less useful for navigation.

For example, with this index:

(order_date, customer_id, employee_id)
Enter fullscreen mode Exit fullscreen mode

the planner must first scan all matching dates:

order_date >= DATE $3
Enter fullscreen mode Exit fullscreen mode

which may represent a very large portion of the table.

Only afterward can additional filtering occur.

This usually produces significantly more index scanning.

That is why range predicates are generally placed at the end of composite indexes.


Column Order Among Equality Predicates

Once equality predicates are identified, the next challenge is:

Which equality column should come first?

The answer depends on selectivity.

PostgreSQL exposes this information through planner statistics.


PostgreSQL Statistics Drive Good Index Design

For our query, PostgreSQL statistics are:

order_date [>=]:
    n_distinct=814
    null_frac=0.0000
    mcv_count=100
    histogram_bounds=101

customer_id [=]:
    n_distinct=89
    null_frac=0.0000
    mcv_count=89
    histogram_bounds=0

employee_id [=]:
    n_distinct=9
    null_frac=0.0000
    mcv_count=9
    histogram_bounds=0
Enter fullscreen mode Exit fullscreen mode

The most important metric here is:

n_distinct
Enter fullscreen mode Exit fullscreen mode

Understanding n_distinct

n_distinct estimates the number of distinct values in a column.

Higher n_distinct usually means:

  • higher selectivity
  • fewer matching rows
  • better filtering efficiency

In our example:

Column n_distinct
customer_id 89
employee_id 9

customer_id is significantly more selective.

Therefore, PostgreSQL benefits more from filtering by customer_id first.


Why Selectivity Matters

Imagine the table contains 1 million rows.

Filtering by:

employee_id
Enter fullscreen mode Exit fullscreen mode

may still leave:

1,000,000 / 9 ≈ 111,111 rows
Enter fullscreen mode Exit fullscreen mode

Filtering by:

customer_id
Enter fullscreen mode Exit fullscreen mode

may reduce the result to:

1,000,000 / 89 ≈ 11,236 rows
Enter fullscreen mode Exit fullscreen mode

Starting with the most selective equality predicate drastically reduces the search space.

This improves:

  • index scan efficiency
  • cache locality
  • heap access reduction
  • execution time

The Correct Index Design

Applying these principles:

  1. Equality predicates first
  2. Most selective equality columns first
  3. Range predicates last

produces:

CREATE INDEX CONCURRENTLY
    pga_idx_orders_customer_id_employee_id_order_date
ON public.orders
    (customer_id, employee_id, order_date);
Enter fullscreen mode Exit fullscreen mode

This ordering allows PostgreSQL to:

  1. Navigate efficiently using exact matches
  2. Reduce scanned rows as early as possible
  3. Apply the range scan only after narrowing the search space

Good Index Design Also Depends on Table Size

One of the biggest misconceptions about PostgreSQL optimization is:

“Indexes are always faster.”

This is false.

For small tables, PostgreSQL often prefers a Sequential Scan (Seq Scan) even when an index exists.

Why?

Because using an index has overhead:

  • traversing the B-Tree
  • reading index pages
  • performing heap lookups
  • random I/O access

For sufficiently small tables, scanning the entire table sequentially is cheaper.


Query Plans Matter More Than Theory

A theoretically perfect index is useless if PostgreSQL never uses it.

That is why index recommendation engines should never rely only on SQL syntax.

They must also inspect:

  • execution plans
  • estimated costs
  • table statistics
  • row estimates
  • planner decisions

The Importance of Execution Plans

The execution plan reveals how PostgreSQL actually executes a query.

For example:

EXPLAIN ANALYZE
SELECT ...
Enter fullscreen mode Exit fullscreen mode

may show:

Seq Scan on orders
Enter fullscreen mode Exit fullscreen mode

or:

Index Scan using ...
Enter fullscreen mode Exit fullscreen mode

This distinction is critical.

A query may contain filter predicates that look index-friendly, but PostgreSQL may correctly determine that:

  • the table is too small
  • selectivity is too low
  • too many rows would still be scanned

and therefore prefer a sequential scan.


How pgAssistant Recommends Indexes

pgAssistant does not simply parse SQL queries.

It combines multiple sources of information:

1. Query Plan

pgAssistant analyzes nodes in the query plan to identify candidate index columns.


2. Predicate Types

It classifies predicates into categories:

Equality predicates

=
IN
IS NULL
Enter fullscreen mode Exit fullscreen mode

Range predicates

>
>=
<
<=
BETWEEN
LIKE 'prefix%'
Enter fullscreen mode Exit fullscreen mode

Equality predicates are prioritized before range predicates.


3. Column Statistics

pgAssistant uses PostgreSQL planner statistics such as:

  • n_distinct
  • null_frac
  • most_common_vals
  • most_common_freqs
  • histogram_bounds

to estimate column selectivity.

Columns with higher selectivity are prioritized earlier in the index definition.


4. Table Statistics

pgAssistant also evaluates table-level statistics, including:

  • estimated row counts
  • table size
  • planner cost estimates

This is extremely important because some tables are simply too small to justify an index.

In these cases, recommending an index would create unnecessary maintenance overhead without improving performance.


How pgAssistant Recommends PostgreSQL Indexes

Why Query Syntax Alone Is Not Enough

A good recommendation depends on:

  • predicate types
  • column selectivity
  • table statistics
  • planner estimates
  • execution plans
  • existing indexes already used by PostgreSQL

This is precisely the approach implemented by pgAssistant.


pgAssistant Uses Execution Plans First

pgAssistant starts from the PostgreSQL execution plan.

It analyzes:

EXPLAIN (ANALYZE, FORMAT JSON)
Enter fullscreen mode Exit fullscreen mode

This is extremely important because the execution plan reveals:

  • whether PostgreSQL uses a Seq Scan
  • whether an Index Scan already exists
  • whether residual filtering still occurs after index access
  • whether planner row estimations are inaccurate
  • whether a composite index could reduce heap filtering

This avoids many false-positive recommendations.

A query may look index-friendly while PostgreSQL is already using the optimal access path.


pgAssistant Analyzes Existing Access Paths

The advisor first determines how PostgreSQL currently accesses the table.

Examples:

Seq Scan
Index Scan
Index Only Scan
Bitmap Heap Scan
Enter fullscreen mode Exit fullscreen mode

This distinction is critical.

Sequential Scan Case

If PostgreSQL performs a Seq Scan, pgAssistant evaluates whether an index could realistically improve performance.

Indexed Access Case

If PostgreSQL already uses an index, pgAssistant does not stop there.

It also analyzes:

  • Index Cond
  • Filter
  • Recheck Cond
  • Rows Removed by Filter

This allows pgAssistant to detect situations such as:

Index Scan using idx_customer on orders
  Index Cond: (customer_id = 42)
  Filter: (employee_id = 5)
Enter fullscreen mode Exit fullscreen mode

In this case, PostgreSQL uses an index, but still visits many rows that are later discarded by the executor.

pgAssistant can therefore recommend a more selective composite index such as:

(customer_id, employee_id)
Enter fullscreen mode Exit fullscreen mode

instead of considering the existing index “good enough”.


Predicate Classification

Once predicates are extracted from the execution plan, pgAssistant classifies them by operator type.

Internally, predicates are ranked according to B-Tree efficiency.

Equality Predicates

Highest priority:

=
IN
IS NULL
Enter fullscreen mode Exit fullscreen mode

These predicates allow PostgreSQL to navigate directly to a very small portion of the index tree.


Prefix Search Predicates

Second priority:

LIKE 'abc%'
Enter fullscreen mode Exit fullscreen mode

Prefix searches can still benefit efficiently from B-Tree traversal.


Range Predicates

Lowest priority:

>
>=
<
<=
BETWEEN
Enter fullscreen mode Exit fullscreen mode

Once PostgreSQL enters a range scan, subsequent columns become far less effective for index navigation.

That is why range predicates are typically placed last in composite indexes.


How pgAssistant Orders Index Columns

After classifying predicates, pgAssistant computes candidate index ordering.

The internal ordering logic is:

1. Equality predicates first
2. Prefix predicates second
3. Range predicates last
4. Inside each category:
   highest cardinality first
Enter fullscreen mode Exit fullscreen mode

This logic is implemented directly inside:

reorder_index_candidate_columns()
Enter fullscreen mode Exit fullscreen mode

The advisor therefore builds indexes that align with PostgreSQL B-Tree traversal behavior.


Why Column Cardinality Matters

pgAssistant uses PostgreSQL statistics to estimate selectivity.

The most important metric is:

n_distinct
Enter fullscreen mode Exit fullscreen mode

which estimates the number of distinct values in a column.

Higher cardinality usually means:

  • fewer matching rows
  • better filtering
  • smaller index scan ranges

For our example:

customer_id [=]: n_distinct=89
employee_id [=]: n_distinct=9
order_date [>=]: n_distinct=814
Enter fullscreen mode Exit fullscreen mode

Although order_date has the highest cardinality, it is a range predicate and therefore placed last.

Among equality predicates:

customer_id > employee_id
Enter fullscreen mode Exit fullscreen mode

because:

89 > 9
Enter fullscreen mode Exit fullscreen mode

The final ordering becomes:

(customer_id, employee_id, order_date)
Enter fullscreen mode Exit fullscreen mode

pgAssistant Uses PostgreSQL Statistics

pgAssistant enriches every recommendation using planner statistics extracted from PostgreSQL.

Examples include:

  • n_distinct
  • null_frac
  • most_common_vals
  • most_common_freqs
  • histogram_bounds

The advisor even exposes these statistics in its recommendation reasoning.

Example:

customer_id [=]: n_distinct=89, null_frac=0.0000
employee_id [=]: n_distinct=9, null_frac=0.0000
order_date [>=]: n_distinct=814
Enter fullscreen mode Exit fullscreen mode

This makes the recommendation transparent and explainable.


pgAssistant Also Uses Table Statistics

An index is not always beneficial.

This is one of the most important concepts in PostgreSQL optimization.

For small tables, PostgreSQL often correctly prefers:

Seq Scan
Enter fullscreen mode Exit fullscreen mode

instead of:

Index Scan
Enter fullscreen mode Exit fullscreen mode

because:

  • sequential reads are cheap
  • index traversal has overhead
  • heap fetches introduce random I/O
  • scanning the entire table may cost less

This is why pgAssistant also evaluates:

  • estimated row counts
  • table size
  • planner costs
  • execution frequency
  • workload intensity

The advisor does not blindly recommend indexes whenever a sequential scan appears.


Detecting Inefficient Indexed Access

One particularly powerful aspect of pgAssistant is its ability to analyze residual filtering.

For indexed scans, the advisor evaluates:

Rows Removed by Filter
Enter fullscreen mode Exit fullscreen mode

If PostgreSQL retrieves many tuples from the index only to discard them afterward, pgAssistant detects that the current index may be incomplete.

Internally, the advisor computes:

Residual filter kept X% of tuples visited
Enter fullscreen mode Exit fullscreen mode

This helps identify situations where adding an additional column to a composite index could drastically reduce heap filtering.


Detecting Planner Estimation Problems

pgAssistant also compares:

plan_rows
vs
actual_rows
Enter fullscreen mode Exit fullscreen mode

Large estimation gaps may indicate:

  • stale statistics
  • data skew
  • correlation issues
  • missing extended statistics

This additional analysis improves the reliability of recommendations.


The pgAssistant Recommendation Algorithm

Conceptually, pgAssistant follows this workflow:

1. Analyze EXPLAIN ANALYZE JSON plan
2. Detect access paths
3. Extract predicates from:
   - Index Cond
   - Filter
   - Recheck Cond
4. Classify predicates by operator type
5. Rank predicates for B-Tree efficiency
6. Use PostgreSQL statistics to estimate selectivity
7. Order columns by:
   - predicate class
   - cardinality
8. Evaluate table statistics
9. Evaluate execution costs
10. Detect residual filtering
11. Compare against existing indexes
12. Recommend index only if beneficial
Enter fullscreen mode Exit fullscreen mode

Example: Final Recommendation

Given:

SELECT
    order_id,
    customer_id,
    employee_id,
    order_date,
    ship_country
FROM public.orders
WHERE customer_id = $1
  AND employee_id = $2
  AND order_date >= DATE $3;
Enter fullscreen mode Exit fullscreen mode

pgAssistant evaluates:

Column Predicate Priority n_distinct
customer_id = Equality 89
employee_id = Equality 9
order_date >= Range 814

The advisor therefore generates:

CREATE INDEX CONCURRENTLY
    "pga_idx_orders_customer_id_employee_id_order_date"
ON "public"."orders"
    ("customer_id", "employee_id", "order_date");
Enter fullscreen mode Exit fullscreen mode

This ordering follows PostgreSQL B-Tree optimization principles while also considering:

  • planner statistics
  • table characteristics
  • execution plan behavior
  • residual filtering
  • existing indexes already in use

Live demo

A public demo is available here:

https://ov-004f8b.infomaniak.ch/

Demo connection:

postgresql://postgres:demo@demo-db:5432/northwind

The public demo intentionally runs without AI.

Project links

Feedback welcome

The project is still evolving and many parts can certainly be improved.

If you work with PostgreSQL and have ideas, feedback, or criticisms, feel free to open an issue or discussion on GitHub.

Thanks for reading.

Top comments (0)