DEV Community

Gowtham Potureddi

PostgreSQL SQL Cheat Sheet — Clause Order, Joins, Aggregates, Windows

A PostgreSQL SQL cheat sheet is only useful when every row in it maps to something you can drop straight into a query — not a wall of syntax with no operational explanation. This guide condenses real PostgreSQL fluency to four primitives: the logical clause order (FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT), the six join shapes and the grain trap they create, GROUP BY with HAVING plus conditional aggregates for one-pass metrics, and window functions like ROW_NUMBER, RANK, DENSE_RANK, LAG, and LEAD for ranking and lookback. These four cover the bulk of analytical SQL — and the cheat-sheet style below is built so you can scan, copy a snippet, and tweak it for your own schema.

Every section walks through a detailed topic explanation, sub-topics with worked examples and runnable solutions, common beginner mistakes, and a worked interview-style scenario with a full traced answer. PostgreSQL syntax throughout — the dialect that drives DataLemur, CoderPad, most product-analytics live screens, and the bulk of modern data-engineering SQL corpora.



Top PostgreSQL SQL cheat sheet topics

The four numbered sections below follow this topic map — one row per H2, every row expanded into a full section with sub-topics, worked examples, a worked interview question, and a step-by-step traced solution:

# Topic Why it shows up in a PostgreSQL cheat sheet
1 Logical clause order — FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT The single most useful PostgreSQL mental model: the order you write clauses is not the order the engine evaluates them; knowing the evaluation order eliminates 80% of parse errors and explains why WHERE cannot reference aggregates or column aliases.
2 Joins and grain — INNER, LEFT, RIGHT, FULL, SELF, CROSS Joins combine rows but they also change grain; a careless 1:N join inflates row counts silently, and the LEFT JOIN ... IS NULL anti-join is the canonical "find rows in A with no match in B" pattern (orphan customers, churned users).
3 GROUP BY, HAVING, and conditional aggregates WHERE filters rows before grouping; HAVING filters groups after; COUNT(*) FILTER (WHERE …) and SUM(CASE WHEN …) express many metrics in one query — the universal duplicate finder HAVING COUNT(*) > 1 lives here.
4 Window functions — ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD Per-partition ranking without collapsing rows, top-N-per-group, second-highest salary, running totals with SUM(...) OVER (PARTITION BY ... ORDER BY ...), and month-over-month deltas via LAG; the most-graded primitive in modern SQL screens.

Beginner-friendly framing: every analytical SQL question reduces to four steps — filter rows, join tables without changing grain by accident, aggregate or rank, then present the result. Holding the clause-order diagram in your head (Section 1) lets you write SQL outside-in: pick the grain, then the joins, then the filters, then the projection. The cheat sheet below is organized in the same order you would write a real query.


1. PostgreSQL Logical Clause Order — FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT

The seven-stage evaluation order every PostgreSQL query follows

"Why does WHERE customer_count > 5 give me a parse error when I'm clearly counting customers?" is the signature beginner question — and the answer is logical clause order. The mental model: PostgreSQL evaluates clauses in a fixed order that is different from the order you write them; FROM/JOIN builds the row set, WHERE filters rows, GROUP BY collapses rows into groups, HAVING filters groups, SELECT projects columns, ORDER BY sorts, LIMIT/OFFSET trims. WHERE cannot reference aggregate functions because aggregates do not exist until after GROUP BY; column aliases declared in SELECT cannot be referenced in WHERE for the same reason.

[Diagram: horizontal seven-step PostgreSQL clause-order flow — FROM/JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT/OFFSET.]

Pro tip: Memorize one sentence — "From-Where-Group-Having-Select-Order-Limit" — and you can decode any PostgreSQL parse error in under five seconds. The error column "customer_count" does not exist almost always means the column is a SELECT-level alias being referenced in WHERE, which runs three stages earlier; lift the predicate into HAVING (if it references an aggregate) or repeat the expression inline in WHERE.

FROM and JOIN — build the working row set

The FROM/JOIN invariant: the first stage assembles a candidate row set by listing the tables (and how they join); every subsequent stage operates on this row set. Subqueries in FROM are also evaluated here, and LATERAL joins let later subqueries reference earlier rows.

  • Single table — FROM orders produces one row per orders row.
  • Joined tables — FROM orders o JOIN customers c ON c.id = o.customer_id produces one row per matching pair.
  • Subquery in FROM — FROM (SELECT ...) t materializes the inner result, then treats it as a table.
  • LATERAL subquery — FROM orders o, LATERAL (SELECT ... WHERE x = o.id) s re-evaluates the inner subquery per outer row.
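The LATERAL shape appears nowhere else in this section, so here is one hedged sketch against the same customers/orders tables used in the worked examples — a "latest order per customer" query:

```sql
-- Latest order per customer: the LATERAL subquery is re-evaluated once per
-- customer row, so each customer gets their own ORDER BY ... LIMIT 1 result.
SELECT c.name, recent.order_id, recent.amount
FROM customers c
LEFT JOIN LATERAL (
  SELECT o.order_id, o.amount
  FROM orders o
  WHERE o.customer_id = c.id
  ORDER BY o.order_date DESC
  LIMIT 1
) recent ON TRUE;
```

LEFT JOIN ... ON TRUE keeps customers with no orders at all; a plain comma + LATERAL would drop them the way an INNER JOIN does.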

Worked example. A FROM with a LEFT JOIN that produces the right row set before any filter runs.

step output cardinality
FROM customers alone 3 rows
LEFT JOIN orders 4 rows (Alice has 2 orders, Bob 1, Carol 0 padded with NULLs)
ready for WHERE filtering 4 rows

Step-by-step explanation.

  1. The engine reads customers first, producing three rows (Alice, Bob, Carol).
  2. For each customer, it scans orders for matching customer_id rows; Alice matches 2 orders, Bob matches 1, Carol matches 0.
  3. Because the join is LEFT, Carol's row is preserved with the right-side columns filled with NULLs — total 4 rows.
  4. This 4-row stream is what WHERE will see; no filtering has happened yet.
  5. Without understanding that FROM runs first, you cannot reason about why a WHERE predicate on the right side of a LEFT JOIN silently converts the join into an INNER JOIN.

Worked-example solution.

SELECT c.name, o.order_id, o.amount
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id;

Rule of thumb: if your LEFT JOIN is producing fewer rows than expected, check whether you have a WHERE predicate that references the right-side table — that predicate runs after the join and discards the NULL-padded rows.
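A minimal sketch of the fix, on the same customers/orders tables: a right-side predicate that must not discard unmatched rows belongs in the ON condition, not in WHERE.

```sql
-- Right-side filter inside ON: Carol (no orders) is still NULL-padded and kept.
SELECT c.name, o.order_id, o.amount
FROM customers c
LEFT JOIN orders o
  ON  o.customer_id = c.id
  AND o.amount > 0;          -- part of the join condition, evaluated per pair

-- The same predicate in WHERE would run after the join, evaluate NULL > 0 to
-- NULL for Carol's padded row, and silently drop her (an accidental INNER JOIN).
```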

WHERE — row-level predicates before grouping

The WHERE invariant: WHERE filters individual rows from the FROM/JOIN output before GROUP BY runs; it can reference any column from the joined row set, but cannot reference aggregate functions or SELECT-level aliases. This is the cheapest place to drop rows — push predicates here whenever possible.

  • Row predicates — WHERE amount > 30, WHERE order_date >= '2026-01-01'.
  • IN / EXISTS — WHERE customer_id IN (SELECT id FROM premium), WHERE EXISTS (...).
  • BETWEEN — inclusive on both ends; WHERE x BETWEEN 1 AND 10 is x >= 1 AND x <= 10.
  • IS NULL / IS NOT NULL — the only way to check for NULL; never = NULL.

Worked example. Filter to one day of orders before grouping.

filter rows surviving
no filter 12,847 (full day's orders)
WHERE order_date = '2026-05-10' 12,847 (already today's)
WHERE order_date = '2026-05-10' AND amount > 100 4,290 (high-value only)

Step-by-step explanation.

  1. FROM orders returns the full row stream.
  2. WHERE order_date = '2026-05-10' is evaluated per row; rows with other dates are dropped.
  3. AND amount > 100 is evaluated next; this is a row predicate (not an aggregate), so it lives in WHERE correctly.
  4. The surviving row set (4,290 rows) flows into GROUP BY if one is present, otherwise into SELECT.
  5. Pushing the date filter into WHERE rather than HAVING is critical for index usage: a B-tree index on order_date can prune 95% of the table before any grouping happens.

Worked-example solution.

SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE order_date = '2026-05-10'
  AND amount > 100
GROUP BY customer_id;

Rule of thumb: if the predicate uses only raw row columns, it belongs in WHERE; if it uses SUM, COUNT, AVG, MIN, MAX, it belongs in HAVING.

GROUP BY → HAVING → SELECT → ORDER BY → LIMIT

The downstream invariant: after WHERE, the engine evaluates GROUP BY (collapsing rows into one row per distinct key combination), then HAVING (filtering groups), then SELECT (projecting columns and computing expressions), then ORDER BY (sorting the final result), then LIMIT/OFFSET (trimming for pagination). SELECT-level aliases become referenceable in ORDER BY (and, as a PostgreSQL extension, in GROUP BY) and in any outer query that wraps this one as a subquery.

  • GROUP BY col1, col2 — one output row per distinct (col1, col2) combination.
  • HAVING agg_pred — filter groups; can reference COUNT(*), SUM(col), etc.
  • SELECT col, agg(col2) AS x — project columns; aggregates and aliases are computed here.
  • ORDER BY x DESC, col — can reference SELECT aliases; deterministic with a tiebreaker.
  • LIMIT N OFFSET M — page slicing; always pair with ORDER BY for determinism.

Worked example. Group by customer, filter to high-spend customers, sort descending, top 5.

stage rows
FROM orders WHERE order_date = '2026-05-10' 4,290
GROUP BY customer_id 1,720 (one row per customer)
HAVING SUM(amount) > 500 312 (high-spend)
SELECT customer_id, SUM(amount) AS spend 312 (projected)
ORDER BY spend DESC, customer_id 312 (sorted)
LIMIT 5 5

Step-by-step explanation.

  1. WHERE produces 4,290 rows for one day with amount > 100.
  2. GROUP BY customer_id collapses them into 1,720 buckets, one per customer.
  3. HAVING SUM(amount) > 500 keeps only the 312 buckets whose total spend exceeds $500.
  4. SELECT computes the alias spend = SUM(amount) and projects two columns.
  5. ORDER BY spend DESC, customer_id sorts the 312 surviving rows by descending spend with a deterministic tiebreaker; LIMIT 5 returns just the top five.

Worked-example solution.

SELECT customer_id, SUM(amount) AS spend
FROM orders
WHERE order_date = '2026-05-10'
  AND amount > 100
GROUP BY customer_id
HAVING SUM(amount) > 500
ORDER BY spend DESC, customer_id
LIMIT 5;

Rule of thumb: every clause has a fixed slot; if you find yourself wanting WHERE to reference an aggregate, the predicate belongs in HAVING instead — and if you want ORDER BY to use a long expression, alias it in SELECT and reference the alias.

Common beginner mistakes

  • WHERE COUNT(*) > 1 — parse error; aggregates do not exist until after GROUP BY. Use HAVING.
  • Referencing a SELECT alias in WHERE — WHERE spend > 100 after SELECT SUM(amount) AS spend fails; either repeat the expression or move the predicate to HAVING.
  • Selecting a non-aggregated, non-GROUP BY column — strict PostgreSQL errors out with "must appear in GROUP BY"; some other dialects pick an arbitrary row silently.
  • LIMIT 5 without ORDER BY — non-deterministic; two runs of the same query return different rows.
  • Putting HAVING before GROUP BY — syntax error; the clause order is mandatory.
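The first two mistakes share one fix; a minimal before/after sketch on the orders table:

```sql
-- Broken: spend is a SELECT-level alias, and WHERE runs before SELECT.
-- SELECT customer_id, SUM(amount) AS spend
-- FROM orders
-- WHERE spend > 100          -- ERROR: column "spend" does not exist
-- GROUP BY customer_id;

-- Fixed: the aggregate predicate moves into HAVING.
SELECT customer_id, SUM(amount) AS spend
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 100;
```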

PostgreSQL Interview Question on Clause Order

Given orders(order_id, customer_id, amount, order_date), find every customer who placed more than 3 orders today with total spend above $500. Return customer_id and total_spend, sorted by total_spend descending.

Solution Using WHERE + GROUP BY + HAVING in the Right Slots

SELECT customer_id,
       SUM(amount) AS total_spend
FROM orders
WHERE order_date = CURRENT_DATE
GROUP BY customer_id
HAVING COUNT(*) > 3
   AND SUM(amount) > 500
ORDER BY total_spend DESC, customer_id;

Why this works: WHERE order_date = CURRENT_DATE filters to today's row set first (cheap, index-friendly); GROUP BY customer_id collapses to one row per customer; HAVING evaluates the two aggregate predicates together (more than 3 orders AND total > $500); SELECT projects the alias total_spend; ORDER BY total_spend DESC, customer_id produces a deterministic ordering. Single pass over today's rows with hash aggregation.

Step-by-step trace for sample data on 2026-05-10:

customer_id orders today sum(amount) passes HAVING?
101 5 720 ✓
102 4 410 ✗ (sum ≤ 500)
103 6 1,250 ✓
104 2 800 ✗ (count ≤ 3)
105 4 520 ✓

Three customers survive both predicates.

Output:

customer_id total_spend
103 1250
101 720
105 520

Why this works — concept by concept:

  • WHERE first — order_date = CURRENT_DATE is a row predicate using a non-aggregated column; pushing it into WHERE shrinks the row set before grouping and lets the planner use a B-tree index on order_date.
  • GROUP BY customer_id — collapses today's rows into one bucket per customer; every subsequent aggregate is computed inside this bucket.
  • HAVING two-predicate AND — COUNT(*) > 3 and SUM(amount) > 500 are both aggregate predicates; combining them with AND in a single HAVING is the canonical multi-condition group filter.
  • SELECT projection + alias — SUM(amount) AS total_spend is computed here; the alias becomes available to ORDER BY (but not to WHERE / HAVING).
  • ORDER BY total_spend DESC, customer_id — descending sort on the metric with a deterministic tiebreaker via customer_id; reviewers depend on stable ordering.
  • O(|today's orders| + G log G) time — single hash aggregation produces G groups; final sort is G log G. With an index on (order_date, customer_id) the planner can stream rather than hash.

Drill the SQL filtering practice page for WHERE patterns and the SQL aggregation practice page for GROUP BY + HAVING shapes.



2. PostgreSQL Joins and Grain — INNER, LEFT, RIGHT, FULL, SELF, CROSS

Joins, anti-joins, and the grain-inflation trap in PostgreSQL

"Why is SUM(amount) returning double what I expect after I add a JOIN?" is the signature grain-inflation question — and the answer is that joins do not just combine columns; they change the row cardinality of the result. The mental model: INNER JOIN keeps only matching pairs, LEFT JOIN keeps every left row and pads the right side with NULLs, RIGHT JOIN is the mirror, FULL OUTER JOIN keeps both sides' unmatched rows, SELF JOIN joins a table to itself (for hierarchies and pair queries), CROSS JOIN produces a Cartesian product (one row per (left, right) pair). The cardinality of any join is bounded by |left| × |right|, and a 1:N relationship inflates left rows by N — the silent source of doubled metrics.

[Diagram: Venn diagrams for INNER JOIN (the intersection of Table A and Table B) and LEFT JOIN (all of Table A, with a NULL pocket where Table B has no match).]

Pro tip: Before writing any join, ask "what is the grain of the result?" — orders, order lines, customer-day, or (customer, product) pair. A 1:N join (e.g., customers to orders) inflates customer rows by the number of orders; SUM(customer.lifetime_value) after that join returns lifetime value × order count, not lifetime value. Always state the grain out loud.

INNER JOIN — keep only matching pairs (no padding)

The INNER JOIN invariant: a left row is paired with a right row iff the join predicate is TRUE; unmatched rows on either side are discarded; the result cardinality is the count of matching pairs. This is the most common join and the fastest because the planner can short-circuit on no-match.

  • ON l.key = r.key — single-column equi-join; the planner hashes the right table.
  • Multi-columnON l.a = r.a AND l.b = r.b for composite keys.
  • Non-equiON l.range_start <= r.point AND l.range_end >= r.point (range join).
  • USING (col) — shorthand when both sides share the column name; merges the column.
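The non-equi shape deserves a sketch of its own — a range join against hypothetical salaries(emp_id, salary) and tax_brackets(bracket, lo, hi) tables (both illustrative, not from the examples above):

```sql
-- Range join: each salary matches the one bracket whose [lo, hi) interval
-- contains it; the join predicate is two inequalities, not an equality.
SELECT s.emp_id, s.salary, b.bracket
FROM salaries s
JOIN tax_brackets b
  ON  s.salary >= b.lo
  AND s.salary <  b.hi;
```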

Worked example. Two tables, three customers, two orders; one customer has no order.

customer order_id
Alice 101
Bob 102

Carol (no orders) does not appear — INNER JOIN dropped her.

Step-by-step explanation.

  1. The engine reads customers (Alice, Bob, Carol) and orders (101 for Alice, 102 for Bob).
  2. For each customers row, it scans orders for a matching customer_id.
  3. Alice matches order_id = 101; Bob matches order_id = 102; Carol has no match.
  4. Carol's row is silently discarded because the join is INNER — no NULL-padded row is produced.
  5. The output has two rows because there were two matching pairs; in general the INNER JOIN result cardinality N satisfies 0 ≤ N ≤ |customers| × |orders|.

Worked-example solution.

SELECT c.name AS customer, o.order_id
FROM customers c
INNER JOIN orders o
  ON o.customer_id = c.id;

Rule of thumb: reach for INNER JOIN whenever the question is "rows where both sides exist"; it is the smallest, fastest, most common join.

LEFT JOIN — keep every left row, pad the right with NULLs (anti-join trick)

The LEFT JOIN invariant: every row from the left table appears in the output; if no right row matches, the right columns are NULL; LEFT JOIN ... WHERE right.key IS NULL keeps exactly the left rows that had no match — the anti-join idiom. RIGHT JOIN is the mirror; flip the table order and use LEFT for consistency.

  • LEFT JOIN — preserves every left row.
  • Right columns NULL when no match — the key signal for anti-joins.
  • LEFT JOIN ... IS NULL anti-join — "find rows in A with no match in B".
  • RIGHT JOIN — mirror image; rarely needed (just flip table order and use LEFT).

Worked example. Same customers + orders; Carol is preserved with NULL right-side columns.

customer order_id
Alice 101
Bob 102
Carol NULL

Step-by-step explanation.

  1. For each customers row, scan orders for a matching customer_id.
  2. Alice matches → row (Alice, 101); Bob matches → row (Bob, 102).
  3. Carol does not match → row (Carol, NULL) is produced because the join is LEFT.
  4. To find Carol via the anti-join: add WHERE o.order_id IS NULL after the LEFT JOIN; only Carol's row passes the filter.
  5. Equivalent to WHERE NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id) and (under NOT NULL constraints) WHERE c.id NOT IN (SELECT customer_id FROM orders) — but the anti-join is immune to the NOT IN NULL-swallowing bug.

Worked-example solution.

SELECT c.name AS customer
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id
WHERE o.order_id IS NULL;

Rule of thumb: "find X with no Y" → LEFT JOIN ... WHERE Y.id IS NULL. Memorize this; it is the most-asked join shape in SQL interviews and the cleanest fix for the NOT IN ... NULL trap.
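The NOT EXISTS equivalent mentioned above, sketched on the same tables — same result set, and many readers find the intent clearer:

```sql
-- Anti-join via NOT EXISTS: keep customers for whom no matching order exists.
SELECT c.name AS customer
FROM customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.id
);
```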

FULL OUTER, SELF, and CROSS joins — the rarer shapes

The rarer-joins invariant: FULL OUTER JOIN keeps every left row AND every right row (with NULL padding on the unmatched side); SELF JOIN joins a table to itself by aliasing it twice (employees-and-managers, parent-child, pair queries); CROSS JOIN produces every (left, right) combination — the Cartesian product. Each has a narrow but important use.

  • FULL OUTER JOIN — reconcile two sources; rows from either side without a match get padded.
  • SELF JOIN — employee/manager, hierarchical recursion (alternative to recursive CTE), pair queries.
  • CROSS JOIN — generate every combination (small tables only) or paired with LATERAL for top-N per row.
  • Implicit cross join — comma-separated tables (FROM a, b) without an ON is a CROSS JOIN — usually a bug.
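A minimal CROSS JOIN sketch with two hypothetical lookup tables, sizes and colors — the legitimate "generate every combination" use, plus the comma form that produces the same thing by accident:

```sql
-- Every (size, color) combination: |sizes| x |colors| output rows.
SELECT s.size, col.color
FROM sizes s
CROSS JOIN colors col;

-- Identical result, often written by accident: comma-separated FROM, no ON.
-- SELECT s.size, col.color FROM sizes s, colors col;
```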

Worked example. Self-join employees to itself to surface each person's manager.

name manager_name
Alice Carol
Bob Carol
Carol NULL (CEO)

Step-by-step explanation.

  1. Alias the same employees table twice: e (for employees) and m (for managers).
  2. LEFT JOIN on e.manager_id = m.emp_id looks each employee up against the manager rows.
  3. Alice's manager_id points to Carol → row (Alice, Carol).
  4. Bob's manager_id points to Carol → row (Bob, Carol).
  5. Carol is the CEO so her manager_id IS NULL → no match → row (Carol, NULL) because the join is LEFT.

Worked-example solution.

SELECT e.name,
       m.name AS manager_name
FROM employees e
LEFT JOIN employees m
  ON m.emp_id = e.manager_id;

Rule of thumb: SELF JOIN is one-level hierarchy; for arbitrary-depth recursion (org chart traversal, BOM tree), reach for WITH RECURSIVE instead.
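For the arbitrary-depth case, a hedged WITH RECURSIVE sketch on the same employees(emp_id, name, manager_id) table — walking down from the CEO while tracking depth:

```sql
-- Recursive CTE: the base case seeds the CEO, the recursive step joins each
-- pass's rows back to employees to pull in their direct reports.
WITH RECURSIVE org AS (
  SELECT emp_id, name, manager_id, 1 AS depth
  FROM employees
  WHERE manager_id IS NULL            -- base case: the CEO
  UNION ALL
  SELECT e.emp_id, e.name, e.manager_id, org.depth + 1
  FROM employees e
  JOIN org ON e.manager_id = org.emp_id
)
SELECT name, depth
FROM org
ORDER BY depth, name;
```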

Common beginner mistakes

  • Forgetting that a 1:N JOIN inflates the left side — SUM(left.col) returns left.col × N.
  • Filtering the right table inside WHERE after a LEFT JOIN (e.g., WHERE o.amount > 0) — silently turns the LEFT JOIN into an INNER JOIN because NULL > 0 is NULL.
  • Using NOT IN (subquery) when the subquery can contain NULL — returns zero rows because x NOT IN (..., NULL, ...) is NULL, which fails the predicate.
  • Comma-separated FROM a, b with no ON clause — produces a Cartesian product (CROSS JOIN); usually a bug.
  • Joining on the wrong column (o.id = c.id instead of o.customer_id = c.id) — produces nonsense rows.

PostgreSQL Interview Question on Customers With No Orders

Given customers(id, name) and orders(order_id, customer_id, amount), return the names of customers who have never placed an order.

Solution Using LEFT JOIN ... WHERE orders.order_id IS NULL

SELECT c.name
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id
WHERE o.order_id IS NULL
ORDER BY c.name;

Why this works: the LEFT JOIN preserves every customer row regardless of whether a matching order exists; for matched customers, o.order_id carries a real value; for unmatched customers, the right-side columns are NULL and the WHERE o.order_id IS NULL predicate is TRUE; the filter keeps only the unmatched customers — the anti-join. Single pass over customers; one keyed lookup into orders per customer; no subquery materialization needed.

Step-by-step trace for the sample input:

customers.id customers.name orders.order_id after LEFT JOIN IS NULL? survives?
1 Alice 101 no ✗
2 Bob 102 no ✗
3 Carol NULL yes ✓
4 Dan NULL yes ✓

Carol and Dan survive the filter.

Output:

name
Carol
Dan

Why this works — concept by concept:

  • LEFT JOIN semantics — keeps every left row; right side is NULL when there is no match. This NULL is the entire signal we filter on.
  • WHERE o.order_id IS NULL — o.order_id is the right-side primary key; it is NULL only when the join produced a synthetic unmatched row. A real NULL order_id from the source table never happens because primary keys are NOT NULL.
  • Anti-join semantics — equivalent to NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id); the LEFT JOIN ... IS NULL form is typically faster on planners that materialize a hash join.
  • No NULL-swallowing — unlike NOT IN, the predicate is IS NULL, which is well-defined for NULL values. There is no silent zero-row failure.
  • ORDER BY c.name — deterministic ordering for reviewer stability.
  • O(|customers| + |orders|) time — hash-join build on orders.customer_id, single probe per customer. With an index on orders.customer_id this is near-linear.

Drill the SQL joins practice page for INNER, LEFT, and anti-join shapes, and the SQL null-handling practice page for NULL-aware predicates.



3. PostgreSQL GROUP BY, HAVING, and Conditional Aggregates

GROUP BY with HAVING, FILTER, and CASE for one-pass metrics in PostgreSQL

"Compute total revenue, refunded revenue, and the percentage refunded — in a single query" is the signature conditional-aggregate prompt — and the cleanest PostgreSQL answer is SUM(...) FILTER (WHERE …) clauses inside a single SELECT. The mental model: GROUP BY col collapses rows into buckets; COUNT(*), SUM(...), AVG(...), MIN(...), MAX(...) summarize each bucket; WHERE filters individual rows before grouping; HAVING filters groups after grouping; FILTER (WHERE …) and CASE WHEN … express conditional aggregates that count or sum only certain rows per group. The duplicate-finder pattern GROUP BY key HAVING COUNT(*) > 1 lives here too.

[Diagram: two panels — WHERE filters individual rows before grouping; HAVING filters whole groups after grouping.]

Pro tip: PostgreSQL supports the SQL standard FILTER (WHERE …) clause on every aggregate — COUNT(*) FILTER (WHERE status = 'refunded'). It produces clearer queries than SUM(CASE WHEN … THEN 1 ELSE 0 END) and is exactly what interviewers like to see. The CASE variant still works for portability across dialects.

COUNT, SUM, AVG, MIN, MAX — NULL-aware aggregates

The aggregate-NULL invariant: COUNT(*) counts every row including ones with NULL columns; COUNT(col) counts only rows where col is not NULL; SUM, AVG, MIN, MAX skip NULL values entirely; if every value in a group is NULL, the result is NULL (not 0). The distinction between COUNT(*) and COUNT(col) is the #1 source of "my counts are off by 10%" bugs.

  • COUNT(*) — every row in the bucket, regardless of NULLs.
  • COUNT(col) — non-NULL values of col only.
  • COUNT(DISTINCT col) — unique non-NULL values; essential after a JOIN that may have inflated rows.
  • SUM / AVG — numeric only; AVG is sum-of-non-null-divided-by-count-of-non-null, so NULL does not count as 0.

Worked example. Three rows in one customer's bucket: amount = 10, NULL, 30.

aggregate result
COUNT(*) 3
COUNT(amount) 2
SUM(amount) 40
AVG(amount) 20
MIN(amount) 10
MAX(amount) 30

Step-by-step explanation.

  1. COUNT(*) = 3 because every row in the bucket counts, regardless of amount's value.
  2. COUNT(amount) = 2 because the NULL row is skipped; only 10 and 30 contribute.
  3. SUM(amount) = 10 + 30 = 40; the NULL is treated as missing, not as 0.
  4. AVG(amount) = (10 + 30) / 2 = 20; the denominator is COUNT(amount) = 2, not COUNT(*) = 3.
  5. MIN and MAX skip the NULL and return the smallest/largest non-NULL value.

Worked-example solution.

SELECT customer_id,
       COUNT(*)        AS n_rows,
       COUNT(amount)   AS n_known,
       SUM(amount)     AS total,
       AVG(amount)     AS mean,
       MIN(amount)     AS lo,
       MAX(amount)     AS hi
FROM orders
GROUP BY customer_id;

Rule of thumb: if the metric is "people who clicked" use COUNT(DISTINCT user_id); if it is "click events" use COUNT(*); if it is "rows with a known value" use COUNT(col).
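The COUNT(DISTINCT ...) bullet above is the one that saves you after a grain-inflating join; a sketch assuming a hypothetical order_items child table:

```sql
-- After a 1:N join to order_items, each order appears once per item, so
-- COUNT(o.order_id) is inflated; COUNT(DISTINCT o.order_id) restores the
-- true per-customer order count.
SELECT o.customer_id,
       COUNT(o.order_id)          AS order_rows_after_join,  -- inflated
       COUNT(DISTINCT o.order_id) AS true_order_count
FROM orders o
JOIN order_items i
  ON i.order_id = o.order_id
GROUP BY o.customer_id;
```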

WHERE vs HAVING — row filter vs group filter

The two-clause invariant: WHERE runs before GROUP BY and references raw row columns; HAVING runs after grouping and can reference aggregate functions; trying to use WHERE COUNT(*) > 1 is a parse error because aggregates do not exist until after grouping. Both can appear in the same query.

  • WHERE — filter rows; uses col, col2, etc.
  • HAVING — filter groups; uses COUNT(*), SUM(col), etc.
  • Order of evaluation — FROM → WHERE → GROUP BY → HAVING → SELECT.
  • Performance — push predicates into WHERE whenever possible; WHERE filters before the (often expensive) sort/hash step.

Worked example. Six employees across eng and sales; find departments whose average salary exceeds 50,000 across employees earning more than 30,000.

department salary
eng 40,000
eng 70,000
eng 25,000
sales 60,000
sales 60,000
sales 20,000

Step-by-step explanation.

  1. WHERE salary > 30000 drops the two rows below the threshold (eng 25,000 and sales 20,000) — 4 rows remain.
  2. GROUP BY department collapses to two buckets: eng (40,000 + 70,000) and sales (60,000 + 60,000).
  3. The planner computes AVG(salary) per bucket: eng = 55,000; sales = 60,000.
  4. HAVING AVG(salary) > 50000 keeps both buckets (both averages exceed 50,000).
  5. The SELECT projects the department name and its average; final result is two rows.

Worked-example solution.

SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE salary > 30000
GROUP BY department
HAVING AVG(salary) > 50000;

Rule of thumb: aggregate predicate → HAVING; row predicate → WHERE. If the predicate uses SUM / COUNT / AVG / MIN / MAX, it must live in HAVING.

FILTER (WHERE …) and CASE — conditional aggregates

The conditional-aggregate invariant: SUM(col) FILTER (WHERE pred) and COUNT(*) FILTER (WHERE pred) apply the aggregate only to rows where the predicate is TRUE; the portable alternative is SUM(CASE WHEN pred THEN col ELSE 0 END) and COUNT(CASE WHEN pred THEN 1 END). PostgreSQL supports both; pick FILTER for clarity in PostgreSQL-only code, CASE for cross-dialect portability.

  • FILTER (WHERE …) — PostgreSQL/SQL-standard syntax; applies per-aggregate.
  • SUM(CASE WHEN … THEN col ELSE 0 END) — portable across dialects.
  • COUNT(CASE WHEN … THEN 1 END) — counts only matching rows; NULLs in the ELSE branch are skipped.
  • Multiple aggregates, one query — combine many FILTER clauses to compute several metrics in one pass.

Worked example. One pass over orders to compute total revenue, refunded revenue, and the refund rate.

customer_id total_revenue refunded_revenue refund_pct
101 500 50 10.0
102 1,000 0 0.0
103 800 200 25.0

Step-by-step explanation.

  1. SUM(amount) aggregates every row in the bucket → total_revenue.
  2. SUM(amount) FILTER (WHERE status = 'refunded') aggregates only refunded rows → refunded_revenue.
  3. The refund percentage is refunded_revenue / total_revenue * 100; cast one side to NUMERIC to avoid integer division.
  4. PostgreSQL evaluates every FILTER independently per row of input; one scan computes all metrics.
  5. The portable variant uses SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) — same result, slightly more verbose.

Worked-example solution.

SELECT customer_id,
       SUM(amount)                              AS total_revenue,
       SUM(amount) FILTER (WHERE status = 'refunded') AS refunded_revenue,
       ROUND(
         SUM(amount) FILTER (WHERE status = 'refunded')::NUMERIC
         / NULLIF(SUM(amount), 0) * 100, 1
       ) AS refund_pct
FROM orders
GROUP BY customer_id;

Rule of thumb: whenever you find yourself running two queries with different WHERE clauses against the same table and joining the results, refactor to a single query with two FILTER clauses — same answer, half the cost.
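For completeness, the same refund metrics in the portable CASE form mentioned above (the ::NUMERIC cast stays PostgreSQL-flavored; swap in CAST(... AS DECIMAL) elsewhere):

```sql
-- Portable variant of the FILTER query: CASE routes non-refunded rows to 0,
-- so the conditional sum still scans every row exactly once.
SELECT customer_id,
       SUM(amount) AS total_revenue,
       SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) AS refunded_revenue,
       ROUND(
         SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END)::NUMERIC
         / NULLIF(SUM(amount), 0) * 100, 1
       ) AS refund_pct
FROM orders
GROUP BY customer_id;
```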

Common beginner mistakes

  • WHERE COUNT(*) > 1 — parse error; aggregates do not exist until after GROUP BY. Use HAVING.
  • AVG(col) and assuming NULL rows count as 0 — they are excluded from both numerator and denominator. Use AVG(COALESCE(col, 0)) only if "missing means 0" is the business rule.
  • COUNT(DISTINCT col) forgotten after a JOIN that inflates rows — reports inflated counts.
  • Integer division — 5 / 100 = 0 in PostgreSQL. Cast one operand to NUMERIC or FLOAT.
  • Division by zero — NULLIF(denom, 0) converts 0 to NULL, so the division returns NULL instead of erroring.
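The last two mistakes fit in one throwaway SELECT:

```sql
SELECT 5 / 100          AS int_div,   -- 0: integer / integer truncates
       5 / 100.0        AS num_div,   -- 0.05: one NUMERIC operand fixes it
       5 / NULLIF(0, 0) AS safe_div;  -- NULL: NULLIF turns 0 into NULL, no error
```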

PostgreSQL Interview Question on Duplicate Emails

Given users(id, email), return every email that appears more than once, along with the number of copies.

Solution Using GROUP BY email HAVING COUNT(*) > 1

SELECT email,
       COUNT(*) AS n_copies
FROM users
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY n_copies DESC, email;

Why this works: GROUP BY email collapses every row with the same email into a single bucket; COUNT(*) counts how many rows fell into each bucket; HAVING COUNT(*) > 1 keeps only buckets with at least two rows; ORDER BY n_copies DESC, email produces a deterministic, reviewer-friendly output. Single pass over users; sort cost dominates only when email cardinality is huge.

Step-by-step trace for the sample input:

  1. FROM users — read all six rows.
  2. No WHERE — every row passes.
  3. GROUP BY email — three buckets: alice (3 rows), bob (2 rows), carol (1 row).
  4. COUNT(*) — 3, 2, 1 respectively.
  5. HAVING COUNT(*) > 1 — drops the carol bucket.
  6. ORDER BY n_copies DESC, email — alice (3), then bob (2).

Output:

| email | n_copies |
| --- | --- |
| alice@example.com | 3 |
| bob@example.com | 2 |

Why this works — concept by concept:

  • GROUP BY email — collapses to one bucket per distinct email; the bucket is the unit of all subsequent aggregates and group-level filters.
  • COUNT(*) — counts every row in the bucket, perfect for "how many copies".
  • HAVING COUNT(*) > 1 — group-level filter; the aggregate predicate must live here, not in WHERE. This is the precise interview signal for duplicate detection.
  • ORDER BY n_copies DESC, email — deterministic ordering; tie-broken by email so the output is stable across runs.
  • O(|users| + G log G) time — single hash aggregation produces G group rows; the final sort is G log G. With an index on email, the planner may use stream aggregation and skip the hash step.

Drill the SQL aggregation practice page for GROUP BY and HAVING shapes, and the SQL filtering practice page for WHERE vs HAVING distinctions.



4. PostgreSQL Window Functions — ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD

Ranking, top-N-per-group, running totals, and lookback in PostgreSQL window functions

"Find the second-highest distinct salary" and "compute a running total of daily revenue" are the two signature window-function prompts — and both reduce to a window function with OVER (PARTITION BY … ORDER BY …). The mental model: a window function computes a value across a window of rows related to the current row without collapsing the rows like GROUP BY does; OVER (PARTITION BY col) defines the window boundary; OVER (ORDER BY col) defines the order within the window. ROW_NUMBER assigns unique integers; RANK skips after ties (1, 2, 2, 4); DENSE_RANK does not skip (1, 2, 2, 3); LAG looks back; LEAD looks forward; SUM/AVG/COUNT(...) OVER (...) compute running totals and moving averages.

Three-column comparison of PostgreSQL window functions on a salary ladder with tied rows: ROW_NUMBER yields unique 1-2-3-4, RANK yields 1-2-2-4 with a +2 skip, DENSE_RANK yields 1-2-2-3 with no gap; a caption explains DENSE_RANK equals N for the Nth distinct value, plus an inset showing a running total via SUM(amount) OVER (ORDER BY date) on a small sales table.

Pro tip: Window functions cannot be referenced in WHERE of the same SELECT because they execute after WHERE. Wrap the window in a CTE or subquery, then filter on the alias. The error column "rn" does not exist after writing WHERE rn = 1 almost always means you forgot this rule.
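A minimal sketch of the wrap-then-filter pattern (the `events` table and `created_at` column are illustrative):

```sql
-- Fails: rn is computed after WHERE, so WHERE cannot see it.
-- SELECT *, ROW_NUMBER() OVER (ORDER BY created_at) AS rn
-- FROM events
-- WHERE rn = 1;

-- Works: compute the window in a CTE, then filter on the alias.
WITH numbered AS (
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY created_at) AS rn
    FROM events
)
SELECT *
FROM numbered
WHERE rn = 1;
```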

ROW_NUMBER — unique sequential numbering per partition

The ROW_NUMBER invariant: ROW_NUMBER() OVER (PARTITION BY p ORDER BY o) assigns a unique integer 1, 2, 3, … to every row inside each partition p, ordered by o; ties in o are broken arbitrarily by the planner. Use it when you need a unique sequence per group regardless of tie semantics — most often for deduplication (keep rn = 1 per business key).

  • OVER (PARTITION BY …) — bucket the rows; without this, the window is the whole result set.
  • OVER (ORDER BY …) — order within the bucket; required for ROW_NUMBER/RANK/LAG/LEAD.
  • Ties broken arbitrarily — add a tiebreaker column to ORDER BY for determinism.
  • Top-N-per-group — WHERE rn <= N after ROW_NUMBER; works only when ties at rank N can be ignored.

Worked example. employees with three engineers; rank by salary descending.

| department | name | salary | row_number |
| --- | --- | --- | --- |
| eng | Alice | 90,000 | 1 |
| eng | Bob | 80,000 | 2 |
| eng | Carol | 80,000 | 3 |

Bob and Carol tie on salary; ROW_NUMBER still gives them unique ranks (planner-chosen unless you add a tiebreaker).

Step-by-step explanation.

  1. PARTITION BY department defines the boundary — only eng rows are compared with each other; if there were a sales partition it would have its own 1, 2, 3 sequence.
  2. ORDER BY salary DESC, name orders rows within the partition: Alice (90,000) first, then Bob and Carol (tied at 80,000) broken by name.
  3. ROW_NUMBER() assigns 1, 2, 3 sequentially regardless of ties; Bob gets 2 and Carol gets 3 because name breaks the tie.
  4. Without the , name tiebreaker, Bob/Carol order is undefined — two query runs could swap them.
  5. To deduplicate a table that has multiple rows per (business_key, source_ts), use ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY source_ts DESC) = 1 to keep the latest.

Worked-example solution.

SELECT department, name, salary,
       ROW_NUMBER() OVER (
         PARTITION BY department
         ORDER BY salary DESC, name
       ) AS rn
FROM employees;

Rule of thumb: ROW_NUMBER is the right tool for deduplication (WHERE rn = 1) and for ordered streams; reach for RANK or DENSE_RANK when ties must be honored.
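The deduplication pattern described in step 5 looks like this in full (the `raw_customers` table and its `business_key` / `source_ts` columns are illustrative):

```sql
-- Keep only the newest row per business key.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY business_key
               ORDER BY source_ts DESC
           ) AS rn
    FROM raw_customers
)
SELECT *
FROM ranked
WHERE rn = 1;   -- rn = 1 is the latest row in each partition
```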

RANK vs DENSE_RANK — tie semantics

The rank-vs-dense-rank invariant: both assign the same rank to tied rows; RANK then skips the next k-1 ranks (gap), while DENSE_RANK continues without a gap. For "find the Nth distinct value" questions, DENSE_RANK = N is the correct filter; for "find the Nth row in skip-aware ranking order", RANK = N is correct.

  • RANK → 1, 2, 2, 4 — skips after ties.
  • DENSE_RANK → 1, 2, 2, 3 — no skip.
  • ROW_NUMBER → 1, 2, 3, 4 — never ties.
  • Pick by semantics — "Nth highest distinct salary" → DENSE_RANK = N; "Nth-ranked row in skip ordering" → RANK = N.

Worked example. Four employees; Bob and Carol tied at second-highest salary.

| name | salary | rank | dense_rank | row_number |
| --- | --- | --- | --- | --- |
| Alice | 90,000 | 1 | 1 | 1 |
| Bob | 80,000 | 2 | 2 | 2 |
| Carol | 80,000 | 2 | 2 | 3 |
| Dan | 70,000 | 4 | 3 | 4 |

RANK jumps 2 → 4 (skipping 3); DENSE_RANK continues 2 → 3 (no skip).

Step-by-step explanation.

  1. All three window functions agree on Alice (rank 1) because she is alone at the top.
  2. Bob and Carol both get rank = 2 and dense_rank = 2 because they tie on salary; row_number gives them distinct values 2 and 3.
  3. Dan is the next-lowest salary; RANK skips ahead by the number of tied rows (2 tied → next rank is 2 + 2 = 4); DENSE_RANK continues with no gap (3).
  4. For "second highest distinct salary", DENSE_RANK = 2 correctly returns 80,000; RANK = 2 would also work here, but RANK would not return 80,000 if three people tied for first (it would skip to 4).
  5. For "top 3 distinct salaries", use DENSE_RANK <= 3 — it returns Alice, Bob, Carol, Dan (four rows because Bob/Carol both have dr = 2).

Worked-example solution.

SELECT name, salary,
       RANK()       OVER (ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dr,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn
FROM employees;

Rule of thumb: "second highest salary" → DENSE_RANK = 2; "top 3 distinct salaries" → DENSE_RANK <= 3; never use RANK for these unless the spec explicitly says ties should consume rank slots.
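For completeness, the canonical "second-highest distinct salary" query (assuming the `employees(name, salary)` table from the example above):

```sql
-- DENSE_RANK = 2 is the second-highest distinct salary,
-- regardless of how many people tie for first.
WITH ranked AS (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dr
    FROM employees
)
SELECT DISTINCT salary
FROM ranked
WHERE dr = 2;
```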

LAG, LEAD, and running totals — lookback, lookahead, and SUM(...) OVER (...)

The lookback-and-running invariant: LAG(col, n) returns the value of col n rows back within the partition (default n=1); LEAD(col, n) is the symmetric forward; SUM(col) OVER (PARTITION BY p ORDER BY o) produces a running total within each partition. These three primitives drive month-over-month deltas, sessionization, running balances, and moving averages.

  • LAG(amount) OVER (ORDER BY date) — previous day's amount.
  • LEAD(amount) OVER (ORDER BY date) — next day's amount.
  • amount - LAG(amount) OVER (ORDER BY date) — day-over-day delta.
  • SUM(amount) OVER (ORDER BY date) — running total from start of partition through current row.

Worked example. Three days of sales; compute previous-day amount, day-over-day delta, and running total.

| sales_date | amount | prev_amount | dod_delta | running_total |
| --- | --- | --- | --- | --- |
| 2026-05-09 | 100 | NULL | NULL | 100 |
| 2026-05-10 | 130 | 100 | 30 | 230 |
| 2026-05-11 | 120 | 130 | -10 | 350 |

The first day has LAG = NULL because no prior row exists; consumers usually COALESCE(delta, 0) for display.

Step-by-step explanation.

  1. LAG(amount) OVER (ORDER BY sales_date) returns the previous row's amount, ordered by date.
  2. Day 1 (May 9): no previous row, so LAG = NULL; amount - LAG = NULL.
  3. Day 2 (May 10): LAG = 100; delta = 130 - 100 = 30.
  4. Day 3 (May 11): LAG = 130; delta = 120 - 130 = -10.
  5. SUM(amount) OVER (ORDER BY sales_date) accumulates from the start of the partition through the current row: 100, 230, 350.

Worked-example solution.

SELECT sales_date,
       amount,
       LAG(amount) OVER (ORDER BY sales_date)            AS prev_amount,
       amount - LAG(amount) OVER (ORDER BY sales_date)   AS dod_delta,
       SUM(amount) OVER (ORDER BY sales_date)            AS running_total
FROM sales
ORDER BY sales_date;

Rule of thumb: LAG for "compare this row to its predecessor" (delta, retention, gap); LEAD for "what happens next" (sessionization, churn-from-here); SUM(...) OVER (...) for running totals — always PARTITION BY the entity if the table holds multiple series.
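When the table holds one series per entity, the partitioned variant looks like this (the `orders(customer_id, order_date, amount)` schema is illustrative):

```sql
-- Per-customer running total; PARTITION BY keeps each
-- customer's series separate so totals never bleed across entities.
SELECT customer_id,
       order_date,
       amount,
       SUM(amount) OVER (
           PARTITION BY customer_id
           ORDER BY order_date
       ) AS running_total
FROM orders
ORDER BY customer_id, order_date;
```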

Common beginner mistakes

  • Using RANK when the question wants the Nth distinct value — RANK = 2 skips entirely if two rows tie for first.
  • Forgetting PARTITION BY for a per-group ranking — produces a global ranking instead of per-department.
  • Referencing the window-function alias in WHERE of the same SELECT — window functions execute after WHERE; wrap in a CTE or subquery first.
  • Confusing LAG (previous) with LEAD (next) — quietly produces inverted deltas.
  • Forgetting ORDER BY inside OVER for ROW_NUMBER/RANK/LAG/LEAD — required; the result is non-deterministic without it.

PostgreSQL Interview Question on Top 3 Salaries Per Department

Given employees(emp_id, name, department, salary), return the top 3 distinct salaries per department, with ties at rank 3 included. Output department, name, salary, and the rank.

Solution Using DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) in a CTE

WITH ranked AS (
    SELECT department,
           name,
           salary,
           DENSE_RANK() OVER (
               PARTITION BY department
               ORDER BY salary DESC
           ) AS dr
    FROM employees
)
SELECT department, name, salary, dr
FROM ranked
WHERE dr <= 3
ORDER BY department, dr, name;

Why this works: the CTE ranked materializes a per-department DENSE_RANK keyed by salary descending — dr = 1 is the highest distinct salary in that department, dr = 2 is the second-highest, and so on; the outer WHERE dr <= 3 keeps every row whose salary is in the top three distinct salaries of its department, including all ties at rank 3; the ORDER BY produces a deterministic, reviewer-friendly output. DENSE_RANK over RANK because the spec wants the top three distinct salaries; DENSE_RANK over ROW_NUMBER because ties at rank 3 must be retained.

Step-by-step trace for the sample input:

| emp_id | name | department | salary |
| --- | --- | --- | --- |
| 1 | Alice | eng | 90,000 |
| 2 | Bob | eng | 80,000 |
| 3 | Carol | eng | 80,000 |
| 4 | Dan | eng | 70,000 |
| 5 | Eve | eng | 60,000 |
| 6 | Frank | sales | 100,000 |
| 7 | Grace | sales | 90,000 |
| 8 | Heidi | sales | 80,000 |

  1. CTE ranked — partition by department; order by salary DESC.
  2. DENSE_RANK per partition — eng: Alice → 1, Bob → 2, Carol → 2, Dan → 3, Eve → 4. sales: Frank → 1, Grace → 2, Heidi → 3.
  3. Outer WHERE dr <= 3 — drops Eve (dr = 4); keeps both Bob and Carol (tied at 2) and Dan (3).
  4. ORDER BY department, dr, name — eng rows first, then sales; within department by dr, then name for tiebreak.

Output:

| department | name | salary | dr |
| --- | --- | --- | --- |
| eng | Alice | 90000 | 1 |
| eng | Bob | 80000 | 2 |
| eng | Carol | 80000 | 2 |
| eng | Dan | 70000 | 3 |
| sales | Frank | 100000 | 1 |
| sales | Grace | 90000 | 2 |
| sales | Heidi | 80000 | 3 |

Why this works — concept by concept:

  • CTE ranked — names the intermediate ranked result; the outer query then filters it like a regular table. Far cleaner than a nested subquery.
  • PARTITION BY department — restarts the rank at each department boundary; without this, the rank is global and the answer is wrong.
  • ORDER BY salary DESC — defines "highest first" inside each partition; required for any deterministic ranking.
  • DENSE_RANK over RANK — the spec wants the top three distinct salaries; RANK would skip after ties and miss the third distinct salary if there is a two-way tie above it.
  • WHERE dr <= 3 in the outer — window functions cannot be referenced in WHERE of the same SELECT; the CTE provides the materialized column the outer can filter on.
  • O(N log N) time — sort within each partition dominates; with an index on (department, salary DESC) the planner can stream rather than sort.

More SQL window-function practice problems and SQL CTE practice problems on PipeCode.



Tips to use this PostgreSQL cheat sheet effectively

Hold the clause-order diagram in your head

FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. Memorize this sentence and 80% of "weird" PostgreSQL parse errors decode themselves in five seconds. The error column "x" does not exist almost always means you referenced a SELECT alias in WHERE; the error aggregate functions are not allowed in WHERE means you wanted HAVING instead.
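A minimal illustration of the alias trap (the `orders` schema is hypothetical):

```sql
-- Fails: total is a SELECT alias, and WHERE runs before SELECT.
-- SELECT amount * quantity AS total FROM orders WHERE total > 100;

-- Works: repeat the expression in WHERE, or wrap in a subquery/CTE.
SELECT amount * quantity AS total
FROM orders
WHERE amount * quantity > 100;
```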

State the grain before any JOIN

Before writing the JOIN, name the grain you're producing: "this is order-line grain", "this is customer-day grain", "this is (customer, product) grain". The single most common bug in analytical SQL is SUM(left.col) after a 1:N join — the metric is silently multiplied by N. If grain doubles, you'll spot it immediately.
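One way to defuse the 1:N multiplication is to pre-aggregate the N side back to the left table's grain before joining — a sketch assuming hypothetical `orders(order_id, customer_id, total)` and `line_items(order_id, ...)` tables:

```sql
-- Joining orders directly to line_items would repeat o.total once per
-- line item, silently inflating SUM(o.total). Collapse line_items to
-- order grain first, then join 1:1.
SELECT o.customer_id,
       SUM(o.total)                 AS revenue,       -- safe: one row per order
       SUM(COALESCE(li.n_items, 0)) AS items_bought
FROM orders o
LEFT JOIN (
    SELECT order_id, COUNT(*) AS n_items
    FROM line_items
    GROUP BY order_id
) li ON li.order_id = o.order_id
GROUP BY o.customer_id;
```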

Use LEFT JOIN ... IS NULL over NOT IN for anti-joins

NOT IN (subquery) returns zero rows when the subquery contains a single NULL because x NOT IN (..., NULL, ...) is NULL, which fails the WHERE predicate. LEFT JOIN ... WHERE right.id IS NULL and NOT EXISTS (...) are immune. Production engineers who have been burned once never write NOT IN again.
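Side by side, with generic tables `a` and `b`:

```sql
-- Dangerous: returns zero rows if any b.fk is NULL,
-- because x NOT IN (..., NULL, ...) evaluates to NULL.
-- SELECT * FROM a WHERE a.pk NOT IN (SELECT fk FROM b);

-- Safe anti-join, variant 1: LEFT JOIN ... IS NULL.
SELECT a.*
FROM a
LEFT JOIN b ON b.fk = a.pk
WHERE b.pk IS NULL;

-- Safe anti-join, variant 2: NOT EXISTS.
SELECT a.*
FROM a
WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.fk = a.pk);
```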

Pick DENSE_RANK for "Nth distinct"; pick ROW_NUMBER for deduplication

The single most-graded ranking distinction: DENSE_RANK = N is the Nth distinct value; RANK = N is the Nth row in skip-aware ranking order; ROW_NUMBER = N is the Nth row in arbitrary order. For "second-highest distinct salary" → DENSE_RANK = 2. For "remove duplicate rows keeping the canonical one" → ROW_NUMBER() OVER (PARTITION BY key ORDER BY tiebreaker) = 1.

Use FILTER (WHERE …) for one-pass conditional metrics

SUM(amount) FILTER (WHERE status = 'refunded') is cleaner than SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) — PostgreSQL supports both. Use FILTER in PostgreSQL-only code, CASE for cross-dialect portability. One scan, many metrics, half the cost of two queries.

Always ORDER BY + tiebreaker; pair LIMIT with ORDER BY

Window functions, LIMIT N, and "top result" queries all require an ORDER BY with a deterministic tiebreaker (e.g., ORDER BY salary DESC, name). Without one, two runs of the same query can return different rows in the tie band — silently wrong in production and visibly wrong in an interview if the reviewer's reference answer locks an ordering.

Use PostgreSQL-specific helpers — EXTRACT, DATE_TRUNC, INTERVAL, ::DATE cast

EXTRACT(MONTH FROM ts), DATE_TRUNC('month', ts), ts - INTERVAL '1 month', ts::DATE. These four cover 95% of date arithmetic. Reach for DATE_TRUNC whenever the spec says "by month" or "by week" — it groups timestamps to the bucket boundary deterministically.
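A typical "by month" shape combining these helpers (the `orders(order_ts, amount)` schema is illustrative):

```sql
-- Monthly revenue for the last six full-ish months:
-- DATE_TRUNC snaps every timestamp to its month boundary,
-- so GROUP BY buckets deterministically.
SELECT DATE_TRUNC('month', order_ts) AS month,
       SUM(amount)                   AS revenue
FROM orders
WHERE order_ts >= DATE_TRUNC('month', NOW()) - INTERVAL '6 months'
GROUP BY 1
ORDER BY 1;
```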

Where to practice on PipeCode

Start with the SQL practice surface for the all-language SQL corpus. Drill the four-primitive pages: SQL filtering for WHERE patterns, SQL joins for join shapes, SQL aggregation for GROUP BY + HAVING, SQL window functions for ranking and lookback. Add adjacent topics: SQL CTE, SQL subqueries, SQL null-handling, SQL date functions. The interview courses page bundles structured curricula — start with SQL for Data Engineering Interviews — From Zero to FAANG. For broader coverage, browse by topic or read the related SQL interview questions for data engineering and data lake architecture for data engineering interviews blogs.


Frequently Asked Questions

What is the logical clause order in a PostgreSQL query?

PostgreSQL evaluates clauses in the order FROM / JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT / OFFSET, regardless of the order you write them. This is why WHERE cannot reference aggregate functions (they don't exist until after GROUP BY) and why SELECT-level aliases cannot be referenced in WHERE (they're computed in stage 5). Aliases become available only in ORDER BY and the outer query in a nested context.

What is the difference between WHERE and HAVING in PostgreSQL?

WHERE filters individual rows before the GROUP BY step and can reference only raw row columns. HAVING filters whole groups after the GROUP BY step and can reference aggregate functions like COUNT(*), SUM(col), AVG(col). Trying to use an aggregate in WHERE (e.g., WHERE COUNT(*) > 1) is a parse error because the aggregate does not yet exist. Both clauses can appear in the same query.

How do I find rows in table A that have no match in table B?

The canonical PostgreSQL pattern is SELECT a.* FROM a LEFT JOIN b ON b.fk = a.pk WHERE b.pk IS NULL — the LEFT JOIN preserves every left row, and the WHERE b.pk IS NULL filter keeps only the ones where no right-side match was found. This is the anti-join pattern. Equivalent to WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.fk = a.pk). Both are safer than NOT IN (subquery), which returns zero rows if the subquery contains a single NULL.

What is the difference between RANK, DENSE_RANK, and ROW_NUMBER?

All three assign integers within a window. ROW_NUMBER gives every row a unique sequential integer (1, 2, 3, 4), even on ties. RANK gives tied rows the same rank but skips after them (1, 2, 2, 4). DENSE_RANK gives tied rows the same rank with no skip (1, 2, 2, 3). For "Nth distinct value" use DENSE_RANK = N; for "Nth row in skip-aware ranking order" use RANK = N; for "Nth row in arbitrary order" or "deduplicate keeping one canonical row" use ROW_NUMBER = 1.

What does FILTER (WHERE …) do in PostgreSQL aggregates?

SUM(col) FILTER (WHERE pred) and COUNT(*) FILTER (WHERE pred) apply the aggregate only to rows where the predicate is TRUE; rows where the predicate is FALSE or NULL are skipped for that aggregate, while other aggregates in the same SELECT still see them. The portable cross-dialect equivalent is SUM(CASE WHEN pred THEN col ELSE 0 END) and COUNT(CASE WHEN pred THEN 1 END). Use FILTER for clarity in PostgreSQL-only code.

How do I compute a running total in PostgreSQL?

Use SUM(col) OVER (PARTITION BY p ORDER BY o) — the window aggregate accumulates from the start of each partition through the current row in the order defined by ORDER BY. Example: SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) gives a per-customer running total of order amounts ordered by date. Drop PARTITION BY for a single global running total.

Why is LIMIT 5 returning different rows on different runs?

LIMIT without ORDER BY is non-deterministic — PostgreSQL returns whatever rows it sees first, which depends on the query plan, parallelism, and table physical layout. Always pair LIMIT with ORDER BY <col> DESC, <tiebreaker> so two runs return the same rows. Reviewers depend on stable ordering, and dashboards break silently when row order drifts.


Start practicing PostgreSQL SQL problems
