DEV Community

Gowtham Potureddi

PostgreSQL SQL Cheat Sheet — Clause Order, Joins, Aggregates, Windows

A PostgreSQL SQL cheat sheet is only useful when every row in it maps to something you can drop straight into a query — not a wall of syntax with no operational explanation. This guide condenses real PostgreSQL fluency to four primitives: the logical clause order (FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT), the six join shapes and the grain trap they create, GROUP BY with HAVING plus conditional aggregates for one-pass metrics, and window functions like ROW_NUMBER, RANK, DENSE_RANK, LAG, and LEAD for ranking and lookback. These four cover the bulk of analytical SQL — and the cheat-sheet style below is built so you can scan, copy a snippet, and tweak it for your own schema.

Every section walks through a detailed topic explanation, sub-topics with worked examples and runnable solutions, common beginner mistakes, and a worked interview-style scenario with a full traced answer. PostgreSQL syntax throughout — the dialect that drives DataLemur, CoderPad, most product-analytics live screens, and the bulk of modern data-engineering SQL corpora.



Top PostgreSQL SQL cheat sheet topics

The four numbered sections below follow this topic map — one row per H2, every row expanded into a full section with sub-topics, worked examples, a worked interview question, and a step-by-step traced solution:

# Topic Why it shows up in a PostgreSQL cheat sheet
1 Logical clause order — FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT The single most useful PostgreSQL mental model: the order you write clauses is not the order the engine evaluates them; knowing the evaluation order eliminates 80% of parse errors and explains why WHERE cannot reference aggregates or column aliases.
2 Joins and grain — INNER, LEFT, RIGHT, FULL, SELF, CROSS Joins combine rows but they also change grain; a careless 1:N join inflates row counts silently, and the LEFT JOIN ... IS NULL anti-join is the canonical "find rows in A with no match in B" pattern (orphan customers, churned users).
3 GROUP BY, HAVING, and conditional aggregates WHERE filters rows before grouping; HAVING filters groups after; COUNT(*) FILTER (WHERE …) and SUM(CASE WHEN …) express many metrics in one query — the universal duplicate finder HAVING COUNT(*) > 1 lives here.
4 Window functions — ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD Per-partition ranking without collapsing rows, top-N-per-group, second-highest salary, running totals with SUM(...) OVER (PARTITION BY ... ORDER BY ...), and month-over-month deltas via LAG; the most-graded primitive in modern SQL screens.

Beginner-friendly framing: every analytical SQL question reduces to four steps — filter rows, join tables without changing grain by accident, aggregate or rank, then present the result. Holding the clause-order diagram in your head (Section 1) lets you write SQL outside-in: pick the grain, then the joins, then the filters, then the projection. The cheat sheet below is organized in the same order you would write a real query.


1. PostgreSQL Logical Clause Order — FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT

The seven-stage evaluation order every PostgreSQL query follows

"Why does WHERE customer_count > 5 give me a parse error when I'm clearly counting customers?" is the signature beginner question — and the answer is logical clause order. The mental model: PostgreSQL evaluates clauses in a fixed order that is different from the order you write them; FROM/JOIN builds the row set, WHERE filters rows, GROUP BY collapses rows into groups, HAVING filters groups, SELECT projects columns, ORDER BY sorts, LIMIT/OFFSET trims. WHERE cannot reference aggregate functions because aggregates do not exist until after GROUP BY; column aliases declared in SELECT cannot be referenced in WHERE for the same reason.

[Diagram: horizontal seven-step PostgreSQL clause-order flow — FROM/JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT/OFFSET.]

Pro tip: Memorize one sentence — "From-Where-Group-Having-Select-Order-Limit" — and you can decode any PostgreSQL parse error in under five seconds. The error column "customer_count" does not exist almost always means the column is a SELECT-level alias being referenced in WHERE, which runs three stages earlier; lift the predicate into HAVING (if it references an aggregate) or repeat the expression inline in WHERE.

FROM and JOIN — build the working row set

The FROM/JOIN invariant: the first stage assembles a candidate row set by listing the tables (and how they join); every subsequent stage operates on this row set. Subqueries in FROM are also evaluated here, and LATERAL joins let later subqueries reference earlier rows.

  • Single table — FROM orders produces one row per orders row.
  • Joined tables — FROM orders o JOIN customers c ON c.id = o.customer_id produces one row per matching pair.
  • Subquery in FROM — FROM (SELECT ...) t materializes the inner result, then treats it as a table.
  • LATERAL subquery — FROM orders o, LATERAL (SELECT ... WHERE x = o.id) s re-evaluates the inner subquery per outer row.
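The LATERAL shape appears nowhere else in this section, so here is one hedged sketch against the same customers/orders tables used in the worked examples — a "latest order per customer" query:

```sql
-- Latest order per customer: the LATERAL subquery is re-evaluated once per
-- customer row, so each customer gets their own ORDER BY ... LIMIT 1 result.
SELECT c.name, recent.order_id, recent.amount
FROM customers c
LEFT JOIN LATERAL (
  SELECT o.order_id, o.amount
  FROM orders o
  WHERE o.customer_id = c.id
  ORDER BY o.order_date DESC
  LIMIT 1
) recent ON TRUE;
```

LEFT JOIN ... ON TRUE keeps customers with no orders at all; a plain comma + LATERAL would drop them the way an INNER JOIN does.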

Worked example. A FROM with a LEFT JOIN that produces the right row set before any filter runs.

step output cardinality
FROM customers alone 3 rows
LEFT JOIN orders 4 rows (Alice has 2 orders, Bob 1, Carol 0 padded with NULLs)
ready for WHERE filtering 4 rows

Step-by-step explanation.

  1. The engine reads customers first, producing three rows (Alice, Bob, Carol).
  2. For each customer, it scans orders for matching customer_id rows; Alice matches 2 orders, Bob matches 1, Carol matches 0.
  3. Because the join is LEFT, Carol's row is preserved with the right-side columns filled with NULLs — total 4 rows.
  4. This 4-row stream is what WHERE will see; no filtering has happened yet.
  5. Without understanding that FROM runs first, you cannot reason about why a WHERE predicate on the right side of a LEFT JOIN silently converts the join into an INNER JOIN.

Worked-example solution.

SELECT c.name, o.order_id, o.amount
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id;

Rule of thumb: if your LEFT JOIN is producing fewer rows than expected, check whether you have a WHERE predicate that references the right-side table — that predicate runs after the join and discards the NULL-padded rows.
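A minimal sketch of the fix, on the same customers/orders tables: a right-side predicate that must not discard unmatched rows belongs in the ON condition, not in WHERE.

```sql
-- Right-side filter inside ON: Carol (no orders) is still NULL-padded and kept.
SELECT c.name, o.order_id, o.amount
FROM customers c
LEFT JOIN orders o
  ON  o.customer_id = c.id
  AND o.amount > 0;          -- part of the join condition, evaluated per pair

-- The same predicate in WHERE would run after the join, evaluate NULL > 0 to
-- NULL for Carol's padded row, and silently drop her (an accidental INNER JOIN).
```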

WHERE — row-level predicates before grouping

The WHERE invariant: WHERE filters individual rows from the FROM/JOIN output before GROUP BY runs; it can reference any column from the joined row set, but cannot reference aggregate functions or SELECT-level aliases. This is the cheapest place to drop rows — push predicates here whenever possible.

  • Row predicates — WHERE amount > 30, WHERE order_date >= '2026-01-01'.
  • IN / EXISTS — WHERE customer_id IN (SELECT id FROM premium), WHERE EXISTS (...).
  • BETWEEN — inclusive on both ends; WHERE x BETWEEN 1 AND 10 is x >= 1 AND x <= 10.
  • IS NULL / IS NOT NULL — the only way to check for NULL; never = NULL.

Worked example. Filter to one day of orders before grouping.

filter rows surviving
no filter 12,847 (full day's orders)
WHERE order_date = '2026-05-10' 12,847 (already today's)
WHERE order_date = '2026-05-10' AND amount > 100 4,290 (high-value only)

Step-by-step explanation.

  1. FROM orders returns the full row stream.
  2. WHERE order_date = '2026-05-10' is evaluated per row; rows with other dates are dropped.
  3. AND amount > 100 is evaluated next; this is a row predicate (not an aggregate), so it lives in WHERE correctly.
  4. The surviving row set (4,290 rows) flows into GROUP BY if one is present, otherwise into SELECT.
  5. Pushing the date filter into WHERE rather than HAVING is critical for index usage: a B-tree index on order_date can prune 95% of the table before any grouping happens.

Worked-example solution.

SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE order_date = '2026-05-10'
  AND amount > 100
GROUP BY customer_id;

Rule of thumb: if the predicate uses only raw row columns, it belongs in WHERE; if it uses SUM, COUNT, AVG, MIN, MAX, it belongs in HAVING.

GROUP BY → HAVING → SELECT → ORDER BY → LIMIT

The downstream invariant: after WHERE, the engine evaluates GROUP BY (collapsing rows into one row per distinct key combination), then HAVING (filtering groups), then SELECT (projecting columns and computing expressions), then ORDER BY (sorting the final result), then LIMIT/OFFSET (trimming for pagination). SELECT-level aliases become referenceable in ORDER BY (and, as a PostgreSQL extension, in GROUP BY) and in any outer query that wraps this one as a subquery.

  • GROUP BY col1, col2 — one output row per distinct (col1, col2) combination.
  • HAVING agg_pred — filter groups; can reference COUNT(*), SUM(col), etc.
  • SELECT col, agg(col2) AS x — project columns; aggregates and aliases are computed here.
  • ORDER BY x DESC, col — can reference SELECT aliases; deterministic with a tiebreaker.
  • LIMIT N OFFSET M — page slicing; always pair with ORDER BY for determinism.

Worked example. Group by customer, filter to high-spend customers, sort descending, top 5.

stage rows
FROM orders WHERE order_date = '2026-05-10' 4,290
GROUP BY customer_id 1,720 (one row per customer)
HAVING SUM(amount) > 500 312 (high-spend)
SELECT customer_id, SUM(amount) AS spend 312 (projected)
ORDER BY spend DESC, customer_id 312 (sorted)
LIMIT 5 5

Step-by-step explanation.

  1. WHERE produces 4,290 rows for one day with amount > 100.
  2. GROUP BY customer_id collapses them into 1,720 buckets, one per customer.
  3. HAVING SUM(amount) > 500 keeps only the 312 buckets whose total spend exceeds $500.
  4. SELECT computes the alias spend = SUM(amount) and projects two columns.
  5. ORDER BY spend DESC, customer_id sorts the 312 surviving rows by descending spend with a deterministic tiebreaker; LIMIT 5 returns just the top five.

Worked-example solution.

SELECT customer_id, SUM(amount) AS spend
FROM orders
WHERE order_date = '2026-05-10'
  AND amount > 100
GROUP BY customer_id
HAVING SUM(amount) > 500
ORDER BY spend DESC, customer_id
LIMIT 5;

Rule of thumb: every clause has a fixed slot; if you find yourself wanting WHERE to reference an aggregate, the predicate belongs in HAVING instead — and if you want ORDER BY to use a long expression, alias it in SELECT and reference the alias.

Common beginner mistakes

  • WHERE COUNT(*) > 1 — parse error; aggregates do not exist until after GROUP BY. Use HAVING.
  • Referencing a SELECT alias in WHERE — WHERE spend > 100 after SELECT SUM(amount) AS spend fails; either repeat the expression or move the predicate to HAVING.
  • Selecting a non-aggregated, non-GROUP BY column — strict PostgreSQL errors out with "must appear in GROUP BY"; some other dialects pick an arbitrary row silently.
  • LIMIT 5 without ORDER BY — non-deterministic; two runs of the same query return different rows.
  • Putting HAVING before GROUP BY — syntax error; the clause order is mandatory.
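The first two mistakes share one fix; a minimal before/after sketch on the orders table:

```sql
-- Broken: spend is a SELECT-level alias, and WHERE runs before SELECT.
-- SELECT customer_id, SUM(amount) AS spend
-- FROM orders
-- WHERE spend > 100          -- ERROR: column "spend" does not exist
-- GROUP BY customer_id;

-- Fixed: the aggregate predicate moves into HAVING.
SELECT customer_id, SUM(amount) AS spend
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 100;
```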

PostgreSQL Interview Question on Clause Order

Given orders(order_id, customer_id, amount, order_date), find every customer who placed more than 3 orders today with total spend above $500. Return customer_id and total_spend, sorted by total_spend descending.

Solution Using WHERE + GROUP BY + HAVING in the Right Slots

SELECT customer_id,
       SUM(amount) AS total_spend
FROM orders
WHERE order_date = CURRENT_DATE
GROUP BY customer_id
HAVING COUNT(*) > 3
   AND SUM(amount) > 500
ORDER BY total_spend DESC, customer_id;

Why this works: WHERE order_date = CURRENT_DATE filters to today's row set first (cheap, index-friendly); GROUP BY customer_id collapses to one row per customer; HAVING evaluates the two aggregate predicates together (more than 3 orders AND total > $500); SELECT projects the alias total_spend; ORDER BY total_spend DESC, customer_id produces a deterministic ordering. Single pass over today's rows with hash aggregation.

Step-by-step trace for sample data on 2026-05-10:

customer_id orders today sum(amount) passes HAVING?
101 5 720 ✓
102 4 410 ✗ (sum ≤ 500)
103 6 1,250 ✓
104 2 800 ✗ (count ≤ 3)
105 4 520 ✓

Three customers survive both predicates.

Output:

customer_id total_spend
103 1250
101 720
105 520

Why this works — concept by concept:

  • WHERE first — order_date = CURRENT_DATE is a row predicate using a non-aggregated column; pushing it into WHERE shrinks the row set before grouping and lets the planner use a B-tree index on order_date.
  • GROUP BY customer_id — collapses today's rows into one bucket per customer; every subsequent aggregate is computed inside this bucket.
  • HAVING two-predicate AND — COUNT(*) > 3 and SUM(amount) > 500 are both aggregate predicates; combining them with AND in a single HAVING is the canonical multi-condition group filter.
  • SELECT projection + alias — SUM(amount) AS total_spend is computed here; the alias becomes available to ORDER BY (but not to WHERE / HAVING).
  • ORDER BY total_spend DESC, customer_id — descending sort on the metric with a deterministic tiebreaker via customer_id; reviewers depend on stable ordering.
  • O(|today's orders| + G log G) time — single hash aggregation produces G groups; final sort is G log G. With an index on (order_date, customer_id) the planner can stream rather than hash.

Drill the SQL filtering practice page for WHERE patterns and the SQL aggregation practice page for GROUP BY + HAVING shapes.



2. PostgreSQL Joins and Grain — INNER, LEFT, RIGHT, FULL, SELF, CROSS

Joins, anti-joins, and the grain-inflation trap in PostgreSQL

"Why is SUM(amount) returning double what I expect after I add a JOIN?" is the signature grain-inflation question — and the answer is that joins do not just combine columns; they change the row cardinality of the result. The mental model: INNER JOIN keeps only matching pairs, LEFT JOIN keeps every left row and pads the right side with NULLs, RIGHT JOIN is the mirror, FULL OUTER JOIN keeps both sides' unmatched rows, SELF JOIN joins a table to itself (for hierarchies and pair queries), CROSS JOIN produces a Cartesian product (one row per (left, right) pair). The cardinality of any join is bounded by |left| × |right|, and a 1:N relationship inflates left rows by N — the silent source of doubled metrics.

[Diagram: Venn diagrams for INNER JOIN (the intersection of Table A and Table B) and LEFT JOIN (all of Table A, with a NULL pocket where Table B has no match).]

Pro tip: Before writing any join, ask "what is the grain of the result?" — orders, order lines, customer-day, or (customer, product) pair. A 1:N join (e.g., customers to orders) inflates customer rows by the number of orders; SUM(customer.lifetime_value) after that join returns lifetime value × order count, not lifetime value. Always state the grain out loud.

INNER JOIN — keep only matching pairs (no padding)

The INNER JOIN invariant: a left row is paired with a right row iff the join predicate is TRUE; unmatched rows on either side are discarded; the result cardinality is the count of matching pairs. This is the most common join and the fastest because the planner can short-circuit on no-match.

  • ON l.key = r.key — single-column equi-join; the planner hashes the right table.
  • Multi-columnON l.a = r.a AND l.b = r.b for composite keys.
  • Non-equiON l.range_start <= r.point AND l.range_end >= r.point (range join).
  • USING (col) — shorthand when both sides share the column name; merges the column.
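The non-equi shape deserves a sketch of its own — a range join against hypothetical salaries(emp_id, salary) and tax_brackets(bracket, lo, hi) tables (both illustrative, not from the examples above):

```sql
-- Range join: each salary matches the one bracket whose [lo, hi) interval
-- contains it; the join predicate is two inequalities, not an equality.
SELECT s.emp_id, s.salary, b.bracket
FROM salaries s
JOIN tax_brackets b
  ON  s.salary >= b.lo
  AND s.salary <  b.hi;
```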

Worked example. Two tables, three customers, two orders; one customer has no order.

customer order_id
Alice 101
Bob 102

Carol (no orders) does not appear — INNER JOIN dropped her.

Step-by-step explanation.

  1. The engine reads customers (Alice, Bob, Carol) and orders (101 for Alice, 102 for Bob).
  2. For each customers row, it scans orders for a matching customer_id.
  3. Alice matches order_id = 101; Bob matches order_id = 102; Carol has no match.
  4. Carol's row is silently discarded because the join is INNER — no NULL-padded row is produced.
  5. The output has two rows because there were two matching pairs; in general the INNER JOIN result cardinality N satisfies 0 ≤ N ≤ |customers| × |orders|.

Worked-example solution.

SELECT c.name AS customer, o.order_id
FROM customers c
INNER JOIN orders o
  ON o.customer_id = c.id;

Rule of thumb: reach for INNER JOIN whenever the question is "rows where both sides exist"; it is the smallest, fastest, most common join.

LEFT JOIN — keep every left row, pad the right with NULLs (anti-join trick)

The LEFT JOIN invariant: every row from the left table appears in the output; if no right row matches, the right columns are NULL; LEFT JOIN ... WHERE right.key IS NULL keeps exactly the left rows that had no match — the anti-join idiom. RIGHT JOIN is the mirror; flip the table order and use LEFT for consistency.

  • LEFT JOIN — preserves every left row.
  • Right columns NULL when no match — the key signal for anti-joins.
  • LEFT JOIN ... IS NULL anti-join — "find rows in A with no match in B".
  • RIGHT JOIN — mirror image; rarely needed (just flip table order and use LEFT).

Worked example. Same customers + orders; Carol is preserved with NULL right-side columns.

customer order_id
Alice 101
Bob 102
Carol NULL

Step-by-step explanation.

  1. For each customers row, scan orders for a matching customer_id.
  2. Alice matches → row (Alice, 101); Bob matches → row (Bob, 102).
  3. Carol does not match → row (Carol, NULL) is produced because the join is LEFT.
  4. To find Carol via the anti-join: add WHERE o.order_id IS NULL after the LEFT JOIN; only Carol's row passes the filter.
  5. Equivalent to WHERE NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id) and (under NOT NULL constraints) WHERE c.id NOT IN (SELECT customer_id FROM orders) — but the anti-join is immune to the NOT IN NULL-swallowing bug.

Worked-example solution.

SELECT c.name AS customer
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id
WHERE o.order_id IS NULL;

Rule of thumb: "find X with no Y" → LEFT JOIN ... WHERE Y.id IS NULL. Memorize this; it is the most-asked join shape in SQL interviews and the cleanest fix for the NOT IN ... NULL trap.
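The NOT EXISTS equivalent mentioned above, sketched on the same tables — same result set, and many readers find the intent clearer:

```sql
-- Anti-join via NOT EXISTS: keep customers for whom no matching order exists.
SELECT c.name AS customer
FROM customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.id
);
```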

FULL OUTER, SELF, and CROSS joins — the rarer shapes

The rarer-joins invariant: FULL OUTER JOIN keeps every left row AND every right row (with NULL padding on the unmatched side); SELF JOIN joins a table to itself by aliasing it twice (employees-and-managers, parent-child, pair queries); CROSS JOIN produces every (left, right) combination — the Cartesian product. Each has a narrow but important use.

  • FULL OUTER JOIN — reconcile two sources; rows from either side without a match get padded.
  • SELF JOIN — employee/manager, hierarchical recursion (alternative to recursive CTE), pair queries.
  • CROSS JOIN — generate every combination (small tables only) or paired with LATERAL for top-N per row.
  • Implicit cross join — comma-separated tables (FROM a, b) without an ON is a CROSS JOIN — usually a bug.
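A minimal CROSS JOIN sketch with two hypothetical lookup tables, sizes and colors — the legitimate "generate every combination" use, plus the comma form that produces the same thing by accident:

```sql
-- Every (size, color) combination: |sizes| x |colors| output rows.
SELECT s.size, col.color
FROM sizes s
CROSS JOIN colors col;

-- Identical result, often written by accident: comma-separated FROM, no ON.
-- SELECT s.size, col.color FROM sizes s, colors col;
```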

Worked example. Self-join employees to itself to surface each person's manager.

name manager_name
Alice Carol
Bob Carol
Carol NULL (CEO)

Step-by-step explanation.

  1. Alias the same employees table twice: e (for employees) and m (for managers).
  2. LEFT JOIN on e.manager_id = m.emp_id looks each employee up against the manager rows.
  3. Alice's manager_id points to Carol → row (Alice, Carol).
  4. Bob's manager_id points to Carol → row (Bob, Carol).
  5. Carol is the CEO so her manager_id IS NULL → no match → row (Carol, NULL) because the join is LEFT.

Worked-example solution.

SELECT e.name,
       m.name AS manager_name
FROM employees e
LEFT JOIN employees m
  ON m.emp_id = e.manager_id;

Rule of thumb: SELF JOIN is one-level hierarchy; for arbitrary-depth recursion (org chart traversal, BOM tree), reach for WITH RECURSIVE instead.
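For the arbitrary-depth case, a hedged WITH RECURSIVE sketch on the same employees(emp_id, name, manager_id) table — walking down from the CEO while tracking depth:

```sql
-- Recursive CTE: the base case seeds the CEO, the recursive step joins each
-- pass's rows back to employees to pull in their direct reports.
WITH RECURSIVE org AS (
  SELECT emp_id, name, manager_id, 1 AS depth
  FROM employees
  WHERE manager_id IS NULL            -- base case: the CEO
  UNION ALL
  SELECT e.emp_id, e.name, e.manager_id, org.depth + 1
  FROM employees e
  JOIN org ON e.manager_id = org.emp_id
)
SELECT name, depth
FROM org
ORDER BY depth, name;
```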

Common beginner mistakes

  • Forgetting that a 1:N JOIN inflates the left side — SUM(left.col) returns left.col × N.
  • Filtering the right table inside WHERE after a LEFT JOIN (e.g., WHERE o.amount > 0) — silently turns the LEFT JOIN into an INNER JOIN because NULL > 0 is NULL.
  • Using NOT IN (subquery) when the subquery can contain NULL — returns zero rows because x NOT IN (..., NULL, ...) is NULL, which fails the predicate.
  • Comma-separated FROM a, b with no ON clause — produces a Cartesian product (CROSS JOIN); usually a bug.
  • Joining on the wrong column (o.id = c.id instead of o.customer_id = c.id) — produces nonsense rows.

PostgreSQL Interview Question on Customers With No Orders

Given customers(id, name) and orders(order_id, customer_id, amount), return the names of customers who have never placed an order.

Solution Using LEFT JOIN ... WHERE orders.order_id IS NULL

SELECT c.name
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id
WHERE o.order_id IS NULL
ORDER BY c.name;

Why this works: the LEFT JOIN preserves every customer row regardless of whether a matching order exists; for matched customers, o.order_id carries a real value; for unmatched customers, the right-side columns are NULL and the WHERE o.order_id IS NULL predicate is TRUE; the filter keeps only the unmatched customers — the anti-join. Single pass over customers; one keyed lookup into orders per customer; no subquery materialization needed.

Step-by-step trace for the sample input:

customers.id customers.name orders.order_id after LEFT JOIN IS NULL? survives?
1 Alice 101 no ✗
2 Bob 102 no ✗
3 Carol NULL yes ✓
4 Dan NULL yes ✓

Carol and Dan survive the filter.

Output:

name
Carol
Dan

Why this works — concept by concept:

  • LEFT JOIN semantics — keeps every left row; right side is NULL when there is no match. This NULL is the entire signal we filter on.
  • WHERE o.order_id IS NULL — o.order_id is the right-side primary key; it is NULL only when the join produced a synthetic unmatched row. A real NULL order_id from the source table never happens because primary keys are NOT NULL.
  • Anti-join semantics — equivalent to NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id); the LEFT JOIN ... IS NULL form is typically faster on planners that materialize a hash join.
  • No NULL-swallowing — unlike NOT IN, the predicate is IS NULL, which is well-defined for NULL values. There is no silent zero-row failure.
  • ORDER BY c.name — deterministic ordering for reviewer stability.
  • O(|customers| + |orders|) time — hash-join build on orders.customer_id, single probe per customer. With an index on orders.customer_id this is near-linear.

Drill the SQL joins practice page for INNER, LEFT, and anti-join shapes, and the SQL null-handling practice page for NULL-aware predicates.



3. PostgreSQL GROUP BY, HAVING, and Conditional Aggregates

GROUP BY with HAVING, FILTER, and CASE for one-pass metrics in PostgreSQL

"Compute total revenue, refunded revenue, and the percentage refunded — in a single query" is the signature conditional-aggregate prompt — and the cleanest PostgreSQL answer is SUM(...) FILTER (WHERE …) clauses inside a single SELECT. The mental model: GROUP BY col collapses rows into buckets; COUNT(*), SUM(...), AVG(...), MIN(...), MAX(...) summarize each bucket; WHERE filters individual rows before grouping; HAVING filters groups after grouping; FILTER (WHERE …) and CASE WHEN … express conditional aggregates that count or sum only certain rows per group. The duplicate-finder pattern GROUP BY key HAVING COUNT(*) > 1 lives here too.

[Diagram: two panels — WHERE filters individual rows before grouping; HAVING filters whole groups after grouping.]

Pro tip: PostgreSQL supports the SQL standard FILTER (WHERE …) clause on every aggregate — COUNT(*) FILTER (WHERE status = 'refunded'). It produces clearer queries than SUM(CASE WHEN … THEN 1 ELSE 0 END) and is exactly what interviewers like to see. The CASE variant still works for portability across dialects.

COUNT, SUM, AVG, MIN, MAX — NULL-aware aggregates

The aggregate-NULL invariant: COUNT(*) counts every row including ones with NULL columns; COUNT(col) counts only rows where col is not NULL; SUM, AVG, MIN, MAX skip NULL values entirely; if every value in a group is NULL, the result is NULL (not 0). The distinction between COUNT(*) and COUNT(col) is the #1 source of "my counts are off by 10%" bugs.

  • COUNT(*) — every row in the bucket, regardless of NULLs.
  • COUNT(col) — non-NULL values of col only.
  • COUNT(DISTINCT col) — unique non-NULL values; essential after a JOIN that may have inflated rows.
  • SUM / AVG — numeric only; AVG is sum-of-non-null-divided-by-count-of-non-null, so NULL does not count as 0.

Worked example. Three rows in one customer's bucket: amount = 10, NULL, 30.

aggregate result
COUNT(*) 3
COUNT(amount) 2
SUM(amount) 40
AVG(amount) 20
MIN(amount) 10
MAX(amount) 30

Step-by-step explanation.

  1. COUNT(*) = 3 because every row in the bucket counts, regardless of amount's value.
  2. COUNT(amount) = 2 because the NULL row is skipped; only 10 and 30 contribute.
  3. SUM(amount) = 10 + 30 = 40; the NULL is treated as missing, not as 0.
  4. AVG(amount) = (10 + 30) / 2 = 20; the denominator is COUNT(amount) = 2, not COUNT(*) = 3.
  5. MIN and MAX skip the NULL and return the smallest/largest non-NULL value.

Worked-example solution.

SELECT customer_id,
       COUNT(*)        AS n_rows,
       COUNT(amount)   AS n_known,
       SUM(amount)     AS total,
       AVG(amount)     AS mean,
       MIN(amount)     AS lo,
       MAX(amount)     AS hi
FROM orders
GROUP BY customer_id;

Rule of thumb: if the metric is "people who clicked" use COUNT(DISTINCT user_id); if it is "click events" use COUNT(*); if it is "rows with a known value" use COUNT(col).
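The COUNT(DISTINCT ...) bullet above is the one that saves you after a grain-inflating join; a sketch assuming a hypothetical order_items child table:

```sql
-- After a 1:N join to order_items, each order appears once per item, so
-- COUNT(o.order_id) is inflated; COUNT(DISTINCT o.order_id) restores the
-- true per-customer order count.
SELECT o.customer_id,
       COUNT(o.order_id)          AS order_rows_after_join,  -- inflated
       COUNT(DISTINCT o.order_id) AS true_order_count
FROM orders o
JOIN order_items i
  ON i.order_id = o.order_id
GROUP BY o.customer_id;
```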

WHERE vs HAVING — row filter vs group filter

The two-clause invariant: WHERE runs before GROUP BY and references raw row columns; HAVING runs after grouping and can reference aggregate functions; trying to use WHERE COUNT(*) > 1 is a parse error because aggregates do not exist until after grouping. Both can appear in the same query.

  • WHERE — filter rows; uses col, col2, etc.
  • HAVING — filter groups; uses COUNT(*), SUM(col), etc.
  • Order of evaluation — FROM → WHERE → GROUP BY → HAVING → SELECT.
  • Performance — push predicates into WHERE whenever possible; WHERE filters before the (often expensive) sort/hash step.

Worked example. Six employees across eng and sales; find departments whose average salary exceeds 50,000 across employees earning more than 30,000.

department salary
eng 40,000
eng 70,000
eng 25,000
sales 60,000
sales 60,000
sales 20,000

Step-by-step explanation.

  1. WHERE salary > 30000 drops the two rows below the threshold (eng 25,000 and sales 20,000) — 4 rows remain.
  2. GROUP BY department collapses to two buckets: eng (40,000 + 70,000) and sales (60,000 + 60,000).
  3. The planner computes AVG(salary) per bucket: eng = 55,000; sales = 60,000.
  4. HAVING AVG(salary) > 50000 keeps both buckets (both averages exceed 50,000).
  5. The SELECT projects the department name and its average; final result is two rows.

Worked-example solution.

SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE salary > 30000
GROUP BY department
HAVING AVG(salary) > 50000;

Rule of thumb: aggregate predicate → HAVING; row predicate → WHERE. If the predicate uses SUM / COUNT / AVG / MIN / MAX, it must live in HAVING.

FILTER (WHERE …) and CASE — conditional aggregates

The conditional-aggregate invariant: SUM(col) FILTER (WHERE pred) and COUNT(*) FILTER (WHERE pred) apply the aggregate only to rows where the predicate is TRUE; the portable alternative is SUM(CASE WHEN pred THEN col ELSE 0 END) and COUNT(CASE WHEN pred THEN 1 END). PostgreSQL supports both; pick FILTER for clarity in PostgreSQL-only code, CASE for cross-dialect portability.

  • FILTER (WHERE …) — PostgreSQL/SQL-standard syntax; applies per-aggregate.
  • SUM(CASE WHEN … THEN col ELSE 0 END) — portable across dialects.
  • COUNT(CASE WHEN … THEN 1 END) — counts only matching rows; NULLs in the ELSE branch are skipped.
  • Multiple aggregates, one query — combine many FILTER clauses to compute several metrics in one pass.

Worked example. One pass over orders to compute total revenue, refunded revenue, and the refund rate.

customer_id total_revenue refunded_revenue refund_pct
101 500 50 10.0
102 1,000 0 0.0
103 800 200 25.0

Step-by-step explanation.

  1. SUM(amount) aggregates every row in the bucket → total_revenue.
  2. SUM(amount) FILTER (WHERE status = 'refunded') aggregates only refunded rows → refunded_revenue.
  3. The refund percentage is refunded_revenue / total_revenue * 100; cast one side to NUMERIC to avoid integer division.
  4. PostgreSQL evaluates every FILTER independently per row of input; one scan computes all metrics.
  5. The portable variant uses SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) — same result, slightly more verbose.

Worked-example solution.

SELECT customer_id,
       SUM(amount)                              AS total_revenue,
       SUM(amount) FILTER (WHERE status = 'refunded') AS refunded_revenue,
       ROUND(
         SUM(amount) FILTER (WHERE status = 'refunded')::NUMERIC
         / NULLIF(SUM(amount), 0) * 100, 1
       ) AS refund_pct
FROM orders
GROUP BY customer_id;

Rule of thumb: whenever you find yourself running two queries with different WHERE clauses against the same table and joining the results, refactor to a single query with two FILTER clauses — same answer, half the cost.
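For completeness, the same refund metrics in the portable CASE form mentioned above (the ::NUMERIC cast stays PostgreSQL-flavored; swap in CAST(... AS DECIMAL) elsewhere):

```sql
-- Portable variant of the FILTER query: CASE routes non-refunded rows to 0,
-- so the conditional sum still scans every row exactly once.
SELECT customer_id,
       SUM(amount) AS total_revenue,
       SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) AS refunded_revenue,
       ROUND(
         SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END)::NUMERIC
         / NULLIF(SUM(amount), 0) * 100, 1
       ) AS refund_pct
FROM orders
GROUP BY customer_id;
```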

Common beginner mistakes

  • WHERE COUNT(*) > 1 — parse error; aggregates do not exist until after GROUP BY. Use HAVING.
  • AVG(col) and assuming NULL rows count as 0 — they are excluded from both numerator and denominator. Use AVG(COALESCE(col, 0)) only if "missing means 0" is the business rule.
  • COUNT(DISTINCT col) forgotten after a JOIN that inflates rows — reports inflated counts.
  • Integer division — 5 / 100 = 0 in PostgreSQL. Cast one operand to NUMERIC or FLOAT.
  • Division by zero — NULLIF(denom, 0) converts 0 to NULL, so the division returns NULL instead of erroring.
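The last two mistakes fit in one throwaway SELECT:

```sql
SELECT 5 / 100          AS int_div,   -- 0: integer / integer truncates
       5 / 100.0        AS num_div,   -- 0.05: one NUMERIC operand fixes it
       5 / NULLIF(0, 0) AS safe_div;  -- NULL: NULLIF turns 0 into NULL, no error
```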

PostgreSQL Interview Question on Duplicate Emails

Given users(id, email), return every email that appears more than once, along with the number of copies.

Solution Using GROUP BY email HAVING COUNT(*) > 1

SELECT email,
       COUNT(*) AS n_copies
FROM users
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY n_copies DESC, email;

Why this works: GROUP BY email collapses every row with the same email into a single bucket; COUNT(*) counts how many rows fell into each bucket; HAVING COUNT(*) > 1 keeps only buckets with at least two rows; ORDER BY n_copies DESC, email produces a deterministic, reviewer-friendly output. Single pass over users; sort cost dominates only when email cardinality is huge.

Step-by-step trace for the sample input:

  1. FROM users — read all six rows.
  2. No WHERE — every row passes.
  3. GROUP BY email — three buckets: alice (3 rows), bob (2 rows), carol (1 row).
  4. COUNT(*) — 3, 2, 1 respectively.
  5. HAVING COUNT(*) > 1 — drops the carol bucket.
  6. ORDER BY n_copies DESC, email — alice (3), then bob (2).

Output:

| email | n_copies |
| --- | --- |
| alice@example.com | 3 |
| bob@example.com | 2 |

Why this works — concept by concept:

  • GROUP BY email — collapses to one bucket per distinct email; the bucket is the unit of all subsequent aggregates and group-level filters.
  • COUNT(*) — counts every row in the bucket, perfect for "how many copies".
  • HAVING COUNT(*) > 1 — group-level filter; the aggregate predicate must live here, not in WHERE. This is the precise interview signal for duplicate detection.
  • ORDER BY n_copies DESC, email — deterministic ordering; tie-broken by email so the output is stable across runs.
  • O(|users| + G log G) time — single hash aggregation produces G group rows; the final sort is G log G. With an index on email, the planner may use stream aggregation and skip the hash step.

Drill the SQL aggregation practice page for GROUP BY and HAVING shapes, and the SQL filtering practice page for WHERE vs HAVING distinctions.



4. PostgreSQL Window Functions — ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD

Ranking, top-N-per-group, running totals, and lookback in PostgreSQL window functions

"Find the second-highest distinct salary" and "compute a running total of daily revenue" are the two signature window-function prompts — and both reduce to a window function with OVER (PARTITION BY … ORDER BY …). The mental model: a window function computes a value across a window of rows related to the current row without collapsing the rows like GROUP BY does; OVER (PARTITION BY col) defines the window boundary; OVER (ORDER BY col) defines the order within the window. ROW_NUMBER assigns unique integers; RANK skips after ties (1, 2, 2, 4); DENSE_RANK does not skip (1, 2, 2, 3); LAG looks back; LEAD looks forward; SUM/AVG/COUNT(...) OVER (...) compute running totals and moving averages.

Three-column comparison of PostgreSQL window functions on a salary ladder with tied rows: ROW_NUMBER yields unique 1-2-3-4, RANK yields 1-2-2-4 with a +2 skip, DENSE_RANK yields 1-2-2-3 with no gap; a caption explains DENSE_RANK equals N for the Nth distinct value, plus an inset showing a running total via SUM(amount) OVER (ORDER BY date) on a small sales table.

Pro tip: Window functions cannot be referenced in WHERE of the same SELECT because they execute after WHERE. Wrap the window in a CTE or subquery, then filter on the alias. The error column "rn" does not exist after writing WHERE rn = 1 almost always means you forgot this rule.
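A minimal sketch of the wrap-then-filter pattern (the `events` table and `created_at` column are illustrative):

```sql
-- Fails: rn is computed after WHERE, so WHERE cannot see it.
-- SELECT *, ROW_NUMBER() OVER (ORDER BY created_at) AS rn
-- FROM events
-- WHERE rn = 1;

-- Works: compute the window in a CTE, then filter on the alias.
WITH numbered AS (
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY created_at) AS rn
    FROM events
)
SELECT *
FROM numbered
WHERE rn = 1;
```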

ROW_NUMBER — unique sequential numbering per partition

The ROW_NUMBER invariant: ROW_NUMBER() OVER (PARTITION BY p ORDER BY o) assigns a unique integer 1, 2, 3, … to every row inside each partition p, ordered by o; ties in o are broken arbitrarily by the planner. Use it when you need a unique sequence per group regardless of tie semantics — most often for deduplication (keep rn = 1 per business key).

  • OVER (PARTITION BY …) — bucket the rows; without this, the window is the whole result set.
  • OVER (ORDER BY …) — order within the bucket; required for ROW_NUMBER/RANK/LAG/LEAD.
  • Ties broken arbitrarily — add a tiebreaker column to ORDER BY for determinism.
  • Top-N-per-group — WHERE rn <= N after ROW_NUMBER; works only when ties at rank N can be ignored.

Worked example. employees with three engineers; rank by salary descending.

| department | name | salary | row_number |
| --- | --- | --- | --- |
| eng | Alice | 90,000 | 1 |
| eng | Bob | 80,000 | 2 |
| eng | Carol | 80,000 | 3 |

Bob and Carol tie on salary; ROW_NUMBER still gives them unique ranks (planner-chosen unless you add a tiebreaker).

Step-by-step explanation.

  1. PARTITION BY department defines the boundary — only eng rows are compared with each other; if there were a sales partition it would have its own 1, 2, 3 sequence.
  2. ORDER BY salary DESC, name orders rows within the partition: Alice (90,000) first, then Bob and Carol (tied at 80,000) broken by name.
  3. ROW_NUMBER() assigns 1, 2, 3 sequentially regardless of ties; Bob gets 2 and Carol gets 3 because name breaks the tie.
  4. Without the , name tiebreaker, Bob/Carol order is undefined — two query runs could swap them.
  5. To deduplicate a table that has multiple rows per (business_key, source_ts), use ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY source_ts DESC) = 1 to keep the latest.

Worked-example solution.

SELECT department, name, salary,
       ROW_NUMBER() OVER (
         PARTITION BY department
         ORDER BY salary DESC, name
       ) AS rn
FROM employees;

Rule of thumb: ROW_NUMBER is the right tool for deduplication (WHERE rn = 1) and for ordered streams; reach for RANK or DENSE_RANK when ties must be honored.
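The deduplication pattern described in step 5 looks like this in full (the `raw_customers` table and its `business_key` / `source_ts` columns are illustrative):

```sql
-- Keep only the newest row per business key.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY business_key
               ORDER BY source_ts DESC
           ) AS rn
    FROM raw_customers
)
SELECT *
FROM ranked
WHERE rn = 1;   -- rn = 1 is the latest row in each partition
```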

RANK vs DENSE_RANK — tie semantics

The rank-vs-dense-rank invariant: both assign the same rank to tied rows; RANK then skips the next k-1 ranks (gap), while DENSE_RANK continues without a gap. For "find the Nth distinct value" questions, DENSE_RANK = N is the correct filter; for "find the Nth row in skip-aware ranking order", RANK = N is correct.

  • RANK → 1, 2, 2, 4 — skips after ties.
  • DENSE_RANK → 1, 2, 2, 3 — no skip.
  • ROW_NUMBER → 1, 2, 3, 4 — never ties.
  • Pick by semantics — "Nth highest distinct salary" → DENSE_RANK = N; "Nth-ranked row in skip ordering" → RANK = N.

Worked example. Four employees; Bob and Carol tied at second-highest salary.

| name | salary | rank | dense_rank | row_number |
| --- | --- | --- | --- | --- |
| Alice | 90,000 | 1 | 1 | 1 |
| Bob | 80,000 | 2 | 2 | 2 |
| Carol | 80,000 | 2 | 2 | 3 |
| Dan | 70,000 | 4 | 3 | 4 |

RANK jumps 2 → 4 (skipping 3); DENSE_RANK continues 2 → 3 (no skip).

Step-by-step explanation.

  1. All three window functions agree on Alice (rank 1) because she is alone at the top.
  2. Bob and Carol both get rank = 2 and dense_rank = 2 because they tie on salary; row_number gives them distinct values 2 and 3.
  3. Dan is the next-lowest salary; RANK skips ahead by the number of tied rows (2 tied → next rank is 2 + 2 = 4); DENSE_RANK continues with no gap (3).
  4. For "second highest distinct salary", DENSE_RANK = 2 correctly returns 80,000; RANK = 2 would also work here, but RANK would not return 80,000 if three people tied for first (it would skip to 4).
  5. For "top 3 distinct salaries", use DENSE_RANK <= 3 — it returns Alice, Bob, Carol, Dan (four rows because Bob/Carol both have dr = 2).

Worked-example solution.

SELECT name, salary,
       RANK()       OVER (ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dr,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn
FROM employees;

Rule of thumb: "second highest salary" → DENSE_RANK = 2; "top 3 distinct salaries" → DENSE_RANK <= 3; never use RANK for these unless the spec explicitly says ties should consume rank slots.
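For completeness, the canonical "second-highest distinct salary" query (assuming the `employees(name, salary)` table from the example above):

```sql
-- DENSE_RANK = 2 is the second-highest distinct salary,
-- regardless of how many people tie for first.
WITH ranked AS (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dr
    FROM employees
)
SELECT DISTINCT salary
FROM ranked
WHERE dr = 2;
```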

LAG, LEAD, and running totals — lookback, lookahead, and SUM(...) OVER (...)

The lookback-and-running invariant: LAG(col, n) returns the value of col n rows back within the partition (default n=1); LEAD(col, n) is the symmetric forward; SUM(col) OVER (PARTITION BY p ORDER BY o) produces a running total within each partition. These three primitives drive month-over-month deltas, sessionization, running balances, and moving averages.

  • LAG(amount) OVER (ORDER BY date) — previous day's amount.
  • LEAD(amount) OVER (ORDER BY date) — next day's amount.
  • amount - LAG(amount) OVER (ORDER BY date) — day-over-day delta.
  • SUM(amount) OVER (ORDER BY date) — running total from start of partition through current row.

Worked example. Three days of sales; compute previous-day amount, day-over-day delta, and running total.

| sales_date | amount | prev_amount | dod_delta | running_total |
| --- | --- | --- | --- | --- |
| 2026-05-09 | 100 | NULL | NULL | 100 |
| 2026-05-10 | 130 | 100 | 30 | 230 |
| 2026-05-11 | 120 | 130 | -10 | 350 |

The first day has LAG = NULL because no prior row exists; consumers usually COALESCE(delta, 0) for display.

Step-by-step explanation.

  1. LAG(amount) OVER (ORDER BY sales_date) returns the previous row's amount, ordered by date.
  2. Day 1 (May 9): no previous row, so LAG = NULL; amount - LAG = NULL.
  3. Day 2 (May 10): LAG = 100; delta = 130 - 100 = 30.
  4. Day 3 (May 11): LAG = 130; delta = 120 - 130 = -10.
  5. SUM(amount) OVER (ORDER BY sales_date) accumulates from the start of the partition through the current row: 100, 230, 350.

Worked-example solution.

SELECT sales_date,
       amount,
       LAG(amount) OVER (ORDER BY sales_date)            AS prev_amount,
       amount - LAG(amount) OVER (ORDER BY sales_date)   AS dod_delta,
       SUM(amount) OVER (ORDER BY sales_date)            AS running_total
FROM sales
ORDER BY sales_date;

Rule of thumb: LAG for "compare this row to its predecessor" (delta, retention, gap); LEAD for "what happens next" (sessionization, churn-from-here); SUM(...) OVER (...) for running totals — always PARTITION BY the entity if the table holds multiple series.
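When the table holds one series per entity, the partitioned variant looks like this (the `orders(customer_id, order_date, amount)` schema is illustrative):

```sql
-- Per-customer running total; PARTITION BY keeps each
-- customer's series separate so totals never bleed across entities.
SELECT customer_id,
       order_date,
       amount,
       SUM(amount) OVER (
           PARTITION BY customer_id
           ORDER BY order_date
       ) AS running_total
FROM orders
ORDER BY customer_id, order_date;
```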

Common beginner mistakes

  • Using RANK when the question wants the Nth distinct value — RANK = 2 skips entirely if two rows tie for first.
  • Forgetting PARTITION BY for a per-group ranking — produces a global ranking instead of per-department.
  • Referencing the window-function alias in WHERE of the same SELECT — window functions execute after WHERE; wrap in a CTE or subquery first.
  • Confusing LAG (previous) with LEAD (next) — quietly produces inverted deltas.
  • Forgetting ORDER BY inside OVER for ROW_NUMBER/RANK/LAG/LEAD — required; the result is non-deterministic without it.

PostgreSQL Interview Question on Top 3 Salaries Per Department

Given employees(emp_id, name, department, salary), return the top 3 distinct salaries per department, with ties at rank 3 included. Output department, name, salary, and the rank.

Solution Using DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) in a CTE

WITH ranked AS (
    SELECT department,
           name,
           salary,
           DENSE_RANK() OVER (
               PARTITION BY department
               ORDER BY salary DESC
           ) AS dr
    FROM employees
)
SELECT department, name, salary, dr
FROM ranked
WHERE dr <= 3
ORDER BY department, dr, name;

Why this works: the CTE ranked materializes a per-department DENSE_RANK keyed by salary descending — dr = 1 is the highest distinct salary in that department, dr = 2 is the second-highest, and so on; the outer WHERE dr <= 3 keeps every row whose salary is in the top three distinct salaries of its department, including all ties at rank 3; the ORDER BY produces a deterministic, reviewer-friendly output. DENSE_RANK over RANK because the spec wants the top three distinct salaries; DENSE_RANK over ROW_NUMBER because ties at rank 3 must be retained.

Step-by-step trace for the sample input:

| emp_id | name | department | salary |
| --- | --- | --- | --- |
| 1 | Alice | eng | 90,000 |
| 2 | Bob | eng | 80,000 |
| 3 | Carol | eng | 80,000 |
| 4 | Dan | eng | 70,000 |
| 5 | Eve | eng | 60,000 |
| 6 | Frank | sales | 100,000 |
| 7 | Grace | sales | 90,000 |
| 8 | Heidi | sales | 80,000 |

  1. CTE ranked — partition by department; order by salary DESC.
  2. DENSE_RANK per partition — eng: Alice → 1, Bob → 2, Carol → 2, Dan → 3, Eve → 4. sales: Frank → 1, Grace → 2, Heidi → 3.
  3. Outer WHERE dr <= 3 — drops Eve (dr = 4); keeps both Bob and Carol (tied at 2) and Dan (3).
  4. ORDER BY department, dr, name — eng rows first, then sales; within department by dr, then name for tiebreak.

Output:

| department | name | salary | dr |
| --- | --- | --- | --- |
| eng | Alice | 90000 | 1 |
| eng | Bob | 80000 | 2 |
| eng | Carol | 80000 | 2 |
| eng | Dan | 70000 | 3 |
| sales | Frank | 100000 | 1 |
| sales | Grace | 90000 | 2 |
| sales | Heidi | 80000 | 3 |

Why this works — concept by concept:

  • CTE ranked — names the intermediate ranked result; the outer query then filters it like a regular table. Far cleaner than a nested subquery.
  • PARTITION BY department — restarts the rank at each department boundary; without this, the rank is global and the answer is wrong.
  • ORDER BY salary DESC — defines "highest first" inside each partition; required for any deterministic ranking.
  • DENSE_RANK over RANK — the spec wants the top three distinct salaries; RANK would skip after ties and miss the third distinct salary if there is a two-way tie above it.
  • WHERE dr <= 3 in the outer — window functions cannot be referenced in WHERE of the same SELECT; the CTE provides the materialized column the outer can filter on.
  • O(N log N) time — sort within each partition dominates; with an index on (department, salary DESC) the planner can stream rather than sort.

More SQL window-function practice problems and SQL CTE practice problems on PipeCode.



Tips to use this PostgreSQL cheat sheet effectively

Hold the clause-order diagram in your head

FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. Memorize this sentence and 80% of "weird" PostgreSQL parse errors decode themselves in five seconds. The error column "x" does not exist almost always means you referenced a SELECT alias in WHERE; the error aggregate functions are not allowed in WHERE means you wanted HAVING instead.
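A minimal illustration of the alias trap (the `orders` schema is hypothetical):

```sql
-- Fails: total is a SELECT alias, and WHERE runs before SELECT.
-- SELECT amount * quantity AS total FROM orders WHERE total > 100;

-- Works: repeat the expression in WHERE, or wrap in a subquery/CTE.
SELECT amount * quantity AS total
FROM orders
WHERE amount * quantity > 100;
```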

State the grain before any JOIN

Before writing the JOIN, name the grain you're producing: "this is order-line grain", "this is customer-day grain", "this is (customer, product) grain". The single most common bug in analytical SQL is SUM(left.col) after a 1:N join — the metric is silently multiplied by N. If grain doubles, you'll spot it immediately.
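One way to defuse the 1:N multiplication is to pre-aggregate the N side back to the left table's grain before joining — a sketch assuming hypothetical `orders(order_id, customer_id, total)` and `line_items(order_id, ...)` tables:

```sql
-- Joining orders directly to line_items would repeat o.total once per
-- line item, silently inflating SUM(o.total). Collapse line_items to
-- order grain first, then join 1:1.
SELECT o.customer_id,
       SUM(o.total)                 AS revenue,       -- safe: one row per order
       SUM(COALESCE(li.n_items, 0)) AS items_bought
FROM orders o
LEFT JOIN (
    SELECT order_id, COUNT(*) AS n_items
    FROM line_items
    GROUP BY order_id
) li ON li.order_id = o.order_id
GROUP BY o.customer_id;
```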

Use LEFT JOIN ... IS NULL over NOT IN for anti-joins

NOT IN (subquery) returns zero rows when the subquery contains a single NULL because x NOT IN (..., NULL, ...) is NULL, which fails the WHERE predicate. LEFT JOIN ... WHERE right.id IS NULL and NOT EXISTS (...) are immune. Production engineers who have been burned once never write NOT IN again.
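Side by side, with generic tables `a` and `b`:

```sql
-- Dangerous: returns zero rows if any b.fk is NULL,
-- because x NOT IN (..., NULL, ...) evaluates to NULL.
-- SELECT * FROM a WHERE a.pk NOT IN (SELECT fk FROM b);

-- Safe anti-join, variant 1: LEFT JOIN ... IS NULL.
SELECT a.*
FROM a
LEFT JOIN b ON b.fk = a.pk
WHERE b.pk IS NULL;

-- Safe anti-join, variant 2: NOT EXISTS.
SELECT a.*
FROM a
WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.fk = a.pk);
```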

Pick DENSE_RANK for "Nth distinct"; pick ROW_NUMBER for deduplication

The single most-graded ranking distinction: DENSE_RANK = N is the Nth distinct value; RANK = N is the Nth row in skip-aware ranking order; ROW_NUMBER = N is the Nth row in arbitrary order. For "second-highest distinct salary" → DENSE_RANK = 2. For "remove duplicate rows keeping the canonical one" → ROW_NUMBER() OVER (PARTITION BY key ORDER BY tiebreaker) = 1.

Use FILTER (WHERE …) for one-pass conditional metrics

SUM(amount) FILTER (WHERE status = 'refunded') is cleaner than SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) — PostgreSQL supports both. Use FILTER in PostgreSQL-only code, CASE for cross-dialect portability. One scan, many metrics, half the cost of two queries.

Always ORDER BY + tiebreaker; pair LIMIT with ORDER BY

Window functions, LIMIT N, and "top result" queries all require an ORDER BY with a deterministic tiebreaker (e.g., ORDER BY salary DESC, name). Without one, two runs of the same query can return different rows in the tie band — silently wrong in production and visibly wrong in an interview if the reviewer's reference answer locks an ordering.

Use PostgreSQL-specific helpers — EXTRACT, DATE_TRUNC, INTERVAL, ::DATE cast

EXTRACT(MONTH FROM ts), DATE_TRUNC('month', ts), ts - INTERVAL '1 month', ts::DATE. These four cover 95% of date arithmetic. Reach for DATE_TRUNC whenever the spec says "by month" or "by week" — it groups timestamps to the bucket boundary deterministically.
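A typical "by month" shape combining these helpers (the `orders(order_ts, amount)` schema is illustrative):

```sql
-- Monthly revenue for the last six full-ish months:
-- DATE_TRUNC snaps every timestamp to its month boundary,
-- so GROUP BY buckets deterministically.
SELECT DATE_TRUNC('month', order_ts) AS month,
       SUM(amount)                   AS revenue
FROM orders
WHERE order_ts >= DATE_TRUNC('month', NOW()) - INTERVAL '6 months'
GROUP BY 1
ORDER BY 1;
```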

Where to practice on PipeCode

Start with the SQL practice surface for the all-language SQL corpus. Drill the four-primitive pages: SQL filtering for WHERE patterns, SQL joins for join shapes, SQL aggregation for GROUP BY + HAVING, SQL window functions for ranking and lookback. Add adjacent topics: SQL CTE, SQL subqueries, SQL null-handling, SQL date functions. The interview courses page bundles structured curricula — start with SQL for Data Engineering Interviews — From Zero to FAANG. For broader coverage, browse by topic or read the related SQL interview questions for data engineering and data lake architecture for data engineering interviews blogs.


Frequently Asked Questions

What is the logical clause order in a PostgreSQL query?

PostgreSQL evaluates clauses in the order FROM / JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT / OFFSET, regardless of the order you write them. This is why WHERE cannot reference aggregate functions (they don't exist until after GROUP BY) and why SELECT-level aliases cannot be referenced in WHERE (they're computed in stage 5). Aliases become available only in ORDER BY and the outer query in a nested context.

What is the difference between WHERE and HAVING in PostgreSQL?

WHERE filters individual rows before the GROUP BY step and can reference only raw row columns. HAVING filters whole groups after the GROUP BY step and can reference aggregate functions like COUNT(*), SUM(col), AVG(col). Trying to use an aggregate in WHERE (e.g., WHERE COUNT(*) > 1) is a parse error because the aggregate does not yet exist. Both clauses can appear in the same query.

How do I find rows in table A that have no match in table B?

The canonical PostgreSQL pattern is SELECT a.* FROM a LEFT JOIN b ON b.fk = a.pk WHERE b.pk IS NULL — the LEFT JOIN preserves every left row, and the WHERE b.pk IS NULL filter keeps only the ones where no right-side match was found. This is the anti-join pattern. Equivalent to WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.fk = a.pk). Both are safer than NOT IN (subquery), which returns zero rows if the subquery contains a single NULL.

What is the difference between RANK, DENSE_RANK, and ROW_NUMBER?

All three assign integers within a window. ROW_NUMBER gives every row a unique sequential integer (1, 2, 3, 4), even on ties. RANK gives tied rows the same rank but skips after them (1, 2, 2, 4). DENSE_RANK gives tied rows the same rank with no skip (1, 2, 2, 3). For "Nth distinct value" use DENSE_RANK = N; for "Nth row in skip-aware ranking order" use RANK = N; for "Nth row in arbitrary order" or "deduplicate keeping one canonical row" use ROW_NUMBER = 1.

What does FILTER (WHERE …) do in PostgreSQL aggregates?

SUM(col) FILTER (WHERE pred) and COUNT(*) FILTER (WHERE pred) apply the aggregate only to rows where the predicate is TRUE; rows where the predicate is FALSE or NULL are skipped for that aggregate, while other aggregates in the same SELECT still see them. The portable cross-dialect equivalent is SUM(CASE WHEN pred THEN col ELSE 0 END) and COUNT(CASE WHEN pred THEN 1 END). Use FILTER for clarity in PostgreSQL-only code.

How do I compute a running total in PostgreSQL?

Use SUM(col) OVER (PARTITION BY p ORDER BY o) — the window aggregate accumulates from the start of each partition through the current row in the order defined by ORDER BY. Example: SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) gives a per-customer running total of order amounts ordered by date. Drop PARTITION BY for a single global running total.

Why is LIMIT 5 returning different rows on different runs?

LIMIT without ORDER BY is non-deterministic — PostgreSQL returns whatever rows it sees first, which depends on the query plan, parallelism, and table physical layout. Always pair LIMIT with ORDER BY <col> DESC, <tiebreaker> so two runs return the same rows. Reviewers depend on stable ordering, and dashboards break silently when row order drifts.


Start practicing PostgreSQL SQL problems
