A 2015 research paper tested every tip against real data. Most developers have never seen it. The numbers are hard to ignore.
You wrote a query. It returned the right data. You moved on.
That’s the whole story for most developers. The query works, the feature ships, and nobody looks back. Until a senior engineer pulls up the execution plan in a prod review and asks why you’re doing a full table scan on 2 million rows to return twelve records.
That moment has happened to more engineers than will ever admit it publicly.
Here’s the thing: most SQL slowness isn’t mysterious. It’s not a missing index, a misconfigured database, or a vendor problem. It’s patterns: habits that looked fine when the table had 500 rows and became quietly catastrophic when it hit 5 million.
In 2015, a researcher named Jean Habimana published a paper through IJSTR titled “Query Optimization Techniques: Tips For Writing Efficient And Faster SQL Queries.” Five pages. Tested against Oracle’s sample Sales database. Every tip benchmarked with actual time reductions. It’s been sitting in academic obscurity ever since, which is a shame, because some of these changes take thirty seconds to make and show query time reductions above 80%.
This article is that paper, translated into something you’ll actually read.
TL;DR: 10 SQL habits ranked by impact. Time reductions range from 11% to 85%. Most fixes take under five minutes. A few will make you retroactively embarrassed about queries you shipped last year.
Why your query isn’t just “running”
Most SQL tutorials skip this part entirely. They go straight to syntax, joins, and GROUP BY, and leave you with a model of SQL that looks roughly like: you write it, the database runs it, data comes back. Clean. Simple. Wrong.
When you submit a query, it doesn’t execute directly. It goes through a query optimizer first: a component that reads your SQL, estimates multiple ways to fetch the result, and picks the one it thinks is cheapest. Cheapest meaning the least I/O, the least memory, the least CPU. The optimizer is making decisions you never see, and those decisions are heavily influenced by how you wrote the query.
Think of it like a GPS. You give it a destination and it figures out the route. But if you give it bad inputs (a vague address, a restricted road it doesn’t know about), it picks a worse path. The database optimizer works the same way. Write the query clearly and it finds the fast route. Write it carelessly and it takes the scenic route through every row in your table.
The execution path looks roughly like this:
Your SQL
↓
Query Parser — checks syntax
↓
Query Optimizer — estimates cheapest execution path
↓
Execution Plan — the actual instruction set
↓
Data retrieval → Result
Every tip in this article targets one of two things: either helping the optimizer make a better decision, or removing work it shouldn’t have to do in the first place. That’s the whole framework. Keep it in mind as you read the rest.
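If you’ve never actually looked at an execution plan, it’s one command away. Here’s a minimal sketch in PostgreSQL syntax (the sales table and the index name are hypothetical, standing in for the examples later in this article):
-- Ask the optimizer for its chosen plan without running the query
EXPLAIN SELECT * FROM sales WHERE cust_id = 100996;
-- If an index on cust_id exists and gets used, the plan starts like:
--   Index Scan using sales_cust_id_idx on sales ...
-- If the optimizer decides to read every row instead:
--   Seq Scan on sales ...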
The lazy habits costing you the most
Let’s start with the number that stopped me mid-scroll when I first read this paper: 85% query time reduction. Not from adding an index. Not from upgrading hardware. From removing one word. We’ll get there in a moment; first, the two habits you’re most likely committing right now.
Tip 1: SELECT columns, not SELECT * (27% reduction)
The most universal bad habit in SQL. When you write SELECT *, the database fetches every column and ships it across the network. If you need two columns from a table with twenty, you're paying for eighteen you'll never use. If any of those columns are large types (TEXT, BLOB, JSON), you're paying a lot.
-- Slow
SELECT * FROM sales;
-- Fast
SELECT prod_id, cust_id FROM sales;
It feels pedantic until your table has 40 columns and half of them are storing documents.
Tip 2: WHERE before GROUP BY, not HAVING (31% reduction)
HAVING filters rows after grouping. WHERE filters rows before grouping. If your condition doesn't involve an aggregate function, it has no business being in HAVING. Putting it there means the database groups every row first, then throws away the ones you didn't want: work it never needed to do.
-- Slow: groups everything, then filters
SELECT cust_id, COUNT(cust_id)
FROM sales
GROUP BY cust_id
HAVING cust_id != '1660';
-- Fast: filters first, groups less
SELECT cust_id, COUNT(cust_id)
FROM sales
WHERE cust_id != '1660'
GROUP BY cust_id;
Tip 3: Drop unnecessary DISTINCT (85% reduction)
Here it is. The biggest number in the paper.
When you write SELECT DISTINCT on a join where the primary key is already in the result, duplicates are mathematically impossible. There are no duplicates to remove. But the database doesn't know that, so it sorts the entire result set and checks anyway. You're paying full deduplication cost for zero benefit.
-- Slow: DISTINCT is doing nothing here (primary key present)
SELECT DISTINCT * FROM sales s
JOIN customers c ON s.cust_id = c.cust_id;
-- Fast: just don't
SELECT * FROM sales s
JOIN customers c ON s.cust_id = c.cust_id;
85% faster. One word removed.
Tip 4: Un-nest your subqueries (61% reduction)
Correlated subqueries are the silent performance killer most junior devs don’t recognize until it’s too late. A subquery inside a WHERE clause that references the outer query runs once per row. Not once total; once per row. On a table with 100,000 rows, that's 100,000 executions of your inner query.
A JOIN runs once.
-- Slow: subquery executes for every row in products
SELECT * FROM products p
WHERE p.prod_id = (
  SELECT s.prod_id FROM sales s
  WHERE s.cust_id = 100996
);
-- Fast: single join operation
SELECT p.* FROM products p
JOIN sales s ON p.prod_id = s.prod_id
WHERE s.cust_id = 100996;
If you’ve ever watched a query go from instant to “still running after 40 seconds” as the table grew, a correlated subquery somewhere is usually the culprit.
Four tips. Average time reduction across all four: 51%. And none of them required touching your schema, your indexes, or your infrastructure. Just the query.

The optimizer killers
These four are sneakier. The previous section was mostly about obvious waste: fetching columns you don’t need, grouping rows you’re about to throw away. This section is about patterns that look perfectly reasonable but quietly prevent the optimizer from using your indexes. That distinction matters a lot at scale.
Tip 5: IN instead of multiple ORs (73% reduction)
This one surprises people. Both IN and OR feel like they're doing the same thing, and syntactically they kind of are. The difference is what the optimizer can do with them.
With an IN list, the optimizer can sort the values and match them against the index in order. With chained OR conditions, it can't apply that optimization; it evaluates each condition independently, often falling back to a full scan.
-- Slow
WHERE prod_id = 14 OR prod_id = 17;
-- Fast
WHERE prod_id IN (14, 17);
Small change. The optimizer sees a completely different instruction. 73% faster.
Tip 6: EXISTS over DISTINCT on one-to-many joins (61% reduction)
When you join a parent table to a child table in a one-to-many relationship and slap DISTINCT on it to collapse the duplicates, the database fetches every matching row from both tables and then deduplicates the whole thing. That's a lot of rows moved around just to throw most of them away.
EXISTS short-circuits. It checks whether a match exists and stops the moment it finds one. It never fetches the duplicates in the first place.
-- Slow: fetches everything, then deduplicates
SELECT DISTINCT c.country_id, c.country_name
FROM countries c
JOIN customers e ON e.country_id = c.country_id;
-- Fast: stops at first match per country
SELECT c.country_id, c.country_name
FROM countries c
WHERE EXISTS (
  SELECT 1 FROM customers e
  WHERE e.country_id = c.country_id
);
Once you understand the short-circuit behavior, you’ll see DISTINCT on joins differently. It's not wrong; it's just usually the expensive way to ask a simple question.
Tip 7: UNION ALL over UNION (81% reduction)
UNION deduplicates. UNION ALL doesn't. That's the entire difference, and it costs 81% of your query time when the data can't have duplicates anyway or when you simply don't care.
-- Slow: scans combined result for duplicates
SELECT cust_id FROM sales
UNION
SELECT cust_id FROM customers;
-- Fast: skips deduplication entirely
SELECT cust_id FROM sales
UNION ALL
SELECT cust_id FROM customers;
The rule is simple: if your data sources can’t produce duplicates by definition, or if downstream logic handles it, UNION ALL is strictly faster. Using UNION by default because it feels safer is leaving 81% on the table.
Tip 8: Split OR in JOIN conditions into UNION ALL (70% reduction)
This is the least obvious one in the entire paper, and probably the most common silent killer in production codebases. An OR condition inside a JOIN prevents index usage on both sides. The optimizer sees it and gives up on the index entirely, falling back to a full scan of both tables.
The fix is to split it into two separate joins and combine them with UNION ALL.
-- Slow: OR blocks index usage on both columns
SELECT * FROM costs c
INNER JOIN products p
  ON c.unit_price = p.prod_min_price
  OR c.unit_price = p.prod_list_price;
-- Fast: two indexed joins, combined
SELECT * FROM costs c
INNER JOIN products p ON c.unit_price = p.prod_min_price
UNION ALL
SELECT * FROM costs c
INNER JOIN products p ON c.unit_price = p.prod_list_price;
It looks more verbose. It is more verbose. It’s also 70% faster because both joins can now use their indexes cleanly. Verbose and fast beats clean and slow every time a product manager asks why the report takes three minutes to load.
If you want to see exactly what your optimizer is doing with any of these patterns, MySQL’s optimizer trace and PostgreSQL’s EXPLAIN ANALYZE will show you the execution plan in detail and make the index abandonment painfully visible.
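As a sketch of what that looks like in practice (PostgreSQL syntax, using the hypothetical sales table from the examples above), run the OR and IN versions from Tip 5 through EXPLAIN ANALYZE and compare:
-- Runs the query for real and reports actual row counts and timings
EXPLAIN ANALYZE
SELECT * FROM sales WHERE prod_id IN (14, 17);
-- If this shows an Index Scan (or Bitmap Index Scan) where the OR
-- version showed a Seq Scan, the optimizer just confirmed the rewrite.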
Four tips. Reductions of 73%, 61%, 81%, and 70%. All from patterns that look harmless in a code review because the query returns the right data. Correctness and performance are different problems, and SQL will let you solve one while completely ignoring the other.

The silent ones
These two don’t announce themselves. No error, no warning, no slow query log entry when the table is small. You write them, they work, and then six months later when the data has grown and something is mysteriously sluggish, nobody connects it back to these lines.
Tip 9: No functions on indexed columns in WHERE (70% reduction)
This one is responsible for more silent performance regressions than almost anything else on this list. The logic feels completely reasonable when you write it: you need to filter by year, the column stores full dates, so you wrap it in EXTRACT() and move on.
-- Slow: function call prevents index usage
WHERE EXTRACT(YEAR FROM time_id) = 2001;
-- Fast: BETWEEN works with the index
WHERE time_id BETWEEN '01-JAN-2001' AND '31-DEC-2001';
Here’s what actually happens: the moment you wrap a column in a function inside a WHERE clause, the optimizer can't use the index on that column anymore. The index is organized around the raw column values. Once you transform those values with a function, the index has no idea how to help, so it doesn't. The database computes EXTRACT() for every single row, then filters. Full scan. Every time.
The fix is to move the transformation to the value side, not the column side. Instead of asking “what year does this date belong to,” ask “does this date fall inside this year.” BETWEEN two known date boundaries does exactly that, leaves the column untouched, and lets the index do its job.
This applies beyond EXTRACT(). Any function wrapping an indexed column in a WHERE clause (UPPER(), LOWER(), CAST(), DATE_TRUNC(), TO_CHAR()) breaks index usage the same way. It's a category of mistake, not just one specific function.
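To make the category concrete, here’s the same trap with a hypothetical email column (a sketch that assumes values are stored lowercase; if they aren’t, a functional index is the right tool instead):
-- Slow: wrapping the column hides it from the index
WHERE UPPER(email) = 'ALICE@EXAMPLE.COM';
-- Fast: transform the constant, leave the column bare
WHERE email = LOWER('ALICE@EXAMPLE.COM');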
70% reduction. Zero schema changes required.
Tip 10: Pre-calculate your math (11% reduction)
This one is smaller (11%), but it might be the most embarrassing item on the list once you understand what’s happening.
-- Slow: recalculates for every row
WHERE cust_id + 10000 < 35000;
-- Fast: constant evaluated once
WHERE cust_id < 25000;
When your WHERE clause contains arithmetic on a column, the database evaluates that expression for every row it scans. Not once; every row. The value 10000 isn't changing between rows. You already know the answer is 25000. But you made the database do the same addition hundreds of thousands of times because you didn't do it yourself first.
Do the math before the query runs. It costs you nothing and saves the database from doing redundant arithmetic at scale.
11% might sound modest compared to the numbers earlier in this article. But this is also the change that takes literally ten seconds. There’s no tradeoff to weigh, no refactor to plan. Just move the arithmetic outside the query. If you’re leaving 11% on the table because the fix felt too small to bother with, that’s a habit worth breaking.
Both of these fall into the same category: the optimizer wanted to help, and the way the query was written made that impossible. The database didn’t fail; it did exactly what you asked. You just didn’t realize what you were asking.
The pre-ship checklist
Before you push that query (whether it’s going into an ORM, a stored procedure, a data pipeline, or directly into prod), run it through this. Seven questions. Takes thirty seconds. A worked example follows the list.
SELECT * anywhere? → Specify only the columns you need
Filtering in HAVING? → Move it to WHERE if no aggregate involved
Any DISTINCT? → Do you actually need it, or is the key already there?
Nested subquery? → Can it be rewritten as a JOIN?
OR in WHERE or JOIN? → Try IN or UNION ALL instead
Function wrapping a column? → Move the transformation to the value side
Math on a column? → Pre-calculate it before the query runs
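Here’s that worked example: one hypothetical query committing three of the seven sins at once, and its fixed version (column names assume the Oracle sample Sales schema the paper tested against):
-- Before: SELECT *, a function wrapping an indexed column, chained ORs
SELECT *
FROM sales
WHERE EXTRACT(YEAR FROM time_id) = 2001
AND (prod_id = 14 OR prod_id = 17);
-- After: named columns, a bare column with BETWEEN, an IN list
SELECT prod_id, cust_id, time_id
FROM sales
WHERE time_id BETWEEN '01-JAN-2001' AND '31-DEC-2001'
AND prod_id IN (14, 17);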
None of these require a DBA, a schema change, or a ticket. They’re query-level habits. The kind that separate code that works from code that scales.
Bookmark this. Drop it in your team’s wiki. Put it in your CLAUDE.md if you're using Claude Code. Stick it somewhere you'll actually see it before the slow query alert fires at an inconvenient hour.
For deeper reading on index behavior and execution plans, Use The Index, Luke is the best free resource on the internet for this topic. No fluff, just mechanics.
The queries you wrote last year are probably still running
The paper this article is based on was published in 2015. The database you’re querying today almost certainly has rows that didn’t exist then. The query you wrote last year, maybe last month, is probably still executing exactly as you wrote it: habits intact, DISTINCT in place, SELECT * quietly fetching columns nobody asked for.
That’s the uncomfortable part. These aren’t exotic edge cases. They’re patterns that pass code review because the query returns correct results, and correctness is usually all anyone checks. Performance is someone else’s problem until it becomes everyone’s problem at 11pm on a Tuesday.
The slightly spicy take: most slow SQL isn’t a database problem. It’s a habits problem. The optimizer is genuinely trying to help you: it’s building execution plans, estimating costs, looking for index paths. What it can’t do is save you from instructions that actively prevent it from doing its job. Functions on indexed columns, OR conditions in joins, DISTINCT on results that can’t have duplicates: these aren’t bugs the database can route around. They’re the query working exactly as written.
The good news is that optimizers are getting smarter. PostgreSQL 16 brought improvements to parallel query execution and partition pruning. AI-assisted query analysis tools are starting to surface execution plan issues automatically. The tooling is moving in the right direction.
But none of it will save you from SELECT *.
The changes in this article aren’t framework upgrades or infrastructure investments. They’re rewrites. Most take under five minutes. Some show reductions above 80%. The paper tested them on Oracle, but the underlying optimizer logic holds across PostgreSQL, MySQL, and SQL Server; the principles are the same.
Go check what your slowest queries are doing. There’s a reasonable chance one of these seven checklist items is in there.
And if you’ve got a SQL habit that you’re mildly ashamed of (a correlated subquery you know you should rewrite, a SELECT * you keep meaning to fix), drop it in the comments. No judgment. We’ve all shipped something we’d rather not explain.
Resources
- Original paper: Jean Habimana, “Query Optimization Techniques: Tips For Writing Efficient And Faster SQL Queries”, IJSTR Vol. 4, Issue 10, October 2015
- Use The Index, Luke: index mechanics explained without the academic fog
- PostgreSQL query planning docs: how PG’s optimizer actually works
- MySQL optimizer trace: see exactly what MySQL is doing with your query
- explain.dalibo.com: visual EXPLAIN ANALYZE for Postgres, free and genuinely useful