Gowtham Potureddi

Data Engineering Roadmap for Freshers (2026): A 13-Step Beginner's Guide from SQL to Your First Data Engineering Job

Data engineering is one of the fastest-growing tech careers in 2026. Companies collect huge amounts of data every day, and data engineers build the systems that collect, clean, transform, store, and deliver that data so analysts, scientists, and product teams can use it. If you're a fresher and confused about where to start, this data engineering roadmap for freshers lays out a clear, ordered 13-step path — what to learn first, what to learn next, what to build, and how to prove the work to a recruiter.

This guide is a beginner-first walkthrough for how to become a data engineer in 2026 without a CS degree, three certificates, or a Spark cluster on day one. The 13 steps are grouped into five learning blocks below, each with a tiny worked example you can run on your laptop. Most freshers fail because they jump to Spark too early, ignore SQL depth, avoid projects, or watch tutorials without practising — the roadmap below fixes all four. Examples use PostgreSQL SQL (the dialect every coding-environment interview defaults to) and standard-library Python so you can run everything on a laptop without setup overhead. Default plan: about 6–9 months at 10–15 hours per week to be job-ready, 9–12 months at 6–8 hours per week for working learners.



1. Step 1 — Master SQL: The Most Important Skill for a Data Engineer

SQL fundamentals, joins, aggregations, window functions, and the queries you'll write every day

SQL is the foundation of data engineering — you'll write it daily for querying, cleaning, transforming, joining datasets, building reports, and writing ETL logic. Master SQL first; everything else becomes easier.

The five SQL skill clusters every fresher needs:

  • Basics — SELECT, WHERE, ORDER BY, LIMIT, DISTINCT.
  • Aggregations — COUNT, SUM, AVG, MIN, MAX, GROUP BY, HAVING.
  • Joins — INNER, LEFT, RIGHT, FULL, SELF.
  • Window functions — ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD.
  • Advanced — CTEs, subqueries, CASE, NULL handling, date functions, indexes, query optimisation.

[Image: Phase timeline table showing the four-phase data engineering roadmap for freshers — weeks 1-26 with one shippable proof per phase.]

Pro tip: SQL is non-negotiable. Drill it daily on a free coding environment (DataLemur, LeetCode SQL, StrataScratch, HackerRank SQL). Most fresher rejections at the SQL screen are not from missing syntax — they are from joining at the wrong grain or putting an aggregate in the wrong clause.

SQL basics — SELECT, WHERE, ORDER BY, LIMIT, DISTINCT

The bedrock SQL shape: SELECT cols FROM table WHERE row_filter ORDER BY col DESC LIMIT N. That one query covers most "show me the top X by Y" prompts you'll ever write.

  • SELECT cols FROM table — pick the columns you actually need; never SELECT * in production.
  • WHERE filter — row-level predicate; runs before grouping.
  • ORDER BY col DESC — sort the result; ASC is default, DESC is biggest-first.
  • LIMIT N — keep only the top N rows.
  • DISTINCT col — collapse duplicate values so each distinct value appears once.

Example input. A 4-row employees table with name and salary.

name salary
Alice 70000
Bob 45000
Carol 90000
Dan 55000

Question. Return the names and salaries of employees who earn more than 50,000, sorted from highest to lowest salary.

Code solution.

SELECT name, salary
FROM employees
WHERE salary > 50000
ORDER BY salary DESC;

Explanation of code. WHERE salary > 50000 runs first and drops Bob (45000). The remaining three rows are then sorted by salary in descending order, so Carol (highest) comes first, Alice second, Dan third. No LIMIT, so all three qualifying rows are returned.

Step-by-step.

step clause result
1 FROM employees scan all 4 rows
2 WHERE salary > 50000 drop Bob (45000); 3 rows left
3 ORDER BY salary DESC sort: Carol (90000) → Alice (70000) → Dan (55000)
4 SELECT name, salary project the two named columns

Output.

name salary
Carol 90000
Alice 70000
Dan 55000

Rule of thumb: always name the columns in the SELECT; SELECT * outside an exploratory REPL is a code smell.

Aggregations — GROUP BY + HAVING

The aggregation shape: SELECT dim, AGG(col) FROM table GROUP BY dim HAVING AGG_filter. GROUP BY collapses many rows to one row per group; HAVING filters the resulting groups (you cannot put an aggregate in WHERE).

  • COUNT(*) — number of rows per group.
  • SUM(col) / AVG(col) / MIN(col) / MAX(col) — collapse a numeric column.
  • GROUP BY dim — one output row per distinct value of dim.
  • HAVING AGG > N — keep only groups whose aggregate exceeds N.

Example input. A 6-row employees table with department and salary.

name department salary
Alice Engineering 90000
Bob Engineering 80000
Carol Sales 50000
Dan Sales 55000
Eve Marketing 65000
Frank Marketing 60000

Question. Return the average salary per department, but only show departments whose average exceeds 60,000.

Code solution.

SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 60000;

Explanation of code. GROUP BY department collapses the six rows into three groups — Engineering, Sales, Marketing. AVG(salary) computes the per-group average: Engineering 85000, Sales 52500, Marketing 62500. HAVING AVG(salary) > 60000 then drops Sales (52500 fails the threshold) and keeps the other two.

Step-by-step.

step clause result
1 FROM employees scan all 6 rows
2 GROUP BY department 3 groups — Engineering (2 rows), Sales (2 rows), Marketing (2 rows)
3 AVG(salary) per group Engineering 85000, Sales 52500, Marketing 62500
4 HAVING AVG(salary) > 60000 drop Sales (52500 fails); keep Engineering + Marketing
5 SELECT department, AVG(salary) AS avg_salary project the 2 surviving rows

Output.

department avg_salary
Engineering 85000
Marketing 62500

Rule of thumb: row predicates → WHERE; aggregate predicates → HAVING. Putting AVG(...) > X in WHERE is rejected by PostgreSQL (aggregates are not allowed in WHERE).

Joins — connecting tables on a common key

Joins combine columns from two tables on a matching key. The four every fresher needs: INNER (only matched rows survive), LEFT (all rows from the left table, even unmatched), RIGHT (mirror of LEFT, rarely used), FULL (all rows from both sides). SELF JOIN joins a table to itself for hierarchies (manager / employee, parent / child).

  • INNER JOIN — strict match on both sides.
  • LEFT JOIN — keep every left row; NULL on the right when no match.
  • RIGHT JOIN — same as LEFT with sides swapped; usually rewrite as LEFT.
  • FULL JOIN — keep every row from both sides; useful for reconciliation.
  • SELF JOIN — alias the same table twice (employees a JOIN employees b ON a.manager_id = b.id).

Example input. An orders table and a customers table.

orders:

order_id customer_id
101 C1
102 C2
103 C1

customers:

customer_id customer_name
C1 Alice
C2 Bob

Question. Return one row per order showing order_id and the matching customer_name.

Code solution.

SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c
    ON o.customer_id = c.customer_id;

Explanation of code. The INNER JOIN (the default form when you just write JOIN) matches each order to its customer using customer_id. Order 101 → Alice, order 102 → Bob, order 103 → Alice. All three orders have a matching customer, so every order survives.

Step-by-step.

step action result
1 scan orders (left side) 3 rows: 101→C1, 102→C2, 103→C1
2 for each row, look up customer_id in customers C1→Alice (twice), C2→Bob
3 INNER JOIN keeps only matched pairs all 3 orders matched, 0 dropped
4 SELECT o.order_id, c.customer_name project the 2 named columns

Output.

order_id customer_name
101 Alice
102 Bob
103 Alice

Rule of thumb: always give every table a short alias (o, c) and prefix every column (o.order_id, c.customer_name) — the SQL becomes self-documenting.
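
The SELF JOIN from the list above is worth one tiny sketch of its own. A minimal manager lookup using the aliasing pattern from that bullet (assuming an employees table with id, name, and a nullable manager_id):

SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m
    ON e.manager_id = m.id;

LEFT JOIN rather than INNER keeps the top-level employee whose manager_id is NULL.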

Common beginner mistakes

  • Using SELECT * everywhere — production queries always name the columns.
  • Putting an aggregate in WHERE instead of HAVING — PostgreSQL rejects the query with "aggregate functions are not allowed in WHERE".
  • Joining at the wrong grain (one-to-many without thinking) — the #1 source of "the number is suddenly 3× too high" bugs.
  • Memorising syntax without internalising which side keeps its rows in a LEFT JOIN — the part that breaks numbers.
  • Skipping window functions because they "look hard" — interviewers love them; they take a week to learn.

Worked Problem on Ranking Top Earners per Department with Window Functions

Example input. A 6-row employees table mixing departments and salaries.

name department salary
Alice Engineering 90000
Bob Engineering 80000
Carol Sales 50000
Dan Sales 55000
Eve Marketing 65000
Frank Marketing 60000

Question. Rank each employee by salary within their department (highest = rank 1) and return only the top earner per department. Use a window function — pure GROUP BY cannot keep both the rank and the row's other columns.

Solution Using ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC)

Code solution.

SELECT department, name, salary
FROM (
    SELECT
        department,
        name,
        salary,
        ROW_NUMBER() OVER (
            PARTITION BY department
            ORDER BY salary DESC
        ) AS rank
    FROM employees
) t
WHERE rank = 1;

Explanation of code. ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) assigns a strict 1, 2, 3 sequence within each department, ordered by salary from highest to lowest. The outer WHERE rank = 1 keeps only the top-paid row per department. The wrapping subquery is needed because PostgreSQL evaluates window functions after WHERE, so we cannot filter rank = 1 in the same level where we compute it.

Output.

department name salary
Engineering Alice 90000
Marketing Eve 65000
Sales Dan 55000

Step-by-step trace for the input rows above:

department name salary rank
Engineering Alice 90000 1
Engineering Bob 80000 2
Marketing Eve 65000 1
Marketing Frank 60000 2
Sales Dan 55000 1
Sales Carol 50000 2

After WHERE rank = 1: three rows — one per department, the top earner.

Why this works — concept by concept:

  • PARTITION BY department — defines the group inside which the ranking happens; without it, the rank would be global across all employees.
  • ORDER BY salary DESC — descending so rank 1 is the highest-paid; ascending would give the lowest.
  • ROW_NUMBER not RANK — strict 1, 2, 3; ties produce one rank-1 row per partition, which is what "top earner" demands.
  • Outer WHERE rank = 1 filter — Postgres cannot filter window-function output in the same query level; the wrap is required.
  • One row per department guaranteed — ROW_NUMBER (not RANK or DENSE_RANK) ensures no ties, so the result has exactly one row per group.
  • Cost — O(N log N) from the partitioned sort; with an index on (department, salary DESC) this becomes O(N).
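
The same query reads a little cleaner as a CTE (from the "Advanced" cluster above); the logic is identical: compute the window column inside the CTE, then filter it in the outer query.

WITH ranked AS (
    SELECT
        department,
        name,
        salary,
        ROW_NUMBER() OVER (
            PARTITION BY department
            ORDER BY salary DESC
        ) AS rank
    FROM employees
)
SELECT department, name, salary
FROM ranked
WHERE rank = 1;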

Inline CTA: drill the SQL practice page for short curated reps; the structured path for fresher SQL is SQL for Data Engineering Interviews — From Zero to FAANG.

SQL
Topic — window functions
SQL window-function problems

Practice →

SQL
Topic — joins
SQL join problems

Practice →

SQL
Topic — aggregation
SQL aggregation problems

Practice →


2. Step 2 — Learn Python for Data Engineering

Core Python, file handling, Pandas, and the API requests every DE writes

Python is the glue language for everything outside the database — ETL scripts, automation, data pipelines, API integrations, transformations. You don't need to be a Python wizard; you need to be fluent at reading CSVs, calling APIs, transforming data with Pandas, and writing small testable functions.

Three Python skill clusters every fresher needs:

  • Core Python — variables, loops, functions, lists / dicts / sets, classes, exception handling.
  • File handling — read and write CSV, JSON, and Excel files using the standard library and Pandas.
  • Libraries — Pandas for data transformation; Requests for API calls; PySpark later (Step 6) for big-data processing.

[Image: Diagram of what a data engineer actually does — sources, pipelines, warehouse, consumers — with the data engineer owning the middle two stages.]

Pro tip: the 10% of Python you actually use day-to-day is csv, json, pathlib, collections, dataclasses, typing, and pandas. Skip metaclasses, descriptors, and async event loops on day one — they're irrelevant to fresher DE work.

Core Python — loops, lists, and small functions

The fresher Python invariant: prefer small, testable functions wrapping loops over lists / dicts. Type hints (def f(x: int) -> int:) make a 2-month-old script readable when you come back to it.

  • Variables and types — int, float, str, bool, None.
  • Lists, dicts, sets — ordered, key-value, unique-only.
  • Loops — for x in xs: over iterables.
  • Functions — single-responsibility; takes inputs, returns outputs.
  • Exception handling — try / except FileNotFoundError for fragile I/O.

Example input. A Python list of three integers.

data = [1, 2, 3]

Question. Multiply each number by 2 and print the result. Show the canonical for loop pattern that every other Python data-engineering script will mirror.

Code solution.

data = [1, 2, 3]

for num in data:
    print(num * 2)

Explanation of code. for num in data: walks the list one element at a time, binding the current value to num. Inside the loop body, num * 2 doubles the value and print(...) writes it to stdout. The pattern generalises directly to "for every row in this CSV, do something" — replace data with csv.DictReader(f) and you have an ETL skeleton.

Step-by-step.

iteration num num * 2 stdout
1 1 2 2
2 2 4 4
3 3 6 6
end loop exits when list is exhausted

Output.

2
4
6

Rule of thumb: if your Python script grows past 100 lines and has zero functions, it's a notebook draft, not a script — refactor before sharing it.
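
As a concrete version of that rule, here is the doubling loop refactored into the small typed function shape the invariant above asks for (a minimal sketch; the name double_all is illustrative):

def double_all(values: list[int]) -> list[int]:
    # one job: take a list of ints, return each element doubled
    return [v * 2 for v in values]

print(double_all([1, 2, 3]))  # [2, 4, 6]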

File handling — reading CSV and JSON

Most data-engineering Python is reading a file, transforming the contents, and writing the result somewhere. The standard library has csv and json modules that cover 90% of fresher needs; for anything richer reach for Pandas.

  • open(path, encoding='utf-8') — open a text file safely.
  • csv.DictReader(f) — iterate CSV rows as dictionaries (column-name access).
  • json.load(f) — parse a JSON file into a Python dict / list.
  • pathlib.Path('file.csv') — modern path object; works on Windows, macOS, Linux.

Example input. A data.json file containing one JSON object.

{"name": "Alice", "salary": 70000}

Question. Open data.json, parse it into a Python dict, and print the parsed result.

Code solution.

import json

with open("data.json") as f:
    data = json.load(f)

print(data)

Explanation of code. with open("data.json") as f: opens the file safely (the with block guarantees the file is closed when the block exits, even on error). json.load(f) parses the file's contents into a Python object — a dict here because the JSON started with {. Printing the dict shows the parsed data.

Step-by-step.

step action result
1 with open("data.json") as f file handle f opens in text mode
2 json.load(f) reads bytes parses JSON object → Python dict
3 bind result to data data = {"name": "Alice", "salary": 70000}
4 exit with block file auto-closed (even on error)
5 print(data) dict printed to stdout

Output.

{'name': 'Alice', 'salary': 70000}

Rule of thumb: always use with open(...) rather than the bare open() call — it auto-closes the file and handles exceptions cleanly.
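
The csv.DictReader counterpart from the list above follows the same with open(...) shape, plus the try / except from the core-Python list. A minimal sketch, assuming a data.csv with name and salary header columns:

import csv

try:
    with open("data.csv", encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f):
            # each row is a dict keyed by the header, e.g. {'name': 'Alice', 'salary': '70000'}
            print(row["name"], int(row["salary"]))
except FileNotFoundError:
    print("data.csv is missing; check the path")

Note the int(...) cast: csv always hands you strings, even for numeric columns.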

Pandas for tabular data — read_csv, groupby, sum

Pandas is the Python library every DE uses for transforming tabular data. The three operations you'll do hundreds of times: read a CSV into a DataFrame, group by one or more columns, aggregate with sum / mean / count. Requests is the API-call counterpart.

  • pd.read_csv('file.csv') — read a CSV into a DataFrame.
  • df.groupby('col') — group rows by a column.
  • .sum() / .mean() / .count() — aggregate the groups.
  • requests.get(url).json() — fetch a URL and parse the JSON response.

Example input. A sales.csv file with 5 rows across two regions.

order_id region amount
1 North 100
2 North 150
3 South 80
4 South 120
5 North 70

Question. Read sales.csv into a Pandas DataFrame, group by region, and print the sum of amount per region.

Code solution.

import pandas as pd

df = pd.read_csv("sales.csv")
print(df.groupby("region").sum())

Explanation of code. pd.read_csv("sales.csv") loads the entire CSV into a DataFrame, with the first row treated as column headers. df.groupby("region") produces a grouped object that buckets rows by region. .sum() aggregates every numeric column within each bucket — here that's order_id (sum of IDs, usually meaningless) and amount (the metric we care about).

Step-by-step.

step action result
1 pd.read_csv("sales.csv") DataFrame with 5 rows × 3 columns
2 df.groupby("region") bucket rows: North = {1, 2, 5}; South = {3, 4}
3 .sum() per bucket North: order_id sum = 8, amount = 320 (100+150+70); South: order_id sum = 7, amount = 200 (80+120)
4 print(...) the two-row grouped frame prints to stdout

Output.

        order_id  amount
region
North          8     320
South          7     200

Rule of thumb: when the data fits in memory and you don't need a database, Pandas is faster than writing the SQL — but for anything past a few million rows, push the work back into SQL or PySpark.
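
The requests.get(url).json() pattern from the list above is one line in the happy path; in practice always add the timeout that the mistakes list below warns about. A minimal sketch against a hypothetical endpoint:

import requests

url = "https://api.example.com/orders"  # hypothetical endpoint; swap in a real API

resp = requests.get(url, timeout=10)  # never call requests without a timeout
resp.raise_for_status()               # turn 4xx / 5xx responses into exceptions
orders = resp.json()                  # parsed JSON body as a Python list / dict
print(len(orders))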

Common beginner mistakes

  • Skipping type hints — code becomes unreadable in 2 months.
  • Reading huge CSVs into Pandas without chunksize — your laptop runs out of RAM.
  • Using requests without a timeout — a hung API call freezes your script forever (requests.get(url, timeout=10)).
  • Not handling None / missing values — int(None) crashes with a TypeError.
  • Writing 200-line scripts as one big block — break into def-defined functions.

Worked Problem on Building a CSV-to-Summary Python ETL Script

Example input. A sales.csv file with 5 rows.

order_id region amount
1 North 100
2 North 150
3 South 80
4 South 120
5 North 70

Question. Build a small Python ETL script that reads sales.csv, sums amount per region, writes the result to summary.csv, and prints the count of rows processed. This is the canonical Phase-1 portfolio script every fresher should ship to GitHub.

Solution Using Pandas + a writeable summary path

Code solution.

import pandas as pd
from pathlib import Path

def summarise_sales(input_path: Path, output_path: Path) -> int:
    df = pd.read_csv(input_path)
    summary = df.groupby("region", as_index=False)["amount"].sum()
    summary.to_csv(output_path, index=False)
    return len(df)

if __name__ == "__main__":
    rows = summarise_sales(Path("sales.csv"), Path("summary.csv"))
    print(f"processed {rows} rows")

Explanation of code. The function takes two Path objects so it's testable — you can call it from a test with mock paths instead of hardcoding filenames. pd.read_csv(input_path) loads the CSV, groupby("region", as_index=False)["amount"].sum() produces a clean two-column summary (as_index=False keeps region as a column rather than becoming the index), and to_csv(output_path, index=False) writes the summary back out without Pandas' default integer index column. The function returns the row count so the caller can log a clean status line.

Output.

summary.csv:

region amount
North 320
South 200

stdout: processed 5 rows

Step-by-step trace for the input rows above:

step action result
1 pd.read_csv("sales.csv") DataFrame with 5 rows × 3 columns
2 df.groupby("region", as_index=False)["amount"].sum() 2-row summary DataFrame
3 summary.to_csv("summary.csv", index=False) file written to disk
4 return len(df) returns 5
5 print(f"processed {rows} rows") stdout: processed 5 rows

Why this works — concept by concept:

  • Path objects for testable I/O — paths are inputs, not hardcoded constants, so the function works with any source / destination.
  • groupby(..., as_index=False) — keeps region as a regular column instead of the DataFrame index; the resulting CSV reads naturally.
  • ["amount"].sum() — selects the metric column before aggregation; otherwise Pandas would also sum order_id, which is meaningless.
  • to_csv(..., index=False) — suppresses Pandas' default integer index column; the CSV has only the two columns you actually want.
  • return + print separation — the function returns a value (good for tests); the caller decides whether to print it (good for scripts vs imports).
  • Cost — O(N) where N is the input row count; fits in memory up to a few million rows.

Inline CTA: for fresher Python reps see Python practice page; the structured path is Python for Data Engineering Interviews — Complete Fundamentals.

PYTHON
Language — Python
Python practice problems

Practice →

COURSE
Course — Python for DE
Python for Data Engineering Interviews

View course →

SQL
Topic — CSV processing
SQL CSV-processing problems

Practice →


3. Steps 3-5 — Databases, Data Warehousing, and ETL/ELT

How data is stored, modeled, and moved through pipelines

Three closely-related steps in one section because they answer the same question: where does the data live, and how does it get there?

  • Step 3 — Databases. Relational (PostgreSQL, MySQL) for transactional workloads; NoSQL (MongoDB, Cassandra, Redis) for specialised cases. Learn keys, normalisation, transactions, indexing, ACID.
  • Step 4 — Data Warehousing. Snowflake, BigQuery, Redshift store analytics-ready data in fact tables + dimension tables, organised as a star schema (fact in the middle, dimensions hanging off). Heavily asked in interviews.
  • Step 5 — ETL / ELT. ETL = Extract → Transform → Load (transform before loading). ELT = Extract → Load → Transform (load raw, then transform inside the warehouse). Plus batch vs streaming pipelines, incremental loads, and CDC (change data capture).

[Image: ETL flow diagram for freshers — source CSV through staging table to a curated warehouse table with the safe-rerun (idempotent) pattern.]

Pro tip: the three databases worth installing for practice: PostgreSQL (covers 90% of relational SQL you'll see at work), SQLite (zero-setup local dev), and one NoSQL (MongoDB is the friendliest). Skip Redis until you genuinely need a cache; skip Cassandra until you genuinely have wide-column data.

Relational databases — tables, keys, normalisation, ACID

Relational databases store data in tables with primary keys (one column uniquely identifies each row) and foreign keys (a column in one table references the primary key of another). Normalisation splits data so each fact lives in exactly one place — no duplication, no inconsistency. ACID properties (Atomicity, Consistency, Isolation, Durability) guarantee that transactions either fully succeed or fully roll back.

  • Primary key — uniquely identifies a row (customer_id in customers).
  • Foreign key — points to another table's primary key (orders.customer_id → customers.customer_id).
  • Normalisation — 1NF / 2NF / 3NF; split tables until each fact lives once.
  • Indexing — speeds up lookups; trade-off is slower writes and extra storage.
  • ACID transactions — BEGIN; … COMMIT; (or ROLLBACK; on failure).

Example input. A two-table relational design — orders references customers via the foreign key customer_id.

customers:

customer_id customer_name
C1 Alice
C2 Bob

orders:

order_id customer_id amount
101 C1 100
102 C2 200
103 C1 50

Question. Write the CREATE TABLE statements for customers and orders with proper primary keys and a foreign key from orders to customers. Then write a transactional INSERT that adds a new customer plus their first order atomically — both rows commit or neither does.

Code solution.

CREATE TABLE customers (
    customer_id   TEXT PRIMARY KEY,
    customer_name TEXT NOT NULL
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id TEXT NOT NULL REFERENCES customers(customer_id),
    amount      NUMERIC(10,2) NOT NULL
);

BEGIN;
INSERT INTO customers (customer_id, customer_name) VALUES ('C3', 'Carol');
INSERT INTO orders (order_id, customer_id, amount) VALUES (104, 'C3', 75);
COMMIT;

Explanation of code. The customers table declares customer_id as PRIMARY KEY (uniqueness + index automatically created). The orders table's customer_id is REFERENCES customers(customer_id) — a foreign key that prevents you from inserting an order for a non-existent customer. The BEGIN; … COMMIT; block makes both inserts a single atomic transaction: if the second insert fails for any reason, the first is rolled back automatically — the database never ends up with a customer who has no order or an order pointing to a missing customer.

Step-by-step.

step statement result
1 CREATE TABLE customers empty table, customer_id enforced unique
2 CREATE TABLE orders empty table; FK rejects orphan customer_id
3 BEGIN open a transaction — changes are invisible until commit
4 INSERT INTO customers ('C3', 'Carol') row staged; FK in orders will accept C3 later
5 INSERT INTO orders (104, 'C3', 75) row staged; FK satisfied because C3 exists in-tx
6 COMMIT both rows persisted atomically; on error, both rolled back

Output.

After the transaction commits:

customers:

customer_id customer_name
C1 Alice
C2 Bob
C3 Carol

orders:

order_id customer_id amount
101 C1 100
102 C2 200
103 C1 50
104 C3 75

Rule of thumb: every multi-row write that has to be "all or nothing" goes inside a BEGIN; … COMMIT; block — that's the entire point of a relational database.
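
Indexing from the list above is a single statement. A minimal sketch that speeds up the customer lookup in orders (the index name is illustrative):

-- speeds up WHERE customer_id = ... and the FK join to customers;
-- trade-off: every INSERT / UPDATE now also maintains the index
CREATE INDEX idx_orders_customer_id ON orders (customer_id);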

Data warehousing — fact tables, dimension tables, star schema

A data warehouse stores analytics-ready data optimised for fast SELECT queries (not for high-volume INSERT / UPDATE). The canonical model is the star schema — one fact table in the middle that records events (sales, clicks, logins) surrounded by dimension tables that describe context (customers, products, dates). Heavily tested at interviews.

  • Fact table — measures events; mostly numeric columns + foreign keys to dimensions.
  • Dimension table — descriptive context; mostly text columns (customer name, product category).
  • Star schema — one fact in the centre, dimensions hanging off as star points.
  • Snowflake schema — dimensions further normalised into sub-dimensions.
  • Partitioning / clustering — physical layout choices that speed up filtered queries.

Example input. A star-schema design for an e-commerce sales fact with three dimensions.

fact_sales:

sale_id date_id customer_id product_id amount
S1 20260501 C1 P1 100
S2 20260501 C2 P2 200

dim_customer:

customer_id customer_name
C1 Alice
C2 Bob

dim_product:

product_id product_name
P1 Book
P2 Headphones

dim_date:

date_id day month year
20260501 1 5 2026

Question. Write a query that joins the fact to all three dimensions and returns customer_name, product_name, month, and amount for every sale. This is the canonical "fact + dim rollup" report every BI dashboard runs.

Code solution.

SELECT
    c.customer_name,
    p.product_name,
    d.month,
    f.amount
FROM fact_sales f
JOIN dim_customer c ON c.customer_id = f.customer_id
JOIN dim_product  p ON p.product_id  = f.product_id
JOIN dim_date     d ON d.date_id     = f.date_id;
Enter fullscreen mode Exit fullscreen mode

Explanation of code. The fact table sits in the middle and is joined once to each dimension on the matching surrogate key. Because each dimension has exactly one row per dimension key, the joins do not multiply rows — the output has the same number of rows as fact_sales. The SELECT then pulls the descriptive columns from the dimensions plus the amount from the fact.

Step-by-step.

step action result
1 scan fact_sales 2 rows (S1, S2)
2 join dim_customer on customer_id S1 → Alice, S2 → Bob; row count unchanged (1:1)
3 join dim_product on product_id S1 → Book, S2 → Headphones; row count unchanged
4 join dim_date on date_id both rows pick up month=5; row count unchanged
5 SELECT 4 projected columns final 2-row report

Output.

customer_name product_name month amount
Alice Book 5 100
Bob Headphones 5 200

Rule of thumb: fact tables hold the measure; dimensions hold the context. If you can't tell whether a column belongs in the fact or the dim, ask "is this a number we'll aggregate, or text we'll group by?"

ETL vs ELT, batch vs streaming, and CDC in plain English

ETL = extract, transform, load — read source data, transform it in a separate engine (Spark, Python), then load the clean result into the warehouse. ELT = extract, load, transform — load the raw source straight into the warehouse, then transform with SQL. Modern cloud warehouses are powerful enough that ELT has become the default. Batch processes data on a schedule (every hour / day); streaming processes data as it arrives (sub-second). CDC (change data capture) tracks INSERT / UPDATE / DELETE events on a source so the warehouse stays in sync without re-loading the whole table.

  • ETL — transform outside the warehouse (older pattern; Spark, Python, custom).
  • ELT — transform inside the warehouse with SQL (newer; dbt, Snowflake, BigQuery).
  • Batch — scheduled jobs (hourly, daily); cheaper, simpler, slightly stale data.
  • Streaming — event-by-event processing (Kafka, Flink); fresher, more expensive.
  • CDC — incremental change tracking; loads only what changed since last run.

Example input. A daily-batch ETL skeleton in Python that loads yesterday's orders, transforms them, and writes a curated table.

source: raw orders dropped to S3 daily under s3://orders/2026-05-08/orders.csv
target: warehouse table fact_orders, partitioned by order_date

Question. Sketch the three-stage ETL pipeline shape — extract reads the CSV, transform cleans / dedupes / casts types, load writes to the warehouse. Use plain Python pseudocode; the goal is the shape not a runnable example.

Code solution.

import pandas as pd

def extract(date: str) -> pd.DataFrame:
    return pd.read_csv(f"s3://orders/{date}/orders.csv")

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset=["order_id"])
    df["order_date"] = pd.to_datetime(df["order_date"]).dt.date
    df["amount"]     = df["amount"].astype(float)
    return df

def load(df: pd.DataFrame, partition_date: str) -> int:
    # warehouse-specific load: COPY or DELETE+INSERT for the partition_date slice of fact_orders
    return len(df)

if __name__ == "__main__":
    date = "2026-05-08"
    rows = load(transform(extract(date)), date)
    print(f"loaded {rows} rows for {date}")

Explanation of code. extract is the only function that knows where the source is; transform is pure (no I/O) and easy to unit-test; load is the only function that writes to the warehouse. Splitting the pipeline into three named functions makes the script readable, testable, and easy to swap (you can replace extract with a Postgres reader without touching transform). The dedupe + type-cast inside transform is the canonical "raw → curated" cleaning step.

Step-by-step.

step function what it does result
1 extract("2026-05-08") read S3 path for that day raw DataFrame from CSV
2 transform(df) step a drop_duplicates(subset=["order_id"]) duplicate orders removed
3 transform(df) step b pd.to_datetime(...).dt.date order_date cast to date type
4 transform(df) step c astype(float) amount cast to float
5 load(df, date) warehouse COPY / INSERT row count returned
6 print(...) stdout summary loaded 5 rows for 2026-05-08

Output.

loaded 5 rows for 2026-05-08

Rule of thumb: always separate extract, transform, load into three named functions — even when the pipeline is small. The shape is what reviewers look for.
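
CDC from the list above can be approximated without special tooling: store a watermark after each successful run and extract only rows that changed since. A minimal SQL sketch, assuming the source table carries an updated_at column (:last_run_ts is a bind-parameter placeholder for the stored watermark):

-- incremental extract: only rows touched since the last successful run
SELECT *
FROM source_orders
WHERE updated_at > :last_run_ts;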

Common beginner mistakes

  • Treating data warehouses like OLTP databases — running thousands of UPDATEs per minute (warehouses optimise for SELECT, not UPDATE).
  • Modelling everything in one wide table — kills performance and makes joins impossible later.
  • Confusing batch and streaming — batch is the default; pick streaming only when you genuinely need sub-second freshness.
  • Forgetting CDC — re-loading the whole customers table every night when only 100 rows changed wastes hours.
  • Skipping the staging step — going source → curated directly means you can't reproduce yesterday's run.

Worked Problem on Building an Idempotent Daily ETL with Quality Checks

Example input. A daily CSV orders_2026-05-08.csv that lands in S3. The warehouse has a fact_orders table partitioned by order_date. The pipeline must be idempotent — running it twice with the same input produces the same output.

order_id order_date amount
1 2026-05-08 100
2 2026-05-08 200
3 2026-05-08 50

Question. Write a Python script that loads the daily CSV, replaces today's partition (so a rerun does not double-count), and runs three data-quality checks (row count > 0, no NULL order_ids, no duplicate order_ids). Fail loudly with a non-zero exit code if any check fails.

Solution Using DELETE of today's partition + three quality checks

Code solution.

import sys
import pandas as pd
import psycopg2

LOAD_DATE = "2026-05-08"

def run(conn, csv_path: str) -> int:
    df = pd.read_csv(csv_path)
    with conn.cursor() as cur:
        cur.execute("DELETE FROM fact_orders WHERE order_date = %s;", (LOAD_DATE,))
        for _, row in df.iterrows():
            cur.execute(
                "INSERT INTO fact_orders (order_id, order_date, amount) VALUES (%s, %s, %s);",
                (int(row["order_id"]), row["order_date"], float(row["amount"])),
            )
        cur.execute("SELECT COUNT(*) FROM fact_orders WHERE order_date = %s;", (LOAD_DATE,))
        if cur.fetchone()[0] == 0:
            return 1
        cur.execute("SELECT COUNT(*) FROM fact_orders WHERE order_id IS NULL;")
        if cur.fetchone()[0] > 0:
            return 1
        cur.execute("""
            SELECT COUNT(*) FROM (
              SELECT order_id, COUNT(*) c FROM fact_orders GROUP BY 1 HAVING COUNT(*) > 1
            ) d;
        """)
        if cur.fetchone()[0] > 0:
            return 1
    conn.commit()
    return 0

if __name__ == "__main__":
    conn = psycopg2.connect(dbname="warehouse")
    sys.exit(run(conn, f"orders_{LOAD_DATE}.csv"))

Explanation of code. DELETE FROM fact_orders WHERE order_date = LOAD_DATE wipes today's partition before re-inserting — that's what makes the pipeline idempotent (a rerun overwrites today's slice instead of appending). The INSERT loop loads every CSV row with explicit type casts so dates land as dates and amounts land as numbers. Three quality checks then verify the load worked — non-zero row count, no null primary keys, no duplicates. Any failure returns exit code 1 so the orchestrator (Airflow, cron) notices automatically and the developer is paged.

Output.

After a healthy run:

order_id order_date amount
1 2026-05-08 100
2 2026-05-08 200
3 2026-05-08 50

Exit code: 0. A second run of the same script produces an identical fact_orders (idempotent).

Step-by-step trace for a clean 3-row CSV:

step action result
1 DELETE WHERE order_date = '2026-05-08' today's partition wiped
2 INSERT 3 CSV rows 3 rows in today's partition
3 row-count check 3 > 0 → pass
4 null-PK check 0 nulls → pass
5 duplicate-PK check 0 dupes → pass
6 commit exit 0

Why this works — concept by concept:

  • DELETE of today's partition before insert — makes the pipeline idempotent; rerun overwrites instead of appending.
  • explicit type casts in the INSERT — int(), float(), ISO date strings make the warehouse see clean types.
  • three quality checks inside the same job — checks live next to the load, not in a "we'll add monitoring later" backlog.
  • non-zero exit code on failure — Airflow / cron / GitHub Actions detect the failure automatically.
  • conn.commit() only on success — bad runs roll back; the warehouse is never left half-loaded.
  • Cost — O(rows in today's CSV); the historical fact_orders is only scanned for the duplicate check.

Inline CTA: for the structured ETL learning path see ETL System Design for Data Engineering Interviews and the Data Modeling course.

ETL
Topic — ETL
ETL practice problems

Practice →

COURSE
Course — Data Modeling
Data Modeling for Data Engineering Interviews

View course →

COURSE
Course — ETL System Design
ETL System Design for DE Interviews

View course →


4. Steps 6-9 — Apache Spark, Airflow, Cloud, and Data Modeling

From single-machine SQL and Pandas to production-scale pipelines

After SQL, Python, databases, and ETL fundamentals are solid, four scaling skills turn you from a script-writer into a production data engineer:

  • Step 6 — Apache Spark. The industry standard for large-scale processing; PySpark is its Python API. Learn DataFrames, transformations, actions, Spark SQL.
  • Step 7 — Workflow orchestration. Apache Airflow runs your pipelines on a schedule. Learn DAGs (directed acyclic graphs), tasks, operators, dependencies.
  • Step 8 — Cloud platforms. Modern data engineering lives on AWS, Azure, or GCP. Pick AWS first — it's the most asked. Learn S3, EC2, Lambda, Glue, Redshift, IAM.
  • Step 9 — Data modeling. OLTP vs OLAP, normalisation vs denormalisation, slowly changing dimensions (SCDs), fact-vs-dim design. Read the Kimball Data Warehouse Toolkit.

Pro tip: these are scale-and-production skills. Don't open them until SQL and Python are second nature. The most common fresher failure mode is "I learned Spark but I can't write a LEFT JOIN correctly under pressure." Master the foundations first.

Apache Spark + PySpark — the big-data engine

Apache Spark processes data that doesn't fit on a single machine by splitting work across a cluster. PySpark is its Python API — almost everything you do in Pandas has a PySpark equivalent, just distributed. The simplest entry point is SparkSession.builder.getOrCreate() followed by spark.read.csv(...) to load data.

  • SparkSession — the entry point; creates the cluster connection.
  • DataFrame — the main abstraction; like a Pandas DataFrame but distributed.
  • Transformations — select, filter, groupBy; lazy, they build a plan.
  • Actions — show, count, write; they trigger actual execution.
  • Spark SQL — register a DataFrame as a table and run SQL against it.

Example input. A sales.csv file similar to the Pandas example, but big enough that we want Spark to process it on a cluster.

order_id region amount
1 North 100
2 South 200
3 North 150

Question. Write a minimal PySpark script that reads sales.csv and shows the first few rows. Show the canonical SparkSession setup that every PySpark script begins with.

Code solution.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.read.csv("sales.csv", header=True, inferSchema=True)
df.show()

Explanation of code. SparkSession.builder.appName("demo").getOrCreate() either creates a new Spark session or attaches to an existing one — either way, you end up with a spark object that knows how to talk to the cluster. spark.read.csv("sales.csv", header=True, inferSchema=True) loads the file as a DataFrame, treating the first row as headers and inferring column types. df.show() is an action that triggers execution and prints the first 20 rows to stdout.

Step-by-step.

step call kind result
1 SparkSession.builder.appName(...).getOrCreate() setup spark session attached to a (local or cluster) executor
2 spark.read.csv(..., header=True, inferSchema=True) transformation (lazy) DataFrame plan registered — no rows scanned yet
3 df.show() action plan executes: read CSV → infer types → render first 20 rows
4 stdout grid-formatted table printed

Output.

+--------+------+------+
|order_id|region|amount|
+--------+------+------+
|       1| North|   100|
|       2| South|   200|
|       3| North|   150|
+--------+------+------+

Rule of thumb: in PySpark, transformations (filter, select, groupBy) are lazy — nothing runs until you call an action like .show(), .count(), or .write(). That's why the same PySpark code can be reused for 1 GB and 1 TB datasets.
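
To see the lazy / eager split concretely, here is the Pandas groupby from Step 2 redone in PySpark, reusing the df loaded above (a minimal sketch):

from pyspark.sql import functions as F

# transformation: builds a plan, scans nothing yet
summary = df.groupBy("region").agg(F.sum("amount").alias("total_amount"))

# action: triggers the actual distributed execution and prints the result
summary.show()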

Apache Airflow — DAGs, tasks, scheduling

Airflow is the workflow orchestrator most data teams use. You write a DAG (directed acyclic graph) of tasks; Airflow runs them on a schedule, respects dependencies, retries failures, and surfaces alerts. The minimum viable DAG is two tasks chained with >>.

  • DAG — the workflow; a Python file in the dags/ directory.
  • Task — a single unit of work (run a SQL query, call an API, run a PySpark job).
  • Operator — a reusable task type (BashOperator, PythonOperator, SQLExecuteQueryOperator).
  • Dependencies — task1 >> task2 means "run task1 then task2."
  • Schedule — schedule_interval='@daily', '0 3 * * *' (cron), or None for manual.

Example input. A simple two-stage daily ETL — extract data from an API, load it into a warehouse table.

task1: extract — call the API, write raw JSON to S3
task2: load — read S3 JSON, INSERT into fact_events
schedule: daily at 03:00 UTC

Question. Write a minimal Airflow DAG that defines two PythonOperator tasks extract_task and load_task, and chains them so load_task only runs after extract_task succeeds.

Code solution.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass  # call the API, write raw JSON to S3

def load():
    pass  # read S3 JSON, INSERT into fact_events

with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2026, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task    = PythonOperator(task_id="load",    python_callable=load)
    extract_task >> load_task

Explanation of code. The with DAG(...) as dag: block defines the DAG metadata — its name, when it starts, how often it runs (@daily is shorthand for "every day at midnight"), and whether to backfill missed runs (catchup=False means "no, just run from now on"). Two PythonOperator tasks wrap the actual Python functions. extract_task >> load_task declares the dependency — Airflow will only run load_task if extract_task succeeds.

Step-by-step.

step when action result
1 parse time DAG(...) instantiates DAG etl_pipeline registered in Airflow metadata
2 parse time PythonOperator(...) ×2 two tasks attached to the DAG
3 parse time extract_task >> load_task dependency edge added (extract → load)
4 every day at midnight scheduler triggers a DAG run extract task starts
5 extract succeeds scheduler sees green upstream load task starts
6 load succeeds DAG run marked success green tick in calendar grid
6′ extract fails downstream skipped red tick; alert fires

Output.

In the Airflow UI, this DAG appears as two boxes connected by an arrow:

[extract] → [load]

Each daily run produces a tick in the calendar grid; failures are red, successes are green.

Rule of thumb: one DAG = one logical workflow. If you find yourself writing 50 tasks in a single DAG, you probably want 5 DAGs of 10 tasks each — easier to debug, easier to retry.
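
Retries, which the intro above promises, are one argument away. A minimal sketch of the same DAG header with a retry policy (assuming Airflow 2.x):

from datetime import datetime, timedelta
from airflow import DAG

with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2026, 5, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "retries": 2,                         # re-run a failed task up to twice
        "retry_delay": timedelta(minutes=5),  # wait between attempts
    },
) as dag:
    ...  # tasks as before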

Cloud platforms — AWS first, then expand

Modern data engineering is cloud-based. Pick one platform first and learn its data services before branching out — most teams use AWS, so it's the highest-leverage starting point. Azure and GCP are equally valid second choices once you have one cloud under your belt.

  • S3 — object storage; where raw data lands.
  • EC2 — virtual machines; rarely touched directly anymore.
  • Lambda — serverless functions; great for small ETL triggers.
  • Glue — managed ETL service; runs Spark jobs without you managing the cluster.
  • Redshift — AWS data warehouse; SQL-compatible.
  • IAM — identity and access; non-optional — every cloud bug eventually traces back to permissions.

Example input. You have a daily CSV that lands in S3 at s3://my-bucket/orders/{date}/orders.csv and a Redshift table fact_orders to load it into.

Question. Write the AWS CLI / SQL pseudocode that copies the CSV from S3 into Redshift on a schedule. (Don't worry about IAM details; the goal is the shape.)

Code solution.

-- Inside Redshift, run on a schedule from Airflow / cron
COPY fact_orders
FROM 's3://my-bucket/orders/2026-05-08/orders.csv'
IAM_ROLE 'arn:aws:iam::ACCOUNT:role/RedshiftS3ReadRole'
CSV
IGNOREHEADER 1;

Explanation of code. COPY ... FROM 's3://...' is the Redshift-specific bulk-load command — it pulls a file directly from S3 into a table without needing an intermediate machine. IAM_ROLE references an AWS IAM role that grants Redshift permission to read that S3 bucket — without this, the copy fails with a permission error. CSV tells Redshift the file format; IGNOREHEADER 1 skips the column-header row.

Step-by-step.

step actor action result
1 Airflow / cron submits COPY to Redshift command queued
2 Redshift leader assumes RedshiftS3ReadRole temporary AWS credentials obtained
3 Redshift compute nodes parallel-fetch the S3 object bytes streamed direct to slices
4 parser apply CSV, IGNOREHEADER 1 header row skipped; data rows parsed
5 loader bulk-insert into fact_orders rows committed
6 system catalogue log to STL_LOAD_COMMITS row count + reject count recorded

Output.

The CSV's rows land in fact_orders. A small status row is logged in STL_LOAD_COMMITS showing how many rows were copied and whether any were rejected.

Rule of thumb: for AWS, S3 + IAM are the two services you actually need to be fluent in. Everything else (Lambda, Glue, Redshift) layers on top.

Common beginner mistakes

  • Learning Spark before SQL is solid — Spark is just bigger SQL with more failure modes.
  • Writing 1,000-line Airflow DAGs — split into smaller DAGs that each do one thing.
  • Storing AWS credentials in code — always use IAM roles or environment variables, never hardcode.
  • Ignoring data modeling because it "feels theoretical" — interviewers test it heavily.
  • Trying to learn all three clouds at once — pick AWS first; the others are easier once you know one.

Worked Problem on Designing a Slowly Changing Dimension (Type 2) for Customer Addresses

Example input. Customer C1 lives at "12 Old St" until 2026-03-15, then moves to "88 New Ave". The fact tables need to know which address was current at the time of each historical sale.

customer_id address valid_from valid_to is_current
C1 12 Old St 2025-01-01 2026-03-14 FALSE
C1 88 New Ave 2026-03-15 (NULL) TRUE

Question. Write the SQL to handle the address change as an SCD Type 2 update — close the old row by setting valid_to and is_current = FALSE, then insert a new row with the new address and is_current = TRUE. This pattern preserves historical correctness without losing the past.

Solution Using UPDATE to close the old row + INSERT for the new one

Code solution.

-- Step 1: close the existing current row
UPDATE dim_customer
SET valid_to = DATE '2026-03-14',
    is_current = FALSE
WHERE customer_id = 'C1' AND is_current = TRUE;

-- Step 2: insert the new current row
INSERT INTO dim_customer (customer_id, address, valid_from, valid_to, is_current)
VALUES ('C1', '88 New Ave', DATE '2026-03-15', NULL, TRUE);

Explanation of code. SCD Type 2 keeps full history by adding new rows rather than overwriting old ones. The UPDATE finds the row where is_current = TRUE for the customer and closes it — sets valid_to to the day before the change and is_current to FALSE. The INSERT then adds the new row with valid_from set to the change date, valid_to left NULL (still current), and is_current = TRUE. Historical fact tables can join to this dim with a date predicate to find the address that was current at the time of each sale.

Output.

After the two statements:

customer_id address valid_from valid_to is_current
C1 12 Old St 2025-01-01 2026-03-14 FALSE
C1 88 New Ave 2026-03-15 (NULL) TRUE

A query like WHERE is_current = TRUE returns only the current address. A historical join uses WHERE sale_date BETWEEN valid_from AND COALESCE(valid_to, DATE '9999-12-31') to pick the right address per sale.
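
Written out, that historical join is a single extra predicate. A minimal sketch, assuming a fact_sales table with customer_id and sale_date:

SELECT f.sale_id, f.sale_date, d.address
FROM fact_sales f
JOIN dim_customer d
    ON  d.customer_id = f.customer_id
    AND f.sale_date BETWEEN d.valid_from
                        AND COALESCE(d.valid_to, DATE '9999-12-31');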

Step-by-step trace for the change on 2026-03-15:

step action result
1 UPDATE closes row 1 valid_to = 2026-03-14, is_current = FALSE
2 INSERT adds row 2 new row with valid_from = 2026-03-15, is_current = TRUE
3 dimension now has 2 rows for C1 one historical, one current

Why this works — concept by concept:

  • SCD Type 2 keeps history — old rows are not overwritten; both versions of the customer's address coexist with date ranges.
  • valid_from / valid_to define the row's lifetime — the date range during which this row was the truth.
  • is_current = TRUE flag — shortcut for dashboards that always want the latest; saves an ORDER BY ... LIMIT 1 lookup.
  • historical joins use BETWEEN — pick the dim row whose date range contains the fact row's date.
  • COALESCE(valid_to, '9999-12-31') — handles the open-ended current row whose valid_to is NULL.
  • Cost — two row-level operations; constant time per dimension change.

Inline CTA: for the deeper modeling syllabus see Data Modeling for Data Engineering Interviews; when you do start Spark, the gentle entry point is PySpark Fundamentals.

COURSE
Course — PySpark
PySpark Fundamentals

View course →

COURSE
Course — Spark internals
Apache Spark Internals

View course →

SQL
Topic — slowly changing data
SCD practice problems

Practice →


5. Steps 10-13 — Streaming, Portfolio Projects, Git, and Interview Prep

From skills to a job offer — proving the work and clearing the loop

The last four steps turn your skills into a job offer. Streaming systems handle real-time data, portfolio projects prove you can ship, Git makes your code visible, and interview prep closes the deal.

  • Step 10 — Streaming systems. Kafka, event-driven architectures, message queues, real-time processing. Required for advanced roles; optional for first jobs.
  • Step 11 — Build five portfolio projects. SQL analytics, Python ETL, Airflow pipeline, PySpark large-data, cloud deployment. Put all on GitHub.
  • Step 12 — Master Git. clone, add, commit, push, branch, merge — every company uses Git from day one.
  • Step 13 — Interview prep. SQL questions (joins, windows, aggregations, ranking), Python questions (dicts, strings, lists, hashmaps), system-design basics (ETL architecture, lake vs warehouse, batch vs streaming, scalability).

[Image: Proof-by-phase checklist mapping each data engineering roadmap phase to a GitHub repo and a resume bullet a fresher can show recruiters.]

Pro tip: projects beat certificates. A GitHub repo with a clean README and a runnable pipeline outperforms a stack of certifications. Your top-of-funnel signal to recruiters is "here's the URL to my orders-batch-etl project" — not your transcript.

Streaming systems — Kafka in plain English

Kafka is a distributed message queue that lets producers publish events to a "topic" and consumers read them in order. Event-driven architectures use Kafka as the spine — payment events flow in, multiple downstream consumers (fraud detection, analytics, notifications) read the same stream independently. Required for advanced / senior DE roles; optional for fresher first jobs.

  • Producer — writes events to a Kafka topic.
  • Topic — a named append-only log; events stay in order.
  • Consumer — reads events from a topic; multiple consumers per topic are fine.
  • Partition — topics are split into partitions for parallelism.
  • Use case — live payment events flowing into a fraud-detection model.

Example input. A payment event payload that a producer wants to publish to the payments topic.

event = {
    "payment_id": "PAY-1001",
    "amount": 250.00,
    "currency": "USD",
    "user_id": "U42",
    "ts": "2026-05-08T10:15:00Z",
}

Question. Sketch the producer-side Python code that publishes this event to a Kafka topic called payments. Use kafka-python (the most popular client). Include just the producer setup + send call.

Code solution.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "payment_id": "PAY-1001",
    "amount": 250.00,
    "currency": "USD",
    "user_id": "U42",
    "ts": "2026-05-08T10:15:00Z",
}

producer.send("payments", event)
producer.flush()

Explanation of code. KafkaProducer(bootstrap_servers=[...]) connects to one or more Kafka brokers. The value_serializer lambda turns the Python dict into JSON bytes (Kafka stores raw bytes, not Python objects). producer.send("payments", event) queues the event for delivery to the payments topic; producer.flush() blocks until the queued messages are actually sent. Downstream consumers (fraud detection, analytics) can read this event independently and in order.

Step-by-step.

step action result
1 KafkaProducer(bootstrap_servers=[...]) TCP connection to broker established; metadata fetched
2 value_serializer = json.dumps(...).encode("utf-8") every send will convert dict → JSON bytes
3 producer.send("payments", event) record buffered in the producer's in-memory queue
4 the producer's partitioner picks a partition (hashed key or round-robin) record assigned to a partition log
5 producer.flush() blocks until all buffered records are acknowledged
6 subscribed consumers next poll() returns the new event

Output.

The event is appended to the payments topic. Any consumer subscribed to payments will receive it on its next poll() call:

{"payment_id": "PAY-1001", "amount": 250.0, "currency": "USD", "user_id": "U42", "ts": "2026-05-08T10:15:00Z"}

Rule of thumb: fresher first jobs rarely need Kafka. Master batch (Step 5) before opening Step 10. Mention Kafka in interviews only if you've actually shipped a project that uses it.
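
For completeness, the consumer side of the same topic is symmetrical. A minimal kafka-python sketch:

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "payments",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",  # on first run, read from the start of the topic
)

for event in consumer:
    # event.value is the parsed dict the producer published
    print(event.value["payment_id"], event.value["amount"])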

Five portfolio projects — what to build, in order

Projects matter more than certificates. Build all five and put them on GitHub with a clean README.md for each. The five build on each other — by the end you have a production-grade portfolio.

  • Project 1 — SQL analytics. E-commerce sales dashboard built entirely in SQL.
  • Project 2 — Python ETL. Extract API data → clean → store in PostgreSQL.
  • Project 3 — Airflow pipeline. Schedule the Python ETL as a daily DAG.
  • Project 4 — PySpark large-data pipeline. Process millions of rows.
  • Project 5 — Cloud project. Deploy the ETL pipeline on AWS.

Example input. Project 1 — an e-commerce dataset (orders, customers, products) for which you'll write the SQL behind a sales dashboard.

orders (sample):

order_id customer_id product_id order_date amount
1 C1 P1 2026-04-01 100
2 C2 P2 2026-04-15 200
3 C1 P1 2026-05-01 150

Question. For Project 1, write the canonical "monthly revenue per product" SQL. This is the single query the entire dashboard hangs off — get this right and the rest of the dashboard is just WHERE and ORDER BY variations.

Code solution.

SELECT
    p.product_name,
    DATE_TRUNC('month', o.order_date) AS month,
    SUM(o.amount) AS revenue
FROM orders o
JOIN products p ON p.product_id = o.product_id
WHERE o.order_date >= DATE '2026-01-01'
GROUP BY p.product_name, DATE_TRUNC('month', o.order_date)
ORDER BY month, p.product_name;

Explanation of code. DATE_TRUNC('month', o.order_date) collapses every order date to the first day of its month, so all April orders aggregate together. The JOIN brings in product_name from products so the dashboard can label rows. GROUP BY collapses to one row per (product, month). ORDER BY produces a chronologically-readable result. Wrap this in a saved view or a dbt model and the dashboard renders automatically.

Step-by-step.

| step | clause | result |
| --- | --- | --- |
| 1 | `FROM orders o` | scan all order rows |
| 2 | `JOIN products p ON p.product_id = o.product_id` | each order picks up its product_name |
| 3 | `WHERE o.order_date >= DATE '2026-01-01'` | drop pre-2026 rows |
| 4 | `DATE_TRUNC('month', o.order_date)` | every date snapped to month-start (e.g. 2026-04-15 → 2026-04-01) |
| 5 | `GROUP BY product_name, month` | bucket by (product, month) |
| 6 | `SUM(o.amount)` per bucket | revenue total per group |
| 7 | `ORDER BY month, product_name` | chronological, then alphabetical |

Output.

| product_name | month | revenue |
| --- | --- | --- |
| Book | 2026-04-01 | 100 |
| Headphones | 2026-04-01 | 200 |
| Book | 2026-05-01 | 150 |

(The 2026-04-15 order truncates to month 2026-04-01. Real output would have one row per (product, month) combination; the sample is too small for a strong rollup.)

Rule of thumb: every Project 1 SQL should be runnable on a free PostgreSQL sandbox with a 100-row sample dataset. Put both the SQL and the sample data in your GitHub repo so a recruiter can clone and run it in 60 seconds.

Git, GitHub, and the resume bullet

Git is non-optional infrastructure. Every team's workflow assumes you can clone a repo, branch off, commit, and push. The bare minimum command set fits on a single screen.

  • git clone <url> — copy a remote repo locally.
  • git checkout -b feature/x — create + switch to a new branch.
  • git add <file> — stage a change.
  • git commit -m "..." — record the staged changes.
  • git push origin <branch> — push the branch to GitHub; open a pull request.
  • git merge / git rebase — combine branches.

Example input. You've finished Project 1 (the SQL analytics dashboard) on your laptop and want to push it to GitHub under your account.

Question. Show the canonical six-command workflow: clone an empty template repo, branch off, add the files you've written, commit with a descriptive message, push the branch, and open a pull request.

Code solution.

```bash
git clone https://github.com/<you>/sql-sales-dashboard.git
cd sql-sales-dashboard
git checkout -b feature/initial-dashboard
# (write README.md, schema.sql, queries.sql, sample-data/)
git add README.md schema.sql queries.sql sample-data/
git commit -m "Add Project 1: SQL sales dashboard with sample data"
git push origin feature/initial-dashboard
# open a pull request on github.com
```

Explanation of code. clone brings the empty repo to your laptop. checkout -b creates a feature branch; never push to main directly on a team repo, and build the habit even on your own repos. After writing the project files, add stages them, commit records the change with a one-line message that future-you can scan, and push sends the branch to GitHub. The pull request is the artifact a recruiter or interviewer will actually look at.

Step-by-step.

| step | command | result |
| --- | --- | --- |
| 1 | `git clone <url>` | empty repo copied to laptop |
| 2 | `cd sql-sales-dashboard` | move into the working tree |
| 3 | `git checkout -b feature/initial-dashboard` | new branch created and checked out |
| 4 | write README.md, schema.sql, queries.sql, sample-data/ | working tree now has 4 untracked items |
| 5 | `git add ...` | files staged for commit |
| 6 | `git commit -m "..."` | snapshot recorded with descriptive message |
| 7 | `git push origin feature/initial-dashboard` | branch published to GitHub |
| 8 | open PR on github.com | reviewable artifact link a recruiter can click |

Output.

A GitHub repo URL with a feature branch and a pull request — both visible to anyone you share the link with. The README renders directly on the repo home page, becoming your portfolio artifact.

Rule of thumb: if you can't clone, branch, commit, and push within 60 seconds without looking commands up, Git is still on your to-do list. Practice it daily until it's muscle memory.

Common beginner mistakes

  • Trying to learn Kafka before mastering batch ETL — Kafka adds complexity without removing any.
  • Building one giant project instead of five small ones — recruiters skim; five clear repos beat one tangled one.
  • Pushing to main directly — every commit becomes part of history with no review trail.
  • No README.md per project — repos without READMEs are invisible.
  • Skipping interview prep — solid skills + zero practice = solid skills wasted at the screen.

Worked problem on picking Project 2 and writing the resume bullet

Example input. You've shipped Project 1 (SQL dashboard). Project 2 is the Python ETL — extract from an API, clean, store in PostgreSQL. The repo will be python-api-etl. The recruiter call is in two weeks.

Question. Sketch the four-file layout for the Project 2 repo plus the one-line resume bullet you'll lead with on the recruiter call. The goal: a stranger should be able to read the repo, run it locally, and understand the work in 5 minutes.

Solution. Use a four-file repo layout plus a metric-led resume bullet.

Code solution.

```text
python-api-etl/
├── README.md           # 60-second pitch + how to run
├── etl.py              # extract / transform / load functions
├── tests/
│   └── test_etl.py     # one test per function
└── requirements.txt    # pinned dependencies
```

Resume bullet (lead with the metric):

Built a Python ETL pipeline that ingests 10K daily API records into a PostgreSQL warehouse with row-level validation and CI-friendly exit codes. (github.com/<you>/python-api-etl)

Explanation of code. Four files is the floor — one for documentation, one for code, one for tests, one for dependencies. The README is what a recruiter sees first; lead with the what and how to run, then explain the why. The resume bullet leads with a quantitative metric (10K daily records) and ends with the GitHub URL — recruiters scan for both, in that order.

Output.

Your GitHub now has a runnable, documented ETL repo. The recruiter receives the link, clicks through, sees the README, and forwards your resume to the hiring manager. The bullet on the resume becomes the first sentence of the recruiter's pitch to the hiring manager.

Step-by-step trace of how a recruiter reads it:

| step | recruiter action | what they see |
| --- | --- | --- |
| 1 | clicks the GitHub link in the resume | repo home page with the README rendered |
| 2 | scans the first paragraph of README | "Daily API → PostgreSQL with validation" |
| 3 | scrolls to "How to run" | three commands (`git clone`, `pip install`, `python etl.py`) |
| 4 | clicks `etl.py` | sees three named functions; reads in 30 seconds |
| 5 | clicks `tests/` | tests exist; quality signal confirmed |
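
To make step 4 of that trace concrete, here is a minimal sketch of the three named functions a recruiter would skim. The API URL is hypothetical, and sqlite3 stands in for PostgreSQL so the sketch runs with only the standard library; the real repo would point load() at Postgres.

```python
# Minimal sketch of the three functions in etl.py. Names and the API URL are
# hypothetical, and sqlite3 stands in for PostgreSQL so the sketch runs on a
# bare laptop with only the standard library.
import json
import sqlite3
import urllib.request

API_URL = "https://example.com/api/records"  # hypothetical endpoint


def extract(url: str = API_URL) -> list:
    """Pull raw records from the API as a list of dicts."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())


def transform(records: list) -> list:
    """Row-level validation: keep rows with an id and a non-negative numeric amount."""
    clean = []
    for r in records:
        try:
            amount = float(r["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # drop malformed rows instead of crashing the run
        if r.get("id") and amount >= 0:
            clean.append((r["id"], amount))
    return clean


def load(rows: list, db_path: str = "warehouse.db") -> int:
    """Insert validated rows; return the count so the caller can set an exit code."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, amount REAL)")
        conn.executemany("INSERT OR REPLACE INTO records VALUES (?, ?)", rows)
    return len(rows)


if __name__ == "__main__":
    loaded = load(transform(extract()))
    print(f"loaded {loaded} rows")
    raise SystemExit(0 if loaded > 0 else 1)  # non-zero exit fails the CI job when nothing loads
```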

Why this works — concept by concept:

  • one repo per project — recruiters skim; five clean repos beat one tangled monorepo.
  • README-first design — the home page is the pitch; lead with what + how to run.
  • tests in tests/ — even one test per function is a quality signal.
  • pinned requirements.txt — anyone can clone and run; no "works on my machine" surprises.
  • metric-led resume bullet10K daily records is concrete; "ETL pipeline" alone is generic.
  • Cost — about a weekend of focused work for the project; 30 minutes for the bullet.

For fresher interview reps, see the SQL practice page, the Python practice page, and the canonical course path SQL for Data Engineering Interviews — From Zero to FAANG.



Tips to master the data engineering roadmap (best learning order + timeline)

Follow the order — and the calendar

The 13 steps above have a best learning order that works for most freshers — skip ahead at your own risk. The order plus a realistic timeline:

  • Order: SQL → Python → Databases → Pandas → ETL concepts → Data Warehousing → PySpark → Airflow → Cloud (AWS) → Kafka → Projects → Git → Interview prep.
  • 2–3 months — SQL + Python basics solid.
  • 4–6 months — intermediate DE (warehousing, ETL, modeling).
  • 6–9 months — job-ready (Airflow, cloud, projects shipped).
  • 9–12 months — strong fresher profile (Spark, streaming basics, polished portfolio).

Most freshers fail for the same four reasons — avoid them

The failure modes are predictable. Watch for these in your own routine:

  • Jumping to Spark too early. Spark is just bigger SQL with more failure modes; without solid SQL it's noise.
  • Ignoring SQL depth. Beyond SELECT and JOIN, the bar at the screen is window functions + grain reasoning. Drill them (see the sketch after this list).
  • Avoiding projects. Tutorials and certifications are signals; shipped code on GitHub is proof.
  • Watching tutorials without practice. Watch the video → close it → rebuild the example without it. If you can't, you didn't learn it.
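
Here is the sketch promised above: the pattern that clears most SQL screens, written against a hypothetical payments table.

```sql
-- The screen-level pattern: collapse a table to one row per entity.
-- Table and column names are hypothetical.
SELECT payment_id, user_id, amount
FROM (
    SELECT payment_id, user_id, amount,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
    FROM payments
) ranked
WHERE rn = 1;   -- rn = 1 keeps only the latest payment per user
```

The grain reasoning is the point: the inner query is one row per payment, the outer query is one row per user. Being able to say that out loud is what the screen actually tests.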

The winning formula

Every successful fresher career follows the same five-step loop: learn → practice → build → publish → interview. Pick a topic, drill it on a coding environment, build a small artifact, push to GitHub, then interview for jobs that touch that topic. Repeat for each step in the roadmap.

Books worth buying

  • Designing Data-Intensive Applications (Martin Kleppmann) — the modern systems book; read once a quarter.
  • The Data Warehouse Toolkit (Ralph Kimball & Margy Ross) — the canonical dimensional-modeling reference.

Where to practice on PipeCode

Start with the SQL practice page and the Python practice page; the structured paths are SQL for Data Engineering Interviews — From Zero to FAANG and Python for Data Engineering Interviews — Complete Fundamentals. After SQL and Python land, drill ETL practice, window functions, joins, and the deeper Data Modeling course and ETL System Design course. Then read the peer guides: the Airbnb DE interview guide, the top DE interview questions 2026, and the SQL data types Postgres guide.


Frequently Asked Questions

How long does it really take to become a data engineer in 2026?

If consistent, 6–9 months at 10–15 hours per week is enough to be job-ready for junior / fresher data-engineering roles; 9–12 months produces a strong fresher profile with Spark, streaming basics, and a polished portfolio. The 2–3 month mark is where SQL and Python basics click; 4–6 months gets you through warehousing, ETL, and modeling. The single biggest predictor of speed is consistency — 10 hours a week for 6 months beats 40 hours a week for 6 weeks.

Do I need to learn all 13 steps before applying for jobs?

No — start applying as soon as Steps 1–5 are solid (SQL, Python, databases, warehousing, ETL/ELT). Roles you can target with the first five steps done: junior data engineer, junior analytics engineer, data engineer intern, ETL developer trainee. Steps 6–9 (Spark, Airflow, Cloud, Modeling) turn "hireable" into "competitive." Steps 10–13 (Streaming, Projects, Git, Interview prep) close the deal. Apply earlier than you think you should — interviewing is itself a skill that needs reps.

Should I master one cloud or learn all three (AWS, Azure, GCP)?

Pick one first and master its core data services before touching the others. AWS is the most asked at fresher interviews and the most widely deployed in industry — start there. The core AWS services for fresher DE work: S3 (object storage), IAM (access control), Lambda (serverless functions), Glue (managed ETL), Redshift (warehouse). Once you have one cloud under your belt, the other two are easy because the concepts (object storage, IAM, serverless, managed ETL, warehouse) are the same — only the names change.
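
To make the S3 corner of that stack concrete, here is a minimal sketch using boto3, the AWS SDK for Python (a third-party package, not standard library). The bucket, key, and file names are hypothetical, and local AWS credentials are assumed.

```python
# Minimal sketch: push a local extract into S3 with boto3 (pip install boto3).
# Bucket, key, and file names are hypothetical; assumes credentials were set up
# locally via `aws configure`.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "daily_extract.csv",                    # local file produced by your ETL
    "my-etl-bucket",                        # hypothetical bucket name
    "raw/2026-05-08/daily_extract.csv",     # date-partitioned key, a common lake convention
)
```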

Is Apache Spark required for fresher data-engineering jobs?

For most fresher first jobs, no — but knowing what Spark is and when it appears is required. The honest fresher posture: "I've shipped batch ETL with Python and SQL; I know Spark is the next step when data outgrows a single machine; I've done the PySpark Fundamentals tutorial and would learn the rest on the job." That's enough for 80% of fresher screens. Roles at Spark-heavy shops (Databricks customers, ad-tech, large e-commerce) will test deeper — for those, ship a PySpark project as part of your Step 11 portfolio.

What does a data engineer actually do day-to-day?

Day-to-day, a data engineer writes SQL queries, builds and maintains batch pipelines, models new tables, fixes data quality issues, and reviews other engineers' pipelines. A typical week: Monday — investigate a Slack message about a wrong dashboard number (usually a grain or null-handling bug); Tuesday–Wednesday — model a new dimension table for a product launch; Thursday — code review on a teammate's Airflow DAG; Friday — add a quality check that would have caught Monday's bug. Spark, Kafka, and lakehouse architecture appear at scale-heavy companies; the day-to-day at most companies is SQL + modeling + pipelines.

What's the difference between a data engineer, data analyst, and data scientist?

Data engineers build the pipelines and tables; analysts query them for business questions; scientists run experiments and ML models on top. In a typical e-commerce team: a DE owns the daily ETL that loads cur_orders; an analyst writes the SQL behind the daily revenue dashboard; a scientist runs the A/B test that decides whether the new checkout flow ships. The roles overlap on SQL — every analytics person writes it — but only DEs own the infrastructure that produces the tables everyone else queries. Salaries also follow this stack — DEs are typically paid more than analysts and on par with scientists at most companies.

