Snowflake is the cloud-native data warehouse most modern data teams stand on — it stores petabytes, runs analytics in seconds, scales compute and storage independently, and ships features (Time Travel, zero-copy cloning, secure data sharing) that legacy MPP warehouses cannot match. For freshers preparing for data-engineering interviews, Snowflake is a high-leverage skill: the architecture is fundamentally different from Postgres or MySQL, and the same two or three concepts show up in every interview loop.
Think of this as a beginner-friendly Snowflake tutorial for data engineers — a first-principles walk through the Snowflake data warehouse from three-layer architecture to performance tuning. We start with "what is Snowflake database" in plain English, cover the killer "separation of compute and storage" idea, virtual warehouses, COPY INTO for loading, Time Travel and cloning for recovery / dev, micro-partitions and query pruning for performance, and how Snowflake compares to Redshift, BigQuery, Databricks, and Azure Synapse. Every section ships worked examples and a Snowflake interview-question-style problem with a full worked solution, in the same shape PipeCode practice problems use.
If you want hands-on reps after you read, explore practice →, drill SQL problems →, browse ETL practice →, or open ETL System Design for Data Engineering Interviews → for a structured path.
On this page
- Why Snowflake matters
- The three-layer architecture
- Separation of compute and storage
- Loading and querying data
- Time Travel and zero-copy cloning
- Performance optimization
- Snowflake vs Redshift vs BigQuery
- Choosing Snowflake (checklist)
- Frequently asked questions
- Practice on PipeCode
1. Why Snowflake matters
What is Snowflake database — cloud-native data warehousing for analytics at scale
So, what is Snowflake database in one sentence? Snowflake is a cloud-based Snowflake data warehouse platform used to store, process, and analyze enormous amounts of data — orders of magnitude beyond what a single Postgres or MySQL instance can handle. Companies use it for data warehousing, analytics, BI dashboards, Snowflake data sharing, ETL/ELT pipelines, and ML feature storage, and it runs as a managed service on AWS, GCP, and Azure — the same Snowflake SQL surface, same UI, same features regardless of which cloud you pick.
Pro tip: When an interviewer asks "why Snowflake?", lead with the workload, not the brand. Snowflake exists because OLTP databases (Postgres, MySQL) become slow once a single table crosses a few hundred million rows under heavy analytical reads. Snowflake separates the analytical workload from the transactional one and scales each part independently.
Data warehouse vs OLTP database — different shapes for different jobs
The warehouse invariant: OLTP databases (Postgres, MySQL) are optimised for high-frequency single-row reads and writes; data warehouses (Snowflake, Redshift, BigQuery) are optimised for low-frequency, very wide scans over billions of rows; using one for the other workload produces a system that is slow on both axes. The line between them is mostly about row-store vs columnar storage and transactional vs analytical query patterns.
- OLTP — row-store: Postgres / MySQL store rows contiguously; reading one row of 30 columns is one disk seek.
- OLAP — columnar: Snowflake / Redshift / BigQuery store columns contiguously; reading one column of 100 M rows is one sequential scan.
- Transactions: OLTP holds row-level locks for ACID writes; warehouses commit in batches.
- Concurrency model: warehouses scale by spinning up parallel compute clusters; OLTP scales vertically.
Worked example. Same 100 M-row orders table, two workloads:
| query | Postgres (OLTP) | Snowflake (warehouse) |
|---|---|---|
| INSERT INTO orders … VALUES (…) | ~1 ms | ~500 ms (batched) |
| SELECT category, SUM(amount) FROM orders GROUP BY 1 | 60 s scan | 2 s columnar scan |
| UPDATE orders SET status='shipped' WHERE order_id = 42 | ~1 ms | ~500 ms |
| 50 concurrent analyst dashboards | dies under load | each on its own warehouse |
Step-by-step.
- Postgres is great for the transactional inserts — single-row writes complete in milliseconds.
- Once an analyst runs GROUP BY category over 100 M rows, Postgres scans every row and blocks the OLTP workload.
- Snowflake stores amount and category as separate compressed column files; the same GROUP BY reads ~5% of the bytes Postgres reads.
- Snowflake also supports many parallel "virtual warehouses" so the 50 dashboards do not contend with the daily ETL.
- The right move is to keep transactional work in Postgres and ELT the data into Snowflake for analytics.
Worked-example solution. A typical split:
Application writes (Postgres: orders, users, payments)
    → daily ELT into Snowflake, every hour or every minute
    → analytics in Snowflake: BI dashboards, ML features, ad-hoc SQL
Rule of thumb: if a query joins many tables, scans many rows, and runs on a schedule for humans to read, it belongs in a warehouse — not in your OLTP database.
Multi-cloud as a feature, not a buzzword
The multi-cloud invariant: Snowflake runs the same control plane and SQL surface on AWS, GCP, and Azure; an account is bound to one cloud and one region, but secure data sharing crosses cloud boundaries and replication is built in. You pick the cloud that matches the rest of your stack; you do not get locked into the warehouse vendor's preferred cloud.
- Account region — one cloud (AWS / GCP / Azure) and one region (e.g. us-east-1).
- Cross-region replication — built-in; for HA and analytics close to consumers.
- Cross-cloud data sharing — shared databases work even when provider and consumer live on different clouds.
- Same SQL surface — CREATE WAREHOUSE, COPY INTO, Time Travel work identically across clouds.
Worked example. A SaaS company runs ingestion on GCP and BI on AWS:
| component | cloud | reason |
|---|---|---|
| product backend | GCP | existing team |
| ingestion → Snowflake | GCP-region Snowflake | same-cloud latency |
| BI dashboards | AWS-region Snowflake (read replica) | analyst tools on AWS |
Step-by-step.
- Ingestion writes raw events to a GCP Snowflake account; same-cloud egress is free / minimal.
- A Snowflake replication policy mirrors the curated schema to an AWS Snowflake account every 15 minutes.
- Analyst tools (Tableau, Looker, Mode) all live on AWS; queries hit the AWS account with low latency.
- Disaster recovery comes for free — either account can survive a single-cloud incident.
- The application teams pick whichever cloud suits them; the warehouse never becomes the point of contention.
Worked-example solution. Cross-cloud replication setup (concept):
-- on the primary (GCP) account
CREATE REPLICATION GROUP analytics_repl
OBJECT_TYPES = (DATABASES)
ALLOWED_DATABASES = ('PROD_DW')
ALLOWED_ACCOUNTS = ('aws_account_locator');
-- on the secondary (AWS) account
CREATE DATABASE PROD_DW
AS REPLICA OF gcp_account_locator.PROD_DW;
ALTER DATABASE PROD_DW REFRESH;
Rule of thumb: let the team that writes the data pick the cloud and let everyone else attach via shared databases or replicated copies.
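Secure data sharing is how everyone else attaches without copying data. A minimal sketch of a provider-side share and a consumer-side mount, with hypothetical account and object names:
-- on the provider account: expose one table through a share
CREATE SHARE orders_share;
GRANT USAGE ON DATABASE PROD_DW TO SHARE orders_share;
GRANT USAGE ON SCHEMA PROD_DW.public TO SHARE orders_share;
GRANT SELECT ON TABLE PROD_DW.public.fact_orders TO SHARE orders_share;
ALTER SHARE orders_share ADD ACCOUNTS = partner_org.partner_account;
-- on the consumer account: mount the share as a read-only database
CREATE DATABASE orders_shared FROM SHARE provider_org.provider_account.orders_share;
The data is never copied; the consumer pays only for the compute it uses to query the shared tables.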
Real-world use cases — where Snowflake earns its keep
The use-case invariant: Snowflake is the right tool when the workload is analytical, the data volume is large, and the user count is concurrent; it is the wrong tool for low-latency single-row reads, sub-second OLTP transactions, or kilobyte-scale lookup tables. Recognising the workload is half the interview answer.
- BI dashboards — Looker, Tableau, Mode, Power BI all read from Snowflake natively.
- Customer analytics — clickstream, retention cohorts, funnel analysis.
- ML feature stores — typed, time-partitioned features served to training and online inference.
- Financial reporting — NUMERIC(38,6) precision, ACID transactions, audit history via Time Travel.
- Secure data sharing — sell anonymised datasets to partners without ETL or file transfer.
Worked example. An e-commerce company's Snowflake schema:
| table | grain | source | consumer |
|---|---|---|---|
| fact_orders | one row per order line | Postgres CDC | BI, finance, ML |
| fact_clicks | one row per page view | Kafka → Kinesis | marketing |
| dim_customer | one row per customer (SCD2) | Postgres CDC | every fact |
| dim_product | one row per product | Postgres CDC | every fact |
Step-by-step.
- Orders, clicks, payments live transactionally in Postgres; events stream through Kafka.
- A CDC pipeline (Fivetran, Airbyte, custom Debezium) lands raw rows into Snowflake every few minutes.
- dbt models build star-schema fact / dimension tables from the raw layer.
- BI tools query the gold layer via SELECT … FROM dim_customer JOIN fact_orders ….
- The same fact_orders table feeds the daily revenue dashboard, the monthly investor report, and the ML feature pipeline — no copies, no drift.
Worked-example solution. A minimal fact_orders schema:
CREATE TABLE fact_orders (
order_id NUMBER(38,0) PRIMARY KEY,
customer_id NUMBER(38,0) NOT NULL,
product_id NUMBER(38,0) NOT NULL,
order_date DATE NOT NULL,
amount NUMBER(14,2) NOT NULL
)
CLUSTER BY (order_date);
Rule of thumb: "is this a dashboard, an ML feature, or a recurring report?" → Snowflake. "Is this a real-time write?" → Postgres / DynamoDB / Cassandra.
Common beginner mistakes
- Treating Snowflake as a faster Postgres — running single-row INSERTs in a loop is slow because every commit writes new files to object storage.
- Picking Snowflake when the dataset fits on one machine — a daily 1 GB CSV does not need a cloud warehouse; SQLite or DuckDB are cheaper and faster.
- Forgetting to suspend warehouses — every minute a warehouse runs is billed; idle warehouses are real money.
- Storing OLTP-shaped row-by-row data — Snowflake compresses columns; wide schemas with few rows are an anti-pattern.
- Skipping the architecture layer in interviews — "Snowflake is fast" is not an answer; "it separates compute and storage" is.
Snowflake Interview Question on Picking a Warehouse vs Database
A team is debating whether to put a 100 M-row monthly aggregate report on top of their OLTP Postgres database or load it into Snowflake first. The Postgres database also serves the live shopping cart. Lay out the decision criteria and propose an architecture that keeps both the cart and the report performant.
Solution Using Postgres for OLTP + Snowflake for OLAP via Daily ELT
Code solution.
-- Postgres holds the transactional truth
CREATE TABLE postgres.public.orders (
order_id BIGSERIAL PRIMARY KEY,
customer_id BIGINT NOT NULL,
placed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
amount NUMERIC(14,2)
);
-- A daily ELT job lands the same rows in Snowflake
COPY INTO snowflake.dw.fact_orders
FROM @postgres_stage/orders/dt=2026-05-11/
FILE_FORMAT = (TYPE = PARQUET);
-- The monthly report runs entirely in Snowflake
SELECT DATE_TRUNC('month', placed_at) AS month,
SUM(amount) AS revenue
FROM snowflake.dw.fact_orders
GROUP BY 1
ORDER BY 1;
Step-by-step trace.
| step | actor | action | outcome |
|---|---|---|---|
| 1 | shopping cart | inserts new order into Postgres | row committed in ms |
| 2 | nightly ELT | exports yesterday's orders to Parquet on S3 | one staged file per day |
| 3 | nightly ELT | runs COPY INTO Snowflake | rows added to fact_orders |
| 4 | analyst | runs monthly report on Snowflake | 2 s columnar scan |
| 5 | Postgres | only sees OLTP load | cart stays fast |
Output: the cart latency stays under 100 ms because Postgres only runs OLTP work; the monthly report returns in seconds because Snowflake scans the partitioned, compressed columnar copy; the two systems never block each other.
Why this works — concept by concept:
- Postgres for OLTP — row-store + indexes + ACID transactions; perfect for single-order writes.
- Snowflake for OLAP — columnar + massively parallel; perfect for full-table aggregations.
- Daily ELT — moves the analytical workload to the analytical engine; freshness is "yesterday's data" which is fine for monthly reports.
- COPY INTO — Snowflake's bulk loader; parallelises file ingestion across compute nodes.
- One source of truth — Postgres remains the system of record; Snowflake is a derived copy that can be rebuilt at any time.
- Cost — Postgres reads stay O(1) per cart op; Snowflake aggregation is O(N) on the columnar copy but runs in parallel and never touches Postgres.
Inline CTA: drill the ETL practice page for ingestion patterns and the SQL practice page for analytical SQL fluency.
2. The three-layer architecture
Storage, compute (virtual warehouses), and cloud services — decoupled by design
Snowflake's killer architectural decision is three independent layers that scale and pay for themselves separately: a storage layer that holds your data on cloud object storage forever, a compute layer of "virtual warehouses" that run queries in isolated clusters, and a cloud services layer that handles authentication, optimisation, metadata, and security. Legacy MPP warehouses (and older Redshift) couple compute and storage into a single cluster; Snowflake's split is what makes everything else (Time Travel, cloning, multi-tenant compute) possible.
Pro tip: When an interviewer asks "explain Snowflake's architecture," name the three layers in the order storage → compute → cloud services and immediately add "and the key idea is that compute and storage scale independently." That single sentence covers 70% of the architecture answer; the rest is detail.
Database storage layer — compressed columnar files on object storage
The storage invariant: Snowflake stores every table as a set of compressed, columnar micro-partitions (each holding roughly 50–500 MB of uncompressed data) on cloud object storage (S3 / GCS / ADLS); the database engine manages compression, encryption, metadata, and file organisation automatically — you write SQL, Snowflake handles the rest. There is no VACUUM, no manual partitioning, no index maintenance.
- Micro-partitions — files holding 50–500 MB of uncompressed data each, stored compressed and columnar; automatically sized.
- Columnar format — every column stored separately; analytical scans read only needed columns.
- Automatic compression — Snowflake picks the codec per column based on data distribution.
- Immutable files — updates write new files; old files retained for Time Travel.
- Per-column statistics — min/max/distinct count per micro-partition; powers query pruning.
Worked example. A 100 M-row orders table laid out internally:
| component | what Snowflake stores |
|---|---|
| orders.order_id | ~5,000 micro-partitions, sorted by order_id, RLE-compressed |
| orders.amount | same partitions, ZSTD-compressed |
| orders.placed_at | same partitions, dictionary-encoded dates |
| metadata | per-partition min/max/distinct for every column |
Step-by-step.
- You write INSERT INTO orders SELECT … FROM staging; Snowflake doesn't write rows — it writes columnar files.
- Each batch produces a handful of new micro-partitions (typical partition ≈ 16 MB compressed on disk).
- The cloud-services layer records per-column min/max in metadata for every new partition.
- A later WHERE placed_at = '2026-05-10' can skip ~99% of partitions using the date min/max — that's query pruning.
- The original files are never modified; an UPDATE writes new partitions and marks the old ones as expired (visible via Time Travel for the retention window).
Worked-example solution. A typical Snowflake CREATE TABLE with Snowflake data types (NUMBER(p,s), TIMESTAMP_TZ) and a CLUSTER BY for predictable partition layout:
CREATE TABLE fact_orders (
order_id NUMBER(38,0),
customer_id NUMBER(38,0),
placed_at TIMESTAMP_TZ,
amount NUMBER(14,2)
)
CLUSTER BY (placed_at);
-- micro-partitions now naturally co-locate by date,
-- making date-range queries skip more partitions
Rule of thumb: you do not manage storage; you do not run VACUUM. If a query is slow on a clustered table, the answer is usually change the cluster key, not "rewrite the table."
Compute layer — virtual warehouses run the queries
The compute invariant: a virtual warehouse is a named, sized, isolated MPP compute cluster that runs your SQL; warehouses can be created, resumed, suspended, and resized independently; many warehouses can read the same storage simultaneously without contention. The pricing model is straightforward — you pay per credit per second of warehouse uptime; suspending a warehouse stops the meter.
- Warehouse sizes — X-SMALL (1 node), SMALL (2), MEDIUM (4), … up to 6X-LARGE (512).
- Multi-cluster warehouses — auto-scale parallel clusters when concurrency grows.
- Auto-suspend / auto-resume — pause after N minutes idle; wake on demand.
- Per-team isolation — ETL on warehouse A, analysts on warehouse B; one cannot slow the other.
- Billing — per-second after a 60-second minimum.
Worked example. A team-isolated warehouse design:
| warehouse | size | who uses | typical workload |
|---|---|---|---|
| WH_ETL | MEDIUM | nightly pipeline | one heavy MERGE per night |
| WH_BI | SMALL | dashboard tools | hundreds of small concurrent queries |
| WH_ANALYSTS | LARGE | ad-hoc SQL | occasional 10 B-row scans |
| WH_ML | XLARGE | feature pipeline | scheduled hourly batches |
Step-by-step.
- Each team's queries route to their own warehouse — a misbehaving analyst query cannot block the BI dashboard.
- The ETL warehouse runs for ~45 min/night, then auto-suspends; you pay only for that window.
- The BI warehouse stays warm during business hours with multi-cluster auto-scaling so 200 concurrent dashboards never queue.
- The analyst warehouse spins up only when someone runs a big ad-hoc query.
- All four warehouses read and write the same underlying tables — there is one source of truth.
Worked-example solution. Create and size a warehouse:
CREATE WAREHOUSE WH_BI
WITH WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 60 -- pause after 60s idle
AUTO_RESUME = TRUE
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 4
SCALING_POLICY = 'STANDARD';
Rule of thumb: every team gets their own warehouse named after the team. Cost attribution and noise isolation come together that way.
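To see the cost-attribution half of that rule, the account-usage metering view breaks credits down per warehouse. A minimal sketch; the 30-day window and the $3/credit rate are assumptions, not defaults:
-- approximate per-team spend, one row per warehouse
SELECT warehouse_name,
       SUM(credits_used)     AS credits_last_30d,
       SUM(credits_used) * 3 AS approx_usd_last_30d   -- substitute your contracted rate
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_last_30d DESC;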
Cloud services layer — the brain that ties it all together
The services invariant: the cloud services layer handles authentication, query optimisation, metadata, transaction management, security, and access control; it is shared across all warehouses; you never interact with it directly but it powers every "Snowflake feels magic" experience. It is also what makes zero-copy cloning, secure data sharing, and Time Travel cheap.
- Authentication & RBAC — users, roles, grants; role-based access at every level.
- Query optimiser — column statistics + cost model produce the execution plan.
- Metadata store — micro-partition stats, transaction log, history.
- Result cache — recent query results returned without warehouse compute.
- Background services — re-clustering, materialised view maintenance.
Worked example. A query lifecycle:
| step | layer | action |
|---|---|---|
| 1 | cloud services | authenticate user, check role grants |
| 2 | cloud services | parse SQL, compile execution plan, fetch metadata |
| 3 | cloud services | check result cache — if hit, return immediately (no compute) |
| 4 | virtual warehouse | nodes fetch needed micro-partitions from storage |
| 5 | virtual warehouse | run scan / filter / aggregate in parallel |
| 6 | cloud services | gather results, cache, return to client |
Step-by-step.
- The client submits SQL with a session token; cloud services verifies the token and resolves grants.
- The optimiser uses table metadata (partition stats, clustering, statistics) to pick the cheapest plan.
- Result cache — if the same SQL on the same data was answered in the last 24 h, the result is returned instantly with no warehouse usage.
- On a miss, the active warehouse spins up nodes that fetch the needed columns from object storage.
- Compute aggregates and returns; the result is cached and the warehouse goes back to idle.
Worked-example solution. Result-cache demo:
ALTER SESSION SET USE_CACHED_RESULT = TRUE; -- default
SELECT COUNT(*) FROM fact_orders; -- first run: 4 s warehouse compute
SELECT COUNT(*) FROM fact_orders; -- second run: 60 ms cache hit
Rule of thumb: if a repeated query suddenly takes seconds again, suspect that someone modified the underlying table — cache invalidates on any change.
Common beginner mistakes
- Confusing virtual warehouses with databases — a warehouse is compute, a database is storage; both are needed.
- Sizing the warehouse for the peak instead of the average — bigger warehouses cost linearly more; right-size and use multi-cluster scaling.
- Leaving warehouses without auto-suspend — every idle minute is a real charge.
- Putting every team's queries on one warehouse — one slow query starves everyone.
- Forgetting the result cache exists — re-running benchmarks without flushing the cache reports unrealistic numbers.
Snowflake Interview Question on Designing a Multi-Team Warehouse Strategy
A 50-person data team complains that "Snowflake is slow at 9 AM." Everyone shares one XLARGE warehouse: ETL, analysts, BI, ML. Propose a multi-warehouse design that fixes the 9 AM contention without paying more in total credits.
Solution Using Per-Team Warehouses with Auto-Suspend and Multi-Cluster Scaling
Code solution.
-- ETL: heavy, scheduled, short bursts
CREATE WAREHOUSE WH_ETL
WITH WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
-- BI: many small queries, business-hours concurrency
CREATE WAREHOUSE WH_BI
WITH WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 60
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 4; -- multi-cluster for concurrency
-- Analysts: occasional big queries
CREATE WAREHOUSE WH_ANALYSTS
WITH WAREHOUSE_SIZE = 'LARGE'
AUTO_SUSPEND = 60;
-- ML: scheduled feature jobs
CREATE WAREHOUSE WH_ML
WITH WAREHOUSE_SIZE = 'XLARGE'
AUTO_SUSPEND = 60;
Step-by-step trace.
| step | observation |
|---|---|
| 1 | original shared XLARGE |
| 2 | split into 4 warehouses, each right-sized |
| 3 | AUTO_SUSPEND = 60 everywhere |
| 4 | WH_BI adds multi-cluster scaling |
| 5 | total credits/day drops |
| 6 | nobody waits behind another team's query |
Output: the 9 AM dashboard lag disappears (BI auto-scales horizontally), the nightly ETL stops fighting the analyst's ad-hoc queries, and the total credit bill drops because nothing is "always on" anymore.
Why this works — concept by concept:
- Per-team warehouses — each team's queries route to their own compute; nobody else can starve them.
- Right-sized warehouses — BI needs concurrency (multi-cluster small); analysts need vertical power (large); they are not the same shape.
- AUTO_SUSPEND = 60 — the silver bullet of Snowflake cost; warehouse billing stops 60 s after the last query.
- Multi-cluster scaling on WH_BI — additional clusters spin up when queue depth grows, then drop when it falls; no human tuning.
- Same storage, isolated compute — all four warehouses read identical tables; one source of truth.
- Cost — moves billing from "one big always-on warehouse" to "many right-sized warehouses billed only while running"; typical savings 30–60%.
Inline CTA: see ETL System Design for Data Engineering Interviews for end-to-end warehouse-shaping playbooks.
3. Separation of compute and storage
Independent scaling of warehouses and data — the single most important Snowflake idea
In a legacy warehouse (Teradata, classic Redshift), compute and storage are bolted together — you buy a "cluster" with both, and if you need more of either you have to buy both. Snowflake's defining decision is that compute (virtual warehouses) and storage (cloud object storage) scale independently: spin up a XLARGE warehouse for a one-hour backfill, then drop back to SMALL; add a petabyte of data without touching compute; never pay for capacity you are not using right now.
Pro tip: This is the single most-asked Snowflake interview question, in some form, across data-engineering loops. Memorise the one-line answer: "Compute lives in virtual warehouses that I size and suspend independently; storage lives once on object storage and every warehouse reads the same files."
How the scaling actually works
The scaling invariant: adding a node to a warehouse, resizing a warehouse, or creating a new warehouse never moves data; the new compute simply fetches the same micro-partitions from object storage. The implication is huge — you can resize compute in seconds (no rebalancing) and the data layer never blocks an operational change.
- Resize — ALTER WAREHOUSE WH_ETL SET WAREHOUSE_SIZE='LARGE' — takes seconds; no data motion.
- New warehouse — creates a new cluster pointing at the same storage; you can have N warehouses on the same data.
- Suspend — ALTER WAREHOUSE WH_ETL SUSPEND — stops the compute meter; data remains on object storage.
- Resume — ALTER WAREHOUSE WH_ETL RESUME — spins compute back up in seconds.
- Storage grows independently — adding 1 PB does not change warehouse sizing.
Worked example. A one-hour backfill at the end of the quarter:
| time | warehouse size | what's running |
|---|---|---|
| 00:00–08:00 | MEDIUM | normal nightly ETL |
| 08:00–09:00 | resized to XLARGE (4× the compute) | one-time quarterly backfill |
| 09:00 onwards | back to MEDIUM | resume normal work |
Step-by-step.
- The team has a 2 B-row backfill that would take roughly 4 hours on the regular MEDIUM warehouse.
- At 08:00 the operator runs ALTER WAREHOUSE WH_ETL SET WAREHOUSE_SIZE='XLARGE'.
- The warehouse scales from 4 to 16 nodes in seconds — no data moves.
- The backfill completes in ~1 hour because compute is 4× larger.
- At 09:00 the operator runs SET WAREHOUSE_SIZE='MEDIUM' and the credit bill goes back to the steady-state rate.
Worked-example solution. Temporary upsize for a backfill:
ALTER WAREHOUSE WH_ETL SET WAREHOUSE_SIZE = 'XLARGE';
-- run the backfill
INSERT INTO fact_orders_history
SELECT * FROM raw.orders WHERE order_date < '2026-01-01';
-- back to steady state
ALTER WAREHOUSE WH_ETL SET WAREHOUSE_SIZE = 'MEDIUM';
Rule of thumb: resize for the slow query, then resize back. Snowflake makes this a 5-minute operation, not a migration.
Why this breaks legacy assumptions
The legacy-comparison invariant: Teradata, classic Redshift, and on-prem MPP warehouses all couple compute and storage; adding storage means adding compute, and resizing compute requires a data rebalance. Snowflake's separation removes both constraints — the cost model and operational model are fundamentally different.
- Coupled (legacy) — pay for over-provisioned compute year-round to handle peak storage.
- Coupled — resize = rebalance = downtime.
- Decoupled (Snowflake) — pay for compute by the second, only when running.
- Decoupled — resize = seconds; new warehouse = seconds; no data motion.
Worked example. Same workload, two architectures:
| dimension | legacy MPP (Teradata / old Redshift) | Snowflake |
|---|---|---|
| add 1 TB of data | requires bigger cluster | no compute change needed |
| resize compute | hours of rebalance | seconds |
| dev/test workload | needs own cluster | new warehouse on same data |
| paying for peak | always | only during the peak |
Step-by-step.
- A legacy 8-node Teradata cluster sized for the year-end peak runs at 20% utilisation the other 51 weeks.
- The same workload on Snowflake uses a SMALL warehouse 51 weeks of the year, scales to XLARGE for one week, then back.
- Storage costs are roughly the same (both are object-store class).
- Compute costs drop ~70% because you only pay for the XLARGE during the week it is needed.
- New environments (dev, staging) are free — they are just new warehouse names pointing at the same storage (or a clone).
Worked-example solution. Dev environment using a zero-copy clone:
CREATE DATABASE PROD_DW_DEV CLONE PROD_DW;
-- new warehouse for the dev team
CREATE WAREHOUSE WH_DEV WITH WAREHOUSE_SIZE = 'XSMALL';
USE DATABASE PROD_DW_DEV;
USE WAREHOUSE WH_DEV;
Rule of thumb: if a Snowflake setup ever "feels like" a legacy MPP cluster — always-on, hard to resize, single-tenant — it is being run wrong.
Cost implications and credit economics
The credit invariant: storage cost is roughly constant (compressed columnar on cloud storage); compute cost is variable and dominated by how long warehouses run; suspending warehouses and right-sizing them is the single highest-leverage cost lever Snowflake gives you. The default account settings are not always cost-optimal; tuning them matters.
- Credits — Snowflake's compute currency; price varies by region and edition.
- Warehouse credits per hour — XS=1, S=2, M=4, L=8, XL=16 (doubles with each size step).
- AUTO_SUSPEND — defaults to 10 minutes; set it to 60 s for spiky workloads.
- Storage — flat $/TB/month; minor compared to compute for most workloads.
- Result cache — free; queries served from cache don't burn credits.
Worked example. Monthly bill comparison:
| design | warehouse | hours/month running | credits | cost |
|---|---|---|---|---|
| naive — always-on XL | XLARGE | 730 | 11,680 | $35,040 |
| auto-suspended XL | XLARGE | 80 | 1,280 | $3,840 |
| right-sized + auto-suspend | MEDIUM most, XL spike | 60 | 320 | $960 |
Step-by-step.
- Always-on XL: 730 hours × 16 credits/hr × $3/credit = $35 k/month.
- Same XL but with AUTO_SUSPEND = 60 s: only runs when queries are active; ~80 hours/month → $3,840.
- Right-sized — MEDIUM for the steady state, XL only for the weekly backfill: ~$960.
- The data is identical; only the compute schedule changes.
- Result cache further reduces this for repeated queries.
Worked-example solution. Cost-aware warehouse config:
CREATE WAREHOUSE WH_BI
WITH WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
INITIALLY_SUSPENDED = TRUE;
Rule of thumb: the first rule of Snowflake cost is suspend warehouses; the second rule is right-size warehouses; everything else is rounding error.
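A quick audit-and-fix pass for rule one; the warehouse name is illustrative:
SHOW WAREHOUSES;  -- the auto_suspend column shows each warehouse's idle timeout (NULL = never suspends)
ALTER WAREHOUSE WH_BI SET AUTO_SUSPEND = 60;   -- stop the meter 60 s after the last query
ALTER WAREHOUSE WH_BI SET AUTO_RESUME = TRUE;  -- wake automatically on the next query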
Common beginner mistakes
- Resizing a warehouse and waiting for "data to move" — it does not; the resize is metadata-only.
- Running XLARGE always-on for occasional queries — instead, pay for an XSMALL 23 hours a day and an XLARGE only for the one hour it is needed.
- Treating the result cache as a free pass for "fast" queries that are actually expensive on a cold cache.
- Ignoring AUTO_SUSPEND — the default of 10 minutes is wasteful for low-frequency workloads.
- Building a single shared warehouse for everyone — undoes the entire isolation benefit.
Snowflake Interview Question on Cost-Optimising a $50k Monthly Bill
The CFO points at a $50k/month Snowflake bill. Your single XLARGE warehouse has AUTO_SUSPEND = NULL (it never suspends). Average usage is 4 hours/day across two distinct workloads (BI in business hours, ETL at night). Cut the bill by at least 60% without losing performance.
Solution Using Workload Isolation + Auto-Suspend + Right-Sizing
Code solution.
-- Stop the always-on XL
ALTER WAREHOUSE WH_OLD SUSPEND;
DROP WAREHOUSE WH_OLD;
-- BI: business-hours, many small queries
CREATE WAREHOUSE WH_BI
WITH WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 60
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 3;
-- ETL: nightly, single big batch
CREATE WAREHOUSE WH_ETL
WITH WAREHOUSE_SIZE = 'LARGE'
AUTO_SUSPEND = 60;
Step-by-step trace.
| step | observation | monthly cost |
|---|---|---|
| 1 | original XLARGE always-on | 730 h × 16 credits × $3 ≈ $35k/month in compute alone |
| 2 | usage audit: 4 BI hours/day + 1 ETL hour/night | most of the bill was idle time |
| 3 | split into BI SMALL + ETL LARGE | both auto-suspend |
| 4 | BI: 4 h/day × 30 = 120 h × 2 credits = 240 credits | ~$720 |
| 5 | ETL: 1 h/night × 30 = 30 h × 8 credits = 240 credits | ~$720 |
| 6 | new total | $1,440 ≈ 97% reduction |
Output: monthly Snowflake spend drops from $50k to roughly $1.5k. BI users still see sub-second dashboards (multi-cluster scaling absorbs the morning spike). ETL still completes in its nightly window (LARGE is fast enough). Nothing breaks.
Why this works — concept by concept:
- Workload isolation — BI and ETL have different concurrency profiles; one warehouse cannot serve both well.
- AUTO_SUSPEND = 60 — the warehouse meter stops 60 s after the last query; idle time is no longer paid for.
- Right-sizing — BI gets SMALL with multi-cluster (concurrency); ETL gets LARGE (throughput). No need for XL.
- Same storage — no data motion; both warehouses read the same tables.
- Visible per-warehouse cost — separate warehouses surface per-team spend in WAREHOUSE_METERING_HISTORY.
- Cost — credit consumption is proportional to active query time, not wall-clock time.
Inline CTA: drill the ETL practice page for warehouse-sizing scenarios.
4. Loading and querying data
Stages, Snowflake COPY INTO, file formats, and the Snowflake SQL surface
The Snowflake COPY INTO command is the primary bulk-load mechanism — it reads files from a stage (an internal or external file location) and inserts them into a table in parallel. The file format is declared explicitly (CSV, JSON, Parquet, Avro, ORC). Once data is in, you query it with standard Snowflake SQL — SELECT / JOIN / GROUP BY look identical to the dialect you already know.
Pro tip: Interviewers love the COPY INTO question because it has clear right answers — file format, error handling, parallelism, and idempotency are all observable design choices. Practise saying "I stage the files, declare the format, and run COPY INTO with ON_ERROR = SKIP_FILE_AND_CONTINUE and a load-history check" in one sentence.
Stages — external and internal file locations
The stage invariant: a stage is a named file location Snowflake knows how to read from; internal stages live inside Snowflake (managed for you); external stages point at S3 / GCS / ADLS buckets you manage; both behave identically for COPY INTO. Stages are also reusable — one stage definition can be reused by many COPY INTO statements.
- Internal stage — @~/path (user), @%TABLE (table), @stage_name (named).
- External stage — points at s3://bucket/path/, gs://bucket/path/, azure://….
- Storage integration — security object that grants Snowflake permission to read the bucket.
- Listing — LIST @my_stage; shows the files visible in the stage.
Worked example. Define an external S3 stage:
| object | purpose |
|---|---|
| STORAGE INTEGRATION | IAM trust between Snowflake and AWS |
| FILE FORMAT | declares CSV / JSON / Parquet rules |
| EXTERNAL STAGE | named location pointing at the bucket |
| COPY INTO | the load command that uses the stage + format |
Step-by-step.
- Create a STORAGE INTEGRATION in Snowflake; this generates an IAM trust policy you paste into AWS.
- Create a FILE FORMAT describing the data — TYPE = PARQUET is the simplest; CSV needs more options.
- Create an EXTERNAL STAGE that combines the integration and the bucket path.
- List files in the stage to confirm permissions: LIST @prod_s3_stage.
- Run COPY INTO against the stage; Snowflake fetches files in parallel across warehouse nodes.
Worked-example solution. End-to-end stage setup:
CREATE STORAGE INTEGRATION s3_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/SnowflakeReadRole'
STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/snowflake/');
CREATE FILE FORMAT ff_parquet
TYPE = PARQUET;
CREATE STAGE prod_s3_stage
STORAGE_INTEGRATION = s3_int
URL = 's3://my-bucket/snowflake/'
FILE_FORMAT = ff_parquet;
Rule of thumb: one storage integration per AWS account, one stage per logical bucket path, one file format per file shape — keeps grants and schemas tidy.
COPY INTO — the bulk loader
The COPY invariant: COPY INTO table FROM @stage parallelises file ingestion across all nodes of the active warehouse; errors are handled by ON_ERROR policy; COPY INTO is idempotent on a per-file basis — re-running the command will skip files already loaded (tracked in LOAD_HISTORY). The same command works for any file format the stage's file format declared.
- Parallelism — file count × warehouse nodes; more files + bigger warehouse = faster load.
- ON_ERROR — CONTINUE (skip rows), SKIP_FILE_AND_CONTINUE, ABORT_STATEMENT (fail fast).
- PATTERN — regex filter on filenames; load only .parquet files, etc.
- Load history — the LOAD_HISTORY view records every file's commit; re-running skips loaded files.
- PURGE = TRUE — delete source files after a successful load.
Worked example. Daily Parquet drop into fact_orders:
| stage prefix | file | loaded? |
|---|---|---|
| s3://…/orders/dt=2026-05-10/ | part-0000.parquet | ✓ (yesterday's run) |
| s3://…/orders/dt=2026-05-11/ | part-0000.parquet | new |
| s3://…/orders/dt=2026-05-11/ | part-0001.parquet | new |
Step-by-step.
- The previous day's COPY INTO loaded dt=2026-05-10/part-0000.parquet; it appears in LOAD_HISTORY.
- Today's COPY INTO runs against the same stage path.
- Snowflake consults LOAD_HISTORY, sees dt=2026-05-10/part-0000.parquet was already loaded, and skips it.
- The two new files for dt=2026-05-11/ are loaded in parallel.
- Re-running tonight's command would skip every file because all three are now in LOAD_HISTORY — idempotency.
Worked-example solution. Daily idempotent load:
COPY INTO fact_orders
FROM @prod_s3_stage/orders/
FILE_FORMAT = (FORMAT_NAME = ff_parquet)
PATTERN = '.*[.]parquet'
ON_ERROR = 'SKIP_FILE_AND_CONTINUE';
-- inspect what loaded
SELECT * FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
TABLE_NAME => 'FACT_ORDERS',
START_TIME => DATEADD(hours, -1, CURRENT_TIMESTAMP)
));
Rule of thumb: always wrap COPY INTO with a post-load row-count assertion and an alert on LOAD_HISTORY errors — silent skips are how data drifts.
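One way to wire that check is to query the same COPY_HISTORY function after the load; any row this returns should fail the run. The one-hour window and table name carry over from the example above:
-- post-load check: surface any file that errored or only partially loaded
SELECT file_name, row_count, row_parsed, error_count, status
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
       TABLE_NAME => 'FACT_ORDERS',
       START_TIME => DATEADD(hours, -1, CURRENT_TIMESTAMP())))
WHERE error_count > 0
   OR status <> 'Loaded';
-- an empty result means the load is clean; any row here should alert the pipeline owner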
File formats — CSV vs JSON vs Parquet
The format invariant: Parquet and other columnar formats (ORC) load faster, compress better, and preserve types; CSV is the lowest-common-denominator and pays a real cost in load time and schema fidelity; JSON works with VARIANT columns and is fine for semi-structured payloads. Default to Parquet for anything you control.
- Parquet / ORC — columnar; preserves types; smallest on-disk size; fastest load.
- CSV — text; needs explicit schema + quote / escape rules; slowest.
- JSON — semi-structured; loads into a VARIANT column; query it with :key path notation.
- Avro — common in streaming; binary; schema embedded.
Worked example. Loading the same 1 M-row dataset:
| format | file size | COPY time on MEDIUM | notes |
|---|---|---|---|
| CSV (gzipped) | 220 MB | 90 s | requires explicit format definition |
| JSON (gzipped) | 280 MB | 70 s | landing into VARIANT column |
| Parquet (snappy) | 60 MB | 12 s | columnar; types preserved |
Step-by-step.
- CSV file is largest and slowest because every row is reparsed as text, type-coerced, and validated.
- JSON is similar but lands into a single VARIANT column — fast for sparse / nested data, awkward for analytics SQL.
- Parquet is columnar and binary; Snowflake reads only the columns it needs; load time drops ~7×.
- Storage costs follow the same ratio — Parquet files compress better.
- If your source format is a choice (ETL between systems you control), pick Parquet.
Worked-example solution. Three file-format definitions:
CREATE OR REPLACE FILE FORMAT ff_csv
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1
NULL_IF = ('','NULL','null')
FIELD_OPTIONALLY_ENCLOSED_BY = '"';
CREATE OR REPLACE FILE FORMAT ff_json
TYPE = JSON
STRIP_OUTER_ARRAY = TRUE;
CREATE OR REPLACE FILE FORMAT ff_parquet
TYPE = PARQUET;
Rule of thumb: between systems you control, Parquet. Across vendor boundaries you cannot change, CSV. For event streams, JSON or Avro.
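If JSON does land in Snowflake, the VARIANT path syntax mentioned above lets you query it without flattening first. A small sketch with hypothetical field names:
CREATE TABLE raw_events (payload VARIANT);
-- pull typed fields out of the semi-structured column
SELECT payload:event_type::STRING AS event_type,
       payload:user.id::NUMBER    AS user_id,
       payload:ts::TIMESTAMP_NTZ  AS event_ts
FROM raw_events
WHERE payload:event_type = 'page_view';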
Common beginner mistakes
- Loading raw CSVs without an explicit FILE FORMAT — fields with embedded commas break silently.
- Running COPY INTO without ON_ERROR — one bad row aborts the entire load.
- Forgetting LOAD_HISTORY and reloading files twice — your fact table doubles silently.
- Picking JSON for tabular data — wastes Snowflake's columnar strengths.
- Storing access keys in code instead of a STORAGE INTEGRATION — leaks the secret.
Snowflake Interview Question on a Daily S3 Drop That Sometimes Has Bad Rows
The team gets a daily 5 GB CSV drop from a partner into S3. About 0.1% of rows have malformed amounts. The current COPY INTO errors and the daily load fails. Design an ingestion pipeline that loads the good rows, captures the bad ones for review, and is idempotent on rerun.
Solution Using ON_ERROR = CONTINUE + a Rejected-Rows Table + LOAD_HISTORY Check
Code solution.
CREATE TABLE raw_orders (
order_id NUMBER,
customer_id NUMBER,
amount NUMBER(14,2),
placed_at TIMESTAMP_NTZ
);
-- VALIDATE() returns error metadata for rejected rows, not the table's own columns
CREATE TABLE rejected_orders (
  error           STRING,
  file            STRING,
  line            NUMBER,
  rejected_record STRING
);
-- main load: skip bad rows but keep going
COPY INTO raw_orders
FROM @prod_s3_stage/orders/
FILE_FORMAT = (FORMAT_NAME = ff_csv)
PATTERN = '.*[.]csv'
ON_ERROR = 'CONTINUE'
RETURN_FAILED_ONLY = FALSE;
-- capture the rejected rows for review
INSERT INTO rejected_orders
SELECT error, file, line, rejected_record
FROM TABLE(VALIDATE(raw_orders, JOB_ID => '_LAST'));
Step-by-step trace.
| step | action | result |
|---|---|---|
| 1 | partner drops orders_2026-05-11.csv | 5 M rows; ~5 k malformed |
| 2 | COPY INTO with ON_ERROR = CONTINUE | loads 4,995,000 good rows |
| 3 | LOAD_HISTORY shows file committed | won't reload on rerun |
| 4 | VALIDATE(... JOB_ID => '_LAST') returns 5 k bad rows | inserted into rejected_orders |
| 5 | partner sees rejection report, fixes upstream | next day cleaner |
Output: the daily load completes; good rows land in raw_orders; bad rows are captured in rejected_orders with their reasons for inspection; LOAD_HISTORY ensures the same file is never loaded twice on rerun.
Why this works — concept by concept:
- ON_ERROR = CONTINUE — partial loads succeed; one bad row doesn't kill the daily pipeline.
- VALIDATE(... JOB_ID => '_LAST') — captures rejected rows from the most recent load for forensic review.
- LOAD_HISTORY idempotency — reruns skip files already committed; safe to retry.
- Separate rejected_orders table — keeps the failure rate visible and reviewable, not silently lost.
- Pattern-based file selection — .*[.]csv ensures only the intended files load.
- Cost — load time is O(rows / warehouse size); the VALIDATE call is metadata-only.
Inline CTA: the canonical ingestion-design syllabus is in ETL System Design for Data Engineering Interviews.
5. Time Travel and zero-copy cloning
Recovery, audit, and instant dev environments
Snowflake's Time Travel lets you query a table as it existed at any point in the recent past (1 day by default, up to 90 days on Enterprise). Zero-copy cloning lets you create a new database, schema, or table that shares the same underlying micro-partitions as the source — no data is copied, the clone is free, and edits diverge from that moment onward. Snowflake Dynamic Tables layer on top of these primitives to give you declarative, automatically-refreshed materialised views for the ELT layer. Together, these features turn data recovery, dev-environment provisioning, and forensic debugging from multi-hour ordeals into single SQL statements.
Pro tip: When asked "what makes Snowflake different operationally?", the two-word answer is "Time Travel and cloning." Both are direct consequences of the immutable-micro-partition storage layer; legacy warehouses cannot offer them because their storage isn't shaped this way.
Time Travel — querying historical state
The Time-Travel invariant: for every table, Snowflake retains the micro-partitions that made up its state for a retention period (default 1 day on Standard, configurable up to 90 days on Enterprise); within that window, AT (TIMESTAMP => …) or BEFORE (STATEMENT => …) clauses return the table's historical state. The feature is the cheapest "we accidentally dropped a table" recovery on the market.
- AT (OFFSET => -3600) — table state 1 hour ago.
- AT (TIMESTAMP => '2026-05-11 14:00:00') — table state at that exact moment.
- BEFORE (STATEMENT => '<query_id>') — table state just before a specific query ran.
- DATA_RETENTION_TIME_IN_DAYS — settable at account, database, schema, or table level; default 1, max 90 (Enterprise).
- UNDROP TABLE / DATABASE — shortcut to restore a dropped object within retention.
Worked example. A junior accidentally truncates dim_customer:
| time | event | what Time Travel can do |
|---|---|---|
| 14:00:00 | dim_customer healthy | (normal) |
| 14:05:32 | TRUNCATE TABLE dim_customer | rows gone |
| 14:07:10 | data team notices | panic |
| 14:08:00 | run INSERT INTO dim_customer SELECT * FROM dim_customer AT (TIMESTAMP => '2026-05-11 14:05:00') | rows restored |
Step-by-step.
- The truncate runs; Snowflake marks the micro-partitions as expired but retains them for the retention window.
- The team finds the query in QUERY_HISTORY and notes its query_id.
- SELECT * FROM dim_customer BEFORE (STATEMENT => 'abc-123-def') returns the table as it existed just before the truncate.
- Wrapping the same query in an INSERT INTO dim_customer restores the data in seconds — no backup tape, no S3 restore.
- Future incidents within the retention window are recoverable the same way.
Worked-example solution. Full recovery script:
-- find the offending query
SELECT query_id, query_text, start_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE query_text ILIKE '%TRUNCATE%dim_customer%'
ORDER BY start_time DESC
LIMIT 1;
-- restore using BEFORE
INSERT INTO dim_customer
SELECT * FROM dim_customer BEFORE (STATEMENT => 'abc-123-def-456');
Rule of thumb: Time Travel saves your weekend the first time someone runs a destructive query in prod. Configure retention to match your "how long until someone notices" SLA.
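Short of a full restore, the OFFSET form is handy for quick before/after audits. A sketch against the same table:
-- compare the table now vs one hour ago
SELECT
  (SELECT COUNT(*) FROM dim_customer)                      AS rows_now,
  (SELECT COUNT(*) FROM dim_customer AT (OFFSET => -3600)) AS rows_one_hour_ago;
-- list rows that existed an hour ago but are gone now
SELECT * FROM dim_customer AT (OFFSET => -3600)
MINUS
SELECT * FROM dim_customer;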
Zero-copy cloning — instant dev / test environments
The cloning invariant: CREATE … CLONE produces a new object (database, schema, or table) that shares the source's underlying micro-partitions; until either side writes, the clone is free in storage; once a clone writes, only the diverged partitions cost extra. Cloning a 100 TB database for a feature branch takes seconds and costs near-zero.
- CREATE TABLE x_clone CLONE x — clone one table.
- CREATE SCHEMA s_dev CLONE s_prod — clone all tables in a schema.
- CREATE DATABASE db_dev CLONE db_prod — clone an entire database.
- Copy-on-write — clones diverge only on the rows that actually change.
- Clone-at-point-in-time — CLONE … AT (TIMESTAMP => …) combines cloning and Time Travel.
Worked example. Stand up a dev environment in 30 seconds:
| step | action | storage cost |
|---|---|---|
| 1 | prod DB_PROD has 100 TB of orders | $$$ |
| 2 | CREATE DATABASE DB_DEV CLONE DB_PROD | 0 (shared micro-partitions) |
| 3 | dev team runs INSERT and UPDATE over a few thousand rows | + a few MB diverged |
| 4 | prod queries still see prod state | isolated |
| 5 | dev queries see clone + diverged state | isolated |
Step-by-step.
- CREATE DATABASE DB_DEV CLONE DB_PROD returns in seconds.
- Snowflake records that every table in DB_DEV points at the same micro-partitions as DB_PROD.
- The dev team can run any DDL or DML against DB_DEV without touching prod data.
- Each write to DB_DEV writes new partitions; the unchanged partitions remain shared.
- When dev is done, DROP DATABASE DB_DEV removes only the diverged partitions; the shared ones stay with prod.
Worked-example solution. Feature-branch dev environment:
-- Snapshot prod at a clean moment
CREATE DATABASE DB_DEV_FEATURE_X
CLONE DB_PROD
AT (TIMESTAMP => CURRENT_TIMESTAMP());
GRANT USAGE ON DATABASE DB_DEV_FEATURE_X TO ROLE dev_role;
GRANT USAGE ON SCHEMA DB_DEV_FEATURE_X.public TO ROLE dev_role;
GRANT SELECT ON ALL TABLES IN SCHEMA DB_DEV_FEATURE_X.public TO ROLE dev_role;
Rule of thumb: every dev team gets their own clone. The cost of cloning is so low that "no shared dev environment" should be your default policy.
Retention windows and the cost of long Time Travel
The retention-cost invariant: retaining expired micro-partitions for the Time-Travel window costs storage, not compute; longer retention = more storage; the math is small for tables that change slowly, larger for high-churn tables. Pick the retention window per table based on (a) how long it takes to notice mistakes and (b) how much the table churns.
- Standard edition — max retention 1 day.
- Enterprise edition — max 90 days; default still 1.
- ALTER TABLE … SET DATA_RETENTION_TIME_IN_DAYS = N — per-table override.
- Fail-safe — additional 7-day retention beyond Time Travel; Snowflake-managed, not user-queryable.
- Cost driver — high-churn tables (frequent UPDATE / DELETE) accumulate many historical partitions.
Worked example. Retention cost per table:
| table | churn (rows/day) | retention (days) | extra storage |
|---|---|---|---|
| dim_customer | low (1 k changes) | 30 | tiny |
| fact_clicks (insert-only) | high (50 M rows/day) | 7 | ~350 M rows of retained history |
| dim_product (rarely changes) | almost zero | 90 | tiny |
| staging_* (volatile) | high | 1 | minimal |
Step-by-step.
- For low-churn dimensions, longer retention costs almost nothing — those tables rarely write new partitions.
- For high-churn facts, retention multiplies the storage cost proportionally.
- Staging tables don't need 7+ days — they're rebuilt daily; set retention to 1.
- Production facts that you'd want to recover from a logic bug deserve 7-14 days.
- Per-table tuning is more cost-effective than a single account-wide retention.
Worked-example solution. Right-sized retention:
ALTER TABLE dim_customer SET DATA_RETENTION_TIME_IN_DAYS = 30;
ALTER TABLE fact_clicks SET DATA_RETENTION_TIME_IN_DAYS = 7;
ALTER TABLE staging_orders SET DATA_RETENTION_TIME_IN_DAYS = 1;
Rule of thumb: the longer retention, the longer your safety net; the longer retention on high-churn tables, the higher the storage bill. Tune per table.
Common beginner mistakes
- Assuming Time Travel works forever — the default is 1 day; past that, you need Enterprise + per-table retention.
- Treating fail-safe as user-queryable — it is not; only Snowflake support can recover from fail-safe.
- Cloning to "back up" — clones share storage; if you DROP the source, the clone is unaffected, but a clone is not a true off-cluster backup.
- Forgetting that updates erode retention — a heavy UPDATE on a fact table can blow up storage costs if retention is long.
- Querying historical data with AT (OFFSET => -86400) when the table's retention is 0 — the query simply errors.
Snowflake Interview Question on Recovering a Mistakenly Dropped Production Table
A junior runs DROP TABLE dim_customer; in prod at 14:05:32 UTC. The team notices at 14:08:00. The account is on Enterprise edition; the dim_customer table has 90-day retention configured. Recover the table with zero data loss and zero downtime for downstream dashboards.
Solution Using UNDROP TABLE (and a fallback to CLONE … AT (TIMESTAMP => …))
Code solution.
-- fast path: UNDROP restores the table object and its data
UNDROP TABLE dim_customer;
-- or, if the table name has already been reused, clone the historical state
CREATE TABLE dim_customer_restored
CLONE dim_customer AT (TIMESTAMP => '2026-05-11 14:05:00'::TIMESTAMP);
-- verify row counts vs the source-of-truth replica
SELECT COUNT(*) FROM dim_customer;
Step-by-step trace.
| step | time | action | result |
|---|---|---|---|
| 1 | 14:05:32 | DROP TABLE dim_customer runs | table dropped; metadata moved to "dropped" |
| 2 | 14:08:00 | engineer notices | table still recoverable via Time Travel |
| 3 | 14:08:30 | UNDROP TABLE dim_customer | table restored with full data |
| 4 | 14:08:45 | SELECT COUNT(*) FROM dim_customer | matches the row count before the drop |
| 5 | 14:09:00 | downstream dashboards re-run | green |
Output: the table is back in place with every row intact; downstream queries that fired between 14:05:32 and 14:08:30 errored but those errors are transient and the next refresh succeeds; total recovery time ≈ 3 minutes.
Why this works — concept by concept:
- UNDROP TABLE — Snowflake's shortcut for restoring a dropped object within the retention window; one statement, instant.
- 90-day retention — available on Enterprise edition; absorbs the worst-case "we noticed a week later" recovery scenario.
- CLONE … AT (TIMESTAMP => …) — fallback if the table name was already reused; recreates the historical state as a new table.
- Zero data motion — the dropped table's micro-partitions never left storage; recovery is metadata-only.
- No external backup needed — Time Travel is the backup for the retention window.
- Cost — the restore is metadata-only; the retention storage cost was paid throughout the 90 days regardless of whether anyone used it.
Inline CTA: drill recovery and data-quality scenarios on the ETL practice page.
6. Performance optimization
Micro-partitions, query pruning, result caching, and clustering
Snowflake's performance story is built on three automatic layers — micro-partitions (the storage unit), query pruning (skip partitions whose stats prove they cannot match), and result caching (serve identical recent queries with no compute). On top of that, you can guide the optimiser with clustering keys for very large tables that need predictable partition layout. You will rarely tune indexes (there aren't any) — instead, you tune which partitions exist and which the planner can skip.
Pro tip: When a Snowflake query is slow, the right diagnostic is the query profile in the UI — look at partitions scanned vs total. If you're scanning 100% of partitions for a date-range query, the table either lacks a useful cluster key or the predicate isn't pruneable.
Micro-partitions and automatic clustering
The micro-partition invariant: Snowflake automatically chops every table into immutable compressed columnar files (each holding roughly 50–500 MB of uncompressed data), each carrying per-column min/max/distinct statistics; "clustering" is the optional act of giving Snowflake a hint about which column(s) should drive partition layout for predictable date-range or key-range pruning. Most tables don't need explicit clustering; the very large ones do.
- Automatic partitioning — every insert produces new partitions; no DDL needed.
- Statistics per partition — min/max/distinct for every column; powers pruning.
- CLUSTER BY (col) — a hint that Snowflake should keep partitions ordered on that column.
- Re-clustering — background service that recompacts partitions when clustering drifts.
- Pruning ratio — partitions scanned / total partitions; visible in the query profile.
Worked example. A fact_clicks table at 50 B rows:
| design | predicate WHERE click_date = '2026-05-10' | partitions scanned |
|---|---|---|
| no cluster key | natural append order | 100% (no pruning) |
| CLUSTER BY (click_date) | partitions sorted by date | 0.3% |
Step-by-step.
- Without clustering, Snowflake's per-partition min/max for click_date covers the whole date range — every partition might match.
- With CLUSTER BY (click_date), each partition's date range is tight; the planner can skip every partition outside the predicate.
- The query goes from a full table scan to a needle-in-a-haystack pull.
- Re-clustering runs in the background as new data arrives, keeping the layout tight.
- The cluster key should match the most common range predicate, not every predicate.
Worked-example solution. Cluster a high-volume fact:
ALTER TABLE fact_clicks CLUSTER BY (click_date, customer_id);
-- check clustering depth (1.0 = perfectly clustered, higher = worse)
SELECT SYSTEM$CLUSTERING_INFORMATION('fact_clicks', '(click_date)');
Rule of thumb: most tables (< 1 TB) don't need explicit clustering. The very large date-partitioned facts do, and the cluster key is almost always the date column.
Query pruning — skip partitions whose stats prove they cannot match
The pruning invariant: the optimiser uses per-partition column statistics to prove that some partitions cannot contain rows matching the WHERE predicate; those partitions are skipped — never read from storage — and the query reads only the relevant subset. Pruning is automatic and invisible until you check the query profile.
- Date predicates — WHERE date_col BETWEEN x AND y prunes by partition date range.
- Equality predicates — WHERE col = x prunes by per-partition min/max.
- IN (...) lists — pruned value by value.
- Function-wrapped columns — WHERE DATE(ts) = … may not prune; raw column comparisons do.
- Profile shows Partitions scanned: 12 / 4,567 — that ratio is the pruning signal.
Worked example. Same fact_clicks query, different predicates:
| predicate | partitions scanned |
|---|---|
| WHERE click_date = '2026-05-10' | 12 / 4,567 (0.3%) |
| WHERE customer_id = 4242 | 4,567 / 4,567 (no pruning unless clustered) |
| WHERE DATE(click_ts) = '2026-05-10' | 4,567 / 4,567 (function disables pruning) |
Step-by-step.
- Date predicate matches the cluster key; pruning is excellent.
- Customer-id predicate cannot prune because customer_ids are scattered across all partitions.
- Wrapping the date column in DATE(…) disables pruning because Snowflake cannot use min/max on the computed value.
- The query profile makes this visible — "Partitions scanned: X / Y" is the first line to read.
- The fix for the function-wrapped predicate is to compare the raw column: WHERE click_ts >= '2026-05-10' AND click_ts < '2026-05-11'.
Worked-example solution. Pruning-friendly date filter:
-- prunes
SELECT * FROM fact_clicks
WHERE click_ts >= '2026-05-10'
AND click_ts < '2026-05-11';
-- does NOT prune
SELECT * FROM fact_clicks
WHERE DATE(click_ts) = '2026-05-10';
Rule of thumb: keep predicates on the raw clustered column. Anything that wraps the column in a function disables the planner's ability to use partition statistics.
Result caching — free wins for repeated queries
The cache invariant: the cloud-services layer remembers the result of every query for 24 hours; an identical query against unchanged tables returns the cached result instantly, with zero warehouse compute. The cache is account-wide — different users running the same SQL share the same cached result.
- Cache lifetime — 24 hours of inactivity; extends with re-use up to 31 days.
- Cache key — exact SQL text + same underlying data state.
- Invalidation — any change to a referenced table, or any non-deterministic function in the query (e.g. CURRENT_TIMESTAMP).
- No warehouse needed — the cache responds even when the warehouse is suspended.
- Cache misses — the warehouse runs the query and the result is cached for the next user.
Worked example. Three runs of the same dashboard query:
| run | warehouse compute | latency | credit cost |
|---|---|---|---|
| 1 | warehouse runs | 2,400 ms | small |
| 2 (cache hit) | none | 80 ms | 0 |
| 3 (after data change) | warehouse runs | 2,400 ms | small |
Step-by-step.
- First analyst clicks the dashboard tile; the warehouse runs the query; result cached.
- Second analyst clicks the same tile minutes later; cache hit; warehouse is suspended the entire time.
- ETL adds new rows to the underlying table; cache invalidates.
- Third analyst clicks; cache miss; warehouse spins back up to recompute.
- The pattern dominates BI workloads — most dashboard refreshes hit the cache because the data only changes once a day.
Worked-example solution. Verify cache behaviour:
SHOW PARAMETERS LIKE 'USE_CACHED_RESULT'; -- TRUE by default
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
SELECT COUNT(*) FROM fact_orders; -- compute
SELECT COUNT(*) FROM fact_orders; -- cache hit (~80 ms)
Rule of thumb: never benchmark Snowflake without disabling result cache for the test. Production benefits hugely from the cache; benchmarks lie when you don't account for it.
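One way to apply that rule — a small benchmarking sketch that switches the result cache off for the session, runs the timings, and restores the default:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;  -- every run below hits the warehouse
SELECT COUNT(*) FROM fact_orders;             -- timed run 1
SELECT COUNT(*) FROM fact_orders;             -- timed run 2 (no result-cache hit; the warehouse's
                                              -- local data cache may still warm it — suspend and
                                              -- resume the warehouse for a fully cold run)
ALTER SESSION SET USE_CACHED_RESULT = TRUE;   -- restore the default once the benchmark is done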
Common beginner mistakes
- Trying to create B-tree indexes — Snowflake has none; tuning happens via clustering and partition pruning.
- Clustering small tables — overhead exceeds benefit until you cross ~1 TB.
- Wrapping clustered columns in functions in WHERE — disables pruning silently.
- Believing the result cache is "always on" — any underlying-data change invalidates it.
- Benchmarking with cache enabled — produces misleadingly low numbers; disable cache for honest tests.
Snowflake Interview Question on Speeding Up a 60-Second Daily Report
The daily revenue report on a 5 B-row fact_orders table takes 60 seconds. The query is SELECT customer_id, SUM(amount) FROM fact_orders WHERE order_date = CURRENT_DATE GROUP BY 1. Get it under 5 seconds without buying a bigger warehouse.
Solution Using Clustering on order_date + a Raw-Column Predicate + Materialised View
Code solution.
-- 1. cluster the table by the most common range predicate
ALTER TABLE fact_orders CLUSTER BY (order_date);
-- 2. rewrite the predicate so it can prune
SELECT customer_id, SUM(amount)
FROM fact_orders
WHERE order_date = CURRENT_DATE -- raw column, prunes
GROUP BY customer_id;
-- 3. for repeated daily access, create a materialised view
CREATE MATERIALIZED VIEW mv_daily_customer_revenue AS
SELECT order_date, customer_id, SUM(amount) AS revenue
FROM fact_orders
GROUP BY 1, 2;
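One possible shape for the daily read once the view exists (note that materialised views require Enterprise Edition or above; the column names come from the definition above):
-- the daily report now reads a small pre-aggregated slice instead of the 5 B-row fact
SELECT customer_id, revenue
FROM   mv_daily_customer_revenue
WHERE  order_date = CURRENT_DATE
ORDER  BY revenue DESC;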
Step-by-step trace.
| step | action | result |
|---|---|---|
| 1 | baseline scan of entire 5 B rows | 60 s |
| 2 |
CLUSTER BY (order_date) (background re-cluster runs) |
tighter date min/max per partition |
| 3 | rerun query — pruning kicks in | 4 s |
| 4 | result cache hit on repeated daily reads | < 100 ms |
| 5 | optional MV for sub-second pre-aggregated rollup | < 50 ms |
Output: the daily report goes from 60 s to 4 s on the first run of the day (clustered scan), then milliseconds for repeat hits (result cache). The materialised view turns even cold runs into milliseconds for the pre-aggregated rollup.
Why this works — concept by concept:
- CLUSTER BY (order_date) — partitions co-locate by date, so the predicate prunes ~99.5% of them.
- Raw-column predicate — WHERE order_date = CURRENT_DATE is pruneable; wrapping the column in a function would not be.
- Result cache for repeats — second and subsequent analysts to view the dashboard get sub-100 ms without warehouse compute.
- Materialised view — pre-aggregates the rollup; daily query becomes a tiny aggregate read.
- No bigger warehouse needed — performance gains come from scanning less data, not throwing more compute at the same scan.
- Cost — clustering adds background re-cluster cost; the MV adds maintenance cost; both are far smaller than running an XLARGE warehouse for 60 s × N analysts.
Inline CTA: sharpen pruning-aware SQL on the SQL practice page and the aggregation topic.
7. Snowflake vs Redshift vs BigQuery
How the big warehouses differ — and how to pick one
Three cloud data warehouses dominate modern data-engineering interviews: Snowflake, Amazon Redshift, and Google BigQuery. Snowflake vs Databricks is the next-most-asked comparison, and Azure shops often add Synapse to the shortlist — so be ready for any pairing in the wider Snowflake / Databricks / BigQuery / Synapse decision. All four serve analytical workloads at scale; they differ in operational model, pricing, cloud lock-in, and how they pair with transformation tools like dbt (the dbt Snowflake integration is the de-facto modeling layer for most Snowflake teams). Knowing the trade-offs in one or two sentences each is enough to handle the "why this one?" interview follow-up.
Pro tip: Never say "X is best." Always frame the answer as which tool fits which workload. Interviewers test whether you understand the trade-offs, not whether you can pick a winner.
Snowflake vs Redshift — compute/storage coupling and cloud lock-in
The Redshift comparison invariant: classic Redshift (provisioned) tightly couples compute and storage; modern Redshift Serverless decouples them and looks more like Snowflake; Snowflake is multi-cloud while Redshift is AWS-only; Snowflake's ease-of-use is consistently rated higher but Redshift can be cheaper at steady-state on AWS. Pick Redshift if you are deep in AWS and want the cheapest steady-state bill; pick Snowflake if you need multi-cloud, easier ops, or per-team isolation.
- Cloud support — Snowflake: AWS / GCP / Azure. Redshift: AWS only.
- Compute/storage — Snowflake: fully separated. Redshift: provisioned = coupled; Serverless = separated.
- Maintenance — Snowflake: near-zero. Redshift: some tuning (VACUUM, ANALYZE, distkey, sortkey).
- Scaling — Snowflake: easier; resize in seconds. Redshift: provisioned resize involves data redistribution.
- Semi-structured — Snowflake: excellent native VARIANT. Redshift: good SUPER type.
Worked example. Same workload on both:
| dimension | Snowflake | Redshift (provisioned) |
|---|---|---|
| spin up a warehouse | 5 s | minutes |
| resize compute | seconds, no data motion | minutes, data redistribution |
| add a TB of data | no compute change | may need to resize cluster |
| 10 concurrent dashboards | multi-cluster auto-scales | needs the Concurrency Scaling add-on |
| credit billing | per-second | per-hour (provisioned) / per-second (Serverless) |
Step-by-step.
- Both can serve the same analytical SQL workload at scale.
- Snowflake's operational ergonomics are simpler — no VACUUM, no manual sort/dist keys, easier resize.
- Redshift on AWS is often cheaper at steady-state because AWS sells it at a discount within its own ecosystem.
- Multi-cloud needs (data in GCP + analytics in AWS) lean strongly Snowflake.
- The choice is rarely about features — it is about who manages the warehouse and how aggressive the cost target is.
Worked-example solution. Quick comparison line for an interview:
Snowflake : multi-cloud, separated compute/storage, near-zero maintenance,
per-second billing, excellent VARIANT.
Redshift : AWS-only, provisioned = coupled / Serverless = separated,
some tuning required, cheaper at steady-state on AWS.
Rule of thumb: deep AWS shop with steady utilisation → Redshift. Multi-cloud, spiky workload, small ops team → Snowflake.
Snowflake vs BigQuery — warehouse compute vs serverless
The BigQuery comparison invariant: BigQuery is serverless — no warehouses, just a query that scans bytes and is billed per byte scanned; Snowflake bills per second of warehouse uptime; BigQuery is GCP-only; both have excellent semi-structured support; the cost model fundamentally differs. Pick BigQuery if you're on GCP and want zero compute management; pick Snowflake if you want multi-cloud or predictable monthly compute spend.
- Cloud — Snowflake: AWS / GCP / Azure. BigQuery: GCP only.
- Compute model — Snowflake: virtual warehouses. BigQuery: serverless slots.
- Pricing — Snowflake: warehouse credits per second. BigQuery: per TB scanned (on-demand) or flat-rate slots.
- Concurrency — Snowflake: explicit warehouse choice. BigQuery: implicit; slots dynamically assigned.
- Cost predictability — Snowflake: more predictable (you choose the warehouse). BigQuery: depends entirely on query patterns.
Worked example. Cost shape for one query:
| query | Snowflake (SMALL warehouse) | BigQuery (on-demand) |
|---|---|---|
| 1 TB scan | billed for warehouse uptime — a SMALL runs at 2 credits/hr (roughly $2–4 per credit, by edition) | 1 TB × $5/TB = $5 |
| 100 different queries over the same data | warehouse keeps running — cost scales with uptime, not bytes | up to 100 TB × $5 = $500 |
| identical query repeated (cache hit) | free — result cache, no warehouse compute | free for 24 h — BigQuery query cache |
Step-by-step.
- For one-off queries, costs are similar.
- For repeated identical queries, both have result caches that make subsequent runs free.
- For repeated different queries on the same data, BigQuery scales with bytes scanned per query; Snowflake scales with warehouse uptime.
- Heavy ad-hoc exploration may be cheaper on Snowflake (one warehouse, many queries) than BigQuery (each query bills bytes).
- Heavy variable workloads with idle gaps may be cheaper on BigQuery (no warehouse to suspend).
Worked-example solution. Quick interview-shape comparison:
Snowflake : warehouses, per-second compute billing, multi-cloud,
predictable cost if warehouses are right-sized.
BigQuery : serverless, per-byte-scanned billing, GCP-only,
cost scales with bytes per query — partition / cluster
tables aggressively to keep bytes small.
Rule of thumb: GCP shop with many small queries → BigQuery. Multi-cloud or heavy ad-hoc analyst workload → Snowflake.
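For the BigQuery side of that advice ("partition / cluster tables aggressively to keep bytes small"), a minimal sketch in BigQuery's standard SQL — the dataset and table names are illustrative:
-- BigQuery: partition by day and cluster by customer so on-demand queries scan fewer bytes
CREATE TABLE analytics.fact_clicks
PARTITION BY DATE(click_ts)
CLUSTER BY customer_id AS
SELECT * FROM analytics.fact_clicks_raw;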
When to pick what — the one-paragraph decision
The selection invariant: pick the warehouse that matches (a) the cloud your data already lives in, (b) the workload shape (steady vs spiky, dashboards vs ad-hoc), and (c) the size of your data-engineering team; do not over-rotate on features — all three serve the analytical workload well. The bigger questions are operational and economic, not technical.
- Cloud-first — match the warehouse to where the data lives (cross-cloud egress is real money).
- Workload-first — bursty / spiky → Snowflake or BigQuery on-demand; steady → Redshift Serverless or BigQuery flat-rate.
- Team-size-first — small team needs near-zero maintenance → Snowflake or BigQuery; bigger team can afford Redshift tuning.
- Cost-first — steady AWS workload → Redshift; multi-cloud or bursty → Snowflake.
Worked example. Decision matrix for three scenarios:
| scenario | best fit | reason |
|---|---|---|
| 5 TB, AWS-only, steady analyst workload, small team | Snowflake or Redshift Serverless | both work; Snowflake is easier |
| 500 GB, GCP-only, dashboards | BigQuery | native fit; no warehouse to size |
| 50 TB, multi-cloud, weekly backfills | Snowflake | only one that's multi-cloud + handles spikes |
Step-by-step.
- Start with cloud — if you're locked to AWS or GCP, you've narrowed the options.
- Then workload shape — steady-state vs bursty changes whether warehouse-based billing or per-byte billing is cheaper.
- Then team size — smaller teams pay for managed-service simplicity in implicit hours saved.
- Cost is the last check — all three are within 2× of each other for most workloads.
- The "right" answer is rarely a technical one; it's the one your team can operate without burning out.
Worked-example solution. Decision flowchart in text:
Q: which cloud is the source data on?
AWS only → Redshift or Snowflake (Snowflake if multi-cloud likely)
GCP only → BigQuery or Snowflake
Azure only → Snowflake (BigQuery is GCP-only)
Multi-cloud → Snowflake
Q: workload shape?
Steady, predictable → flat-rate (Redshift Serverless / BigQuery flat-rate)
Bursty, mostly idle → on-demand (Snowflake auto-suspend / BigQuery on-demand)
Q: team size + ops appetite?
Small + want easy → Snowflake or BigQuery
Big + want control → Redshift provisioned
Rule of thumb: the right answer is the one your team can operate at 3 AM without paging an expert.
Common beginner mistakes
- Declaring one warehouse "best" — the correct answer is always conditional on the workload.
- Comparing on-demand BigQuery to provisioned Redshift — different cost models entirely.
- Forgetting cross-cloud egress charges when picking a warehouse on a different cloud than your source.
- Overestimating Snowflake's premium over Redshift — at steady state, the gap is often smaller than the operational savings.
- Underestimating ease-of-use — engineering hours saved by a managed warehouse are real money.
Snowflake Interview Question on Choosing a Warehouse for a Specific Scenario
You're advising a startup: their product runs on AWS, they have ~10 TB of analytical data growing 1 TB/month, three full-time analysts, and a small data-engineering team. They want sub-second BI dashboards and a daily ETL. Budget is "reasonable, not unlimited." Pick one warehouse and defend the choice in two paragraphs.
Solution Using Snowflake on AWS with Per-Team Warehouses + Auto-Suspend
Code solution.
-- ETL warehouse
CREATE WAREHOUSE WH_ETL
WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
-- BI warehouse with multi-cluster for concurrency
CREATE WAREHOUSE WH_BI
WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE
MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 3;
-- analyst warehouse for ad-hoc
CREATE WAREHOUSE WH_ANALYSTS
WITH WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
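The budget constraint ("reasonable, not unlimited") can optionally be enforced with a resource monitor; the quota below is illustrative, and creating monitors requires the ACCOUNTADMIN role:
-- optional guard-rail: monthly credit cap shared by the three warehouses (quota illustrative)
CREATE RESOURCE MONITOR RM_MONTHLY
  WITH CREDIT_QUOTA = 200
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80  PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE WH_ETL      SET RESOURCE_MONITOR = RM_MONTHLY;
ALTER WAREHOUSE WH_BI       SET RESOURCE_MONITOR = RM_MONTHLY;
ALTER WAREHOUSE WH_ANALYSTS SET RESOURCE_MONITOR = RM_MONTHLY;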
Step-by-step trace of the decision:
| consideration | answer |
|---|---|
| source-data cloud | AWS — both Snowflake and Redshift fit |
| workload | mixed: nightly ETL (bursty) + BI (concurrent) + analyst (spiky) |
| team size | small data-eng team |
| size today | 10 TB, growing 1 TB/month |
| operational burden tolerance | low |
| result | Snowflake on AWS |
Output: the startup gets sub-second BI (result cache + multi-cluster WH_BI), idle warehouses auto-suspend (cost), the data-engineering team doesn't spend Friday afternoons on VACUUM and distkey tuning (no Redshift-style maintenance), and a future move off AWS doesn't force a warehouse migration.
Why this works — concept by concept:
- Snowflake on AWS — keeps the data on the same cloud as the product; same-cloud egress is minimal.
- Three warehouses — isolates ETL from BI from analysts; one slow query never blocks another.
- AUTO_SUSPEND = 60 (seconds) — idle compute is not billed; the warehouses run only when there's work.
- Multi-cluster WH_BI — handles the 9 AM dashboard concurrency spike without a bigger warehouse.
- Near-zero maintenance — no VACUUM, no distkey, no manual partition tuning; a small team can run it.
- Cost — proportional to actual usage; the always-on cost of a Redshift provisioned cluster is the worst fit for a startup's spiky workload.
Inline CTA: see the full ETL-and-warehouse playbook in ETL System Design for Data Engineering Interviews.
8. Choosing Snowflake (checklist)
| If your workload looks like… | Snowflake is a good fit because… | Watch out for… |
|---|---|---|
| Analytical SQL over 10 GB–10 PB | Columnar storage + parallel compute | Tiny datasets are cheaper on DuckDB / SQLite |
| Multi-team concurrent dashboards | Per-team warehouses + multi-cluster | Forgetting AUTO_SUSPEND |
| Multi-cloud or cloud-agnostic | Runs on AWS, GCP, Azure | Cross-region egress costs |
| Spiky workloads with idle gaps | Per-second billing + auto-suspend | Always-on warehouses are wasteful |
| Need Time Travel + dev clones | Built-in, zero-cost cloning | Long retention on high-churn tables is expensive |
| Semi-structured JSON / Parquet | First-class VARIANT type + COPY INTO JSON | Storing tabular data as JSON wastes the columnar benefits |
Pro tip: When you propose Snowflake in a system-design round, immediately name the three layers and the per-team warehouse split. Those two sentences turn a generic answer into one that signals you've actually run Snowflake in production.
9. Frequently asked questions
What is Snowflake?
Snowflake is a cloud-native data warehouse / data platform built on three independent layers — storage, compute (virtual warehouses), and cloud services. It runs as a managed service on AWS, GCP, and Azure, and is optimised for analytical SQL workloads over very large datasets.
What is a virtual warehouse?
A virtual warehouse is a named, sized, isolated compute cluster that runs your queries. Warehouses can be created, suspended, resumed, and resized independently of one another and independently of the data they read. The pricing model is per-second of warehouse uptime.
What is separation of compute and storage?
Snowflake stores every table on cloud object storage that is decoupled from any compute cluster. Compute (virtual warehouses) and storage scale independently — you can resize compute in seconds with no data motion and add petabytes of storage without changing compute sizing.
What is Time Travel?
Time Travel is the ability to query a table's historical state via AT (TIMESTAMP => …) or BEFORE (STATEMENT => …) clauses, within the table's retention window (1–90 days). It powers UNDROP TABLE and accidental-write recovery without external backups.
What is zero-copy cloning?
Zero-copy cloning uses CREATE … CLONE to produce a new database, schema, or table that shares the source's underlying micro-partitions. No data is copied at clone time; the clone diverges only when either side writes. Ideal for instant dev / test environments.
How does Snowflake compare to Redshift and BigQuery?
Snowflake runs on AWS, GCP, and Azure with fully separated compute and storage. Redshift is AWS-only and was historically coupled (Redshift Serverless changes that). BigQuery is serverless and GCP-only, billing per byte scanned. Pick by cloud, workload shape, and team size, not by feature count.
How long does it take to learn Snowflake?
If your SQL fluency is solid, the core ideas (warehouses, separation of compute and storage, COPY INTO, Time Travel, cloning) take 1–2 weeks of focused practice. Advanced topics (clustering, materialised views, streams + tasks, multi-cluster tuning) take another 2–4 weeks of real-world use.
10. Practice on PipeCode
PipeCode ships 450+ data engineering practice problems — SQL uses the PostgreSQL dialect, with editorials and topics aligned to the same patterns Snowflake interviewers ask. Start from Explore practice →, open SQL practice →, filter by ETL → or aggregations →, and see plans → when you want the full library.