Snowflake is the cloud-native data warehouse most modern data teams stand on — it stores petabytes, runs analytics in seconds, scales compute and storage independently, and ships features (Time Travel, zero-copy cloning, secure data sharing) that legacy MPP warehouses cannot match. For freshers preparing for data-engineering interviews, Snowflake is a high-leverage skill: the architecture is fundamentally different from Postgres or MySQL, and the same two or three concepts show up in every interview loop.
Think of this as a beginner-friendly Snowflake tutorial for data engineers — a first-principles walk through the Snowflake data warehouse from three-layer architecture to performance tuning. We start with "what is Snowflake database" in plain English, cover the killer "separation of compute and storage" idea, virtual warehouses, COPY INTO for loading, Time Travel and cloning for recovery / dev, micro-partitions and query pruning for performance, and how Snowflake compares to Redshift, BigQuery, Databricks, and Azure Synapse. Every section ships worked examples and a Snowflake interview-question-style problem with a full worked solution, in the same shape PipeCode practice problems use.
If you want hands-on reps after you read, explore practice →, drill SQL problems →, browse ETL practice →, or open ETL System Design for Data Engineering Interviews → for a structured path.
On this page
- Why Snowflake matters
- The three-layer architecture
- Separation of compute and storage
- Loading and querying data
- Time Travel and zero-copy cloning
- Performance optimization
- Snowflake vs Redshift vs BigQuery
- Choosing Snowflake (checklist)
- Frequently asked questions
- Practice on PipeCode
1. Why Snowflake matters
What is Snowflake database — cloud-native data warehousing for analytics at scale
So, what is Snowflake database in one sentence? Snowflake is a cloud-based Snowflake data warehouse platform used to store, process, and analyze enormous amounts of data — orders of magnitude beyond what a single Postgres or MySQL instance can handle. Companies use it for data warehousing, analytics, BI dashboards, Snowflake data sharing, ETL/ELT pipelines, and ML feature storage, and it runs as a managed service on AWS, GCP, and Azure — the same Snowflake SQL surface, same UI, same features regardless of which cloud you pick.
Pro tip: When an interviewer asks "why Snowflake?", lead with the workload, not the brand. Snowflake exists because OLTP databases (Postgres, MySQL) become slow once a single table crosses a few hundred million rows under heavy analytical reads. Snowflake separates the analytical workload from the transactional one and scales each part independently.
Data warehouse vs OLTP database — different shapes for different jobs
The warehouse invariant: OLTP databases (Postgres, MySQL) are optimised for high-frequency single-row reads and writes; data warehouses (Snowflake, Redshift, BigQuery) are optimised for low-frequency, very wide scans over billions of rows; using one for the other workload produces a system that is slow on both axes. The line between them is mostly about row-store vs columnar storage and transactional vs analytical query patterns.
- OLTP — row-store: Postgres / MySQL store rows contiguously; reading one row of 30 columns is one disk seek.
- OLAP — columnar: Snowflake / Redshift / BigQuery store columns contiguously; reading one column of 100 M rows is one sequential scan.
- Transactions: OLTP holds row-level locks for ACID writes; warehouses commit in batches.
- Concurrency model: warehouses scale by spinning up parallel compute clusters; OLTP scales vertically.
Worked example. Same 100 M-row orders table, two workloads:
| query | Postgres (OLTP) | Snowflake (warehouse) |
|---|---|---|
| INSERT INTO orders … VALUES (…) | ~1 ms | ~500 ms (batched) |
| SELECT category, SUM(amount) FROM orders GROUP BY 1 | 60 s scan | 2 s columnar scan |
| UPDATE orders SET status='shipped' WHERE order_id = 42 | ~1 ms | ~500 ms |
| 50 concurrent analyst dashboards | dies under load | each on its own warehouse |
Step-by-step.
- Postgres is great for the transactional inserts — single-row writes complete in milliseconds.
- Once an analyst runs GROUP BY category over 100 M rows, Postgres scans every row and blocks the OLTP workload.
- Snowflake stores amount and category as separate compressed column files; the same GROUP BY reads ~5% of the bytes Postgres reads.
- Snowflake also supports many parallel "virtual warehouses" so the 50 dashboards do not contend with the daily ETL.
- The right move is to keep transactional work in Postgres and ELT the data into Snowflake for analytics.
Worked-example solution. A typical split:
Application writes (Postgres: orders, users, payments)
    → daily ELT into Snowflake, every hour or every minute
    → analytics in Snowflake: BI dashboards, ML features, ad-hoc SQL
Rule of thumb: if a query joins many tables, scans many rows, and runs on a schedule for humans to read, it belongs in a warehouse — not in your OLTP database.
Multi-cloud as a feature, not a buzzword
The multi-cloud invariant: Snowflake runs the same control plane and SQL surface on AWS, GCP, and Azure; an account is bound to one cloud and one region, but secure data sharing crosses cloud boundaries and replication is built in. You pick the cloud that matches the rest of your stack; you do not get locked into the warehouse vendor's preferred cloud.
- Account region — one cloud (AWS / GCP / Azure) and one region (e.g. us-east-1).
- Cross-region replication — built-in; for HA and analytics close to consumers.
- Cross-cloud data sharing — shared databases work even when provider and consumer live on different clouds.
- Same SQL surface — CREATE WAREHOUSE, COPY INTO, Time Travel work identically across clouds.
Worked example. A SaaS company runs ingestion on GCP and BI on AWS:
| component | cloud | reason |
|---|---|---|
| product backend | GCP | existing team |
| ingestion → Snowflake | GCP-region Snowflake | same-cloud latency |
| BI dashboards | AWS-region Snowflake (read replica) | analyst tools on AWS |
Step-by-step.
- Ingestion writes raw events to a GCP Snowflake account; same-cloud egress is free / minimal.
- A Snowflake replication policy mirrors the curated schema to an AWS Snowflake account every 15 minutes.
- Analyst tools (Tableau, Looker, Mode) all live on AWS; queries hit the AWS account with low latency.
- Disaster recovery comes for free — either account can survive a single-cloud incident.
- The application teams pick whichever cloud suits them; the warehouse never becomes the point of contention.
Worked-example solution. Cross-cloud replication setup (concept):
-- on the primary (GCP) account
CREATE REPLICATION GROUP analytics_repl
OBJECT_TYPES = (DATABASES)
ALLOWED_DATABASES = ('PROD_DW')
ALLOWED_ACCOUNTS = ('aws_account_locator');
-- on the secondary (AWS) account
CREATE DATABASE PROD_DW
AS REPLICA OF gcp_account_locator.PROD_DW;
ALTER DATABASE PROD_DW REFRESH;
Rule of thumb: let the team that writes the data pick the cloud and let everyone else attach via shared databases or replicated copies.
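Secure data sharing is how everyone else attaches without copying data. A minimal sketch of a provider-side share and a consumer-side mount, with hypothetical account and object names:
-- on the provider account: expose one table through a share
CREATE SHARE orders_share;
GRANT USAGE ON DATABASE PROD_DW TO SHARE orders_share;
GRANT USAGE ON SCHEMA PROD_DW.public TO SHARE orders_share;
GRANT SELECT ON TABLE PROD_DW.public.fact_orders TO SHARE orders_share;
ALTER SHARE orders_share ADD ACCOUNTS = partner_org.partner_account;
-- on the consumer account: mount the share as a read-only database
CREATE DATABASE orders_shared FROM SHARE provider_org.provider_account.orders_share;
The data is never copied; the consumer pays only for the compute it uses to query the shared tables.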
Real-world use cases — where Snowflake earns its keep
The use-case invariant: Snowflake is the right tool when the workload is analytical, the data volume is large, and the user count is concurrent; it is the wrong tool for low-latency single-row reads, sub-second OLTP transactions, or kilobyte-scale lookup tables. Recognising the workload is half the interview answer.
- BI dashboards — Looker, Tableau, Mode, Power BI all read from Snowflake natively.
- Customer analytics — clickstream, retention cohorts, funnel analysis.
- ML feature stores — typed, time-partitioned features served to training and online inference.
- Financial reporting — NUMERIC(38,6) precision, ACID transactions, audit history via Time Travel.
- Secure data sharing — sell anonymised datasets to partners without ETL or file transfer.
Worked example. An e-commerce company's Snowflake schema:
| table | grain | source | consumer |
|---|---|---|---|
| fact_orders | one row per order line | Postgres CDC | BI, finance, ML |
| fact_clicks | one row per page view | Kafka → Kinesis | marketing |
| dim_customer | one row per customer (SCD2) | Postgres CDC | every fact |
| dim_product | one row per product | Postgres CDC | every fact |
Step-by-step.
- Orders, clicks, payments live transactionally in Postgres; events stream through Kafka.
- A CDC pipeline (Fivetran, Airbyte, custom Debezium) lands raw rows into Snowflake every few minutes.
- dbt models build star-schema fact / dimension tables from the raw layer.
- BI tools query the gold layer via SELECT … FROM dim_customer JOIN fact_orders ….
- The same fact_orders table feeds the daily revenue dashboard, the monthly investor report, and the ML feature pipeline — no copies, no drift.
Worked-example solution. A minimal fact_orders schema:
CREATE TABLE fact_orders (
order_id NUMBER(38,0) PRIMARY KEY,
customer_id NUMBER(38,0) NOT NULL,
product_id NUMBER(38,0) NOT NULL,
order_date DATE NOT NULL,
amount NUMBER(14,2) NOT NULL
)
CLUSTER BY (order_date);
Rule of thumb: "is this a dashboard, an ML feature, or a recurring report?" → Snowflake. "Is this a real-time write?" → Postgres / DynamoDB / Cassandra.
Common beginner mistakes
- Treating Snowflake as a faster Postgres — running single-row INSERTs in a loop is slow because every commit writes new files to object storage.
- Picking Snowflake when the dataset fits on one machine — a daily 1 GB CSV does not need a cloud warehouse; SQLite or DuckDB are cheaper and faster.
- Forgetting to suspend warehouses — every minute a warehouse runs is billed; idle warehouses are real money.
- Storing OLTP-shaped row-by-row data — Snowflake compresses columns; wide schemas with few rows are an anti-pattern.
- Skipping the architecture layer in interviews — "Snowflake is fast" is not an answer; "it separates compute and storage" is.
Snowflake Interview Question on Picking a Warehouse vs Database
A team is debating whether to put a 100 M-row monthly aggregate report on top of their OLTP Postgres database or load it into Snowflake first. The Postgres database also serves the live shopping cart. Lay out the decision criteria and propose an architecture that keeps both the cart and the report performant.
Solution Using Postgres for OLTP + Snowflake for OLAP via Daily ELT
Code solution.
-- Postgres holds the transactional truth
CREATE TABLE postgres.public.orders (
order_id BIGSERIAL PRIMARY KEY,
customer_id BIGINT NOT NULL,
placed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
amount NUMERIC(14,2)
);
-- A daily ELT job lands the same rows in Snowflake
COPY INTO snowflake.dw.fact_orders
FROM @postgres_stage/orders/dt=2026-05-11/
FILE_FORMAT = (TYPE = PARQUET);
-- The monthly report runs entirely in Snowflake
SELECT DATE_TRUNC('month', placed_at) AS month,
SUM(amount) AS revenue
FROM snowflake.dw.fact_orders
GROUP BY 1
ORDER BY 1;
Step-by-step trace.
| step | actor | action | outcome |
|---|---|---|---|
| 1 | shopping cart | inserts new order into Postgres | row committed in ms |
| 2 | nightly ELT | exports yesterday's orders to Parquet on S3 | one staged file per day |
| 3 | nightly ELT | runs COPY INTO Snowflake | rows added to fact_orders |
| 4 | analyst | runs monthly report on Snowflake | 2 s columnar scan |
| 5 | Postgres | only sees OLTP load | cart stays fast |
Output: the cart latency stays under 100 ms because Postgres only runs OLTP work; the monthly report returns in seconds because Snowflake scans the partitioned, compressed columnar copy; the two systems never block each other.
Why this works — concept by concept:
- Postgres for OLTP — row-store + indexes + ACID transactions; perfect for single-order writes.
- Snowflake for OLAP — columnar + massively parallel; perfect for full-table aggregations.
- Daily ELT — moves the analytical workload to the analytical engine; freshness is "yesterday's data" which is fine for monthly reports.
- COPY INTO — Snowflake's bulk loader; parallelises file ingestion across compute nodes.
- One source of truth — Postgres remains the system of record; Snowflake is a derived copy that can be rebuilt at any time.
- Cost — Postgres reads stay O(1) per cart op; Snowflake aggregation is O(N) on the columnar copy but runs in parallel and never touches Postgres.
Inline CTA: drill the ETL practice page for ingestion patterns and the SQL practice page for analytical SQL fluency.
2. The three-layer architecture
Storage, compute (virtual warehouses), and cloud services — decoupled by design
Snowflake's killer architectural decision is three independent layers that scale and pay for themselves separately: a storage layer that holds your data on cloud object storage forever, a compute layer of "virtual warehouses" that run queries in isolated clusters, and a cloud services layer that handles authentication, optimisation, metadata, and security. Legacy MPP warehouses (and older Redshift) couple compute and storage into a single cluster; Snowflake's split is what makes everything else (Time Travel, cloning, multi-tenant compute) possible.
Pro tip: When an interviewer asks "explain Snowflake's architecture," name the three layers in the order storage → compute → cloud services and immediately add "and the key idea is that compute and storage scale independently." That single sentence covers 70% of the architecture answer; the rest is detail.
Database storage layer — compressed columnar files on object storage
The storage invariant: Snowflake stores every table as a set of compressed, columnar micro-partitions (each holding roughly 50–500 MB of uncompressed data) on cloud object storage (S3 / GCS / ADLS); the database engine manages compression, encryption, metadata, and file organisation automatically — you write SQL, Snowflake handles the rest. There is no VACUUM, no manual partitioning, no index maintenance.
- Micro-partitions — files holding 50–500 MB of uncompressed data each, stored compressed and columnar; automatically sized.
- Columnar format — every column stored separately; analytical scans read only needed columns.
- Automatic compression — Snowflake picks the codec per column based on data distribution.
- Immutable files — updates write new files; old files retained for Time Travel.
- Per-column statistics — min/max/distinct count per micro-partition; powers query pruning.
Worked example. A 100 M-row orders table laid out internally:
| component | what Snowflake stores |
|---|---|
| orders.order_id | ~5,000 micro-partitions, sorted by order_id, RLE-compressed |
| orders.amount | same partitions, ZSTD-compressed |
| orders.placed_at | same partitions, dictionary-encoded dates |
| metadata | per-partition min/max/distinct for every column |
Step-by-step.
- You write INSERT INTO orders SELECT … FROM staging; Snowflake doesn't write rows — it writes columnar files.
- Each batch produces a handful of new micro-partitions (typical partition ≈ 16 MB compressed on disk).
- The cloud-services layer records per-column min/max in metadata for every new partition.
- A later WHERE placed_at = '2026-05-10' can skip ~99% of partitions using the date min/max — that's query pruning.
- The original files are never modified; an UPDATE writes new partitions and marks the old ones as expired (visible via Time Travel for the retention window).
Worked-example solution. A typical Snowflake CREATE TABLE with Snowflake data types (NUMBER(p,s), TIMESTAMP_TZ) and a CLUSTER BY for predictable partition layout:
CREATE TABLE fact_orders (
order_id NUMBER(38,0),
customer_id NUMBER(38,0),
placed_at TIMESTAMP_TZ,
amount NUMBER(14,2)
)
CLUSTER BY (placed_at);
-- micro-partitions now naturally co-locate by date,
-- making date-range queries skip more partitions
Rule of thumb: you do not manage storage; you do not run VACUUM. If a query is slow on a clustered table, the answer is usually change the cluster key, not "rewrite the table."
Compute layer — virtual warehouses run the queries
The compute invariant: a virtual warehouse is a named, sized, isolated MPP compute cluster that runs your SQL; warehouses can be created, resumed, suspended, and resized independently; many warehouses can read the same storage simultaneously without contention. The pricing model is straightforward — you pay per credit per second of warehouse uptime; suspending a warehouse stops the meter.
- Warehouse sizes — X-SMALL (1 node), SMALL (2), MEDIUM (4), … up to 6X-LARGE (512).
- Multi-cluster warehouses — auto-scale parallel clusters when concurrency grows.
- Auto-suspend / auto-resume — pause after N minutes idle; wake on demand.
- Per-team isolation — ETL on warehouse A, analysts on warehouse B; one cannot slow the other.
- Billing — per-second after a 60-second minimum.
Worked example. A team-isolated warehouse design:
| warehouse | size | who uses | typical workload |
|---|---|---|---|
| WH_ETL | MEDIUM | nightly pipeline | one heavy MERGE per night |
| WH_BI | SMALL | dashboard tools | hundreds of small concurrent queries |
| WH_ANALYSTS | LARGE | ad-hoc SQL | occasional 10 B-row scans |
| WH_ML | XLARGE | feature pipeline | scheduled hourly batches |
Step-by-step.
- Each team's queries route to their own warehouse — a misbehaving analyst query cannot block the BI dashboard.
- The ETL warehouse runs for ~45 min/night, then auto-suspends; you pay only for that window.
- The BI warehouse stays warm during business hours with multi-cluster auto-scaling so 200 concurrent dashboards never queue.
- The analyst warehouse spins up only when someone runs a big ad-hoc query.
- All four warehouses read and write the same underlying tables — there is one source of truth.
Worked-example solution. Create and size a warehouse:
CREATE WAREHOUSE WH_BI
WITH WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 60 -- pause after 60s idle
AUTO_RESUME = TRUE
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 4
SCALING_POLICY = 'STANDARD';
Rule of thumb: every team gets their own warehouse named after the team. Cost attribution and noise isolation come together that way.
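To see the cost-attribution half of that rule, the account-usage metering view breaks credits down per warehouse. A minimal sketch; the 30-day window and the $3/credit rate are assumptions, not defaults:
-- approximate per-team spend, one row per warehouse
SELECT warehouse_name,
       SUM(credits_used)     AS credits_last_30d,
       SUM(credits_used) * 3 AS approx_usd_last_30d   -- substitute your contracted rate
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_last_30d DESC;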
Cloud services layer — the brain that ties it all together
The services invariant: the cloud services layer handles authentication, query optimisation, metadata, transaction management, security, and access control; it is shared across all warehouses; you never interact with it directly but it powers every "Snowflake feels magic" experience. It is also what makes zero-copy cloning, secure data sharing, and Time Travel cheap.
- Authentication & RBAC — users, roles, grants; role-based access at every level.
- Query optimiser — column statistics + cost model produce the execution plan.
- Metadata store — micro-partition stats, transaction log, history.
- Result cache — recent query results returned without warehouse compute.
- Background services — re-clustering, materialised view maintenance.
Worked example. A query lifecycle:
| step | layer | action |
|---|---|---|
| 1 | cloud services | authenticate user, check role grants |
| 2 | cloud services | parse SQL, compile execution plan, fetch metadata |
| 3 | cloud services | check result cache — if hit, return immediately (no compute) |
| 4 | virtual warehouse | nodes fetch needed micro-partitions from storage |
| 5 | virtual warehouse | run scan / filter / aggregate in parallel |
| 6 | cloud services | gather results, cache, return to client |
Step-by-step.
- The client submits SQL with a session token; cloud services verifies the token and resolves grants.
- The optimiser uses table metadata (partition stats, clustering, statistics) to pick the cheapest plan.
- Result cache — if the same SQL on the same data was answered in the last 24 h, the result is returned instantly with no warehouse usage.
- On a miss, the active warehouse spins up nodes that fetch the needed columns from object storage.
- Compute aggregates and returns; the result is cached and the warehouse goes back to idle.
Worked-example solution. Result-cache demo:
ALTER SESSION SET USE_CACHED_RESULT = TRUE; -- default
SELECT COUNT(*) FROM fact_orders; -- first run: 4 s warehouse compute
SELECT COUNT(*) FROM fact_orders; -- second run: 60 ms cache hit
Rule of thumb: if a repeated query suddenly takes seconds again, suspect that someone modified the underlying table — cache invalidates on any change.
Common beginner mistakes
- Confusing virtual warehouses with databases — a warehouse is compute, a database is storage; both are needed.
- Sizing the warehouse for the peak instead of the average — bigger warehouses cost linearly more; right-size and use multi-cluster scaling.
- Leaving warehouses without auto-suspend — every idle minute is a real charge.
- Putting every team's queries on one warehouse — one slow query starves everyone.
- Forgetting the result cache exists — re-running benchmarks without flushing the cache reports unrealistic numbers.
Snowflake Interview Question on Designing a Multi-Team Warehouse Strategy
A 50-person data team complains that "Snowflake is slow at 9 AM." Everyone shares one XLARGE warehouse: ETL, analysts, BI, ML. Propose a multi-warehouse design that fixes the 9 AM contention without paying more in total credits.
Solution Using Per-Team Warehouses with Auto-Suspend and Multi-Cluster Scaling
Code solution.
-- ETL: heavy, scheduled, short bursts
CREATE WAREHOUSE WH_ETL
WITH WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
-- BI: many small queries, business-hours concurrency
CREATE WAREHOUSE WH_BI
WITH WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 60
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 4; -- multi-cluster for concurrency
-- Analysts: occasional big queries
CREATE WAREHOUSE WH_ANALYSTS
WITH WAREHOUSE_SIZE = 'LARGE'
AUTO_SUSPEND = 60;
-- ML: scheduled feature jobs
CREATE WAREHOUSE WH_ML
WITH WAREHOUSE_SIZE = 'XLARGE'
AUTO_SUSPEND = 60;
Step-by-step trace.
| step | observation |
|---|---|
| 1 | original shared XLARGE |
| 2 | split into 4 warehouses, each right-sized |
| 3 | AUTO_SUSPEND = 60 everywhere |
| 4 | WH_BI adds multi-cluster scaling |
| 5 | total credits/day drops |
| 6 | nobody waits behind another team's query |
Output: the 9 AM dashboard lag disappears (BI auto-scales horizontally), the nightly ETL stops fighting the analyst's ad-hoc queries, and the total credit bill drops because nothing is "always on" anymore.
Why this works — concept by concept:
- Per-team warehouses — each team's queries route to their own compute; nobody else can starve them.
- Right-sized warehouses — BI needs concurrency (multi-cluster small); analysts need vertical power (large); they are not the same shape.
- AUTO_SUSPEND = 60 — the silver bullet of Snowflake cost; warehouse billing stops 60 s after the last query.
- Multi-cluster scaling on WH_BI — additional clusters spin up when queue depth grows, then drop when it falls; no human tuning.
- Same storage, isolated compute — all four warehouses read identical tables; one source of truth.
- Cost — moves billing from "one big always-on warehouse" to "many right-sized warehouses billed only while running"; typical savings 30–60%.
Inline CTA: see ETL System Design for Data Engineering Interviews for end-to-end warehouse-shaping playbooks.
3. Separation of compute and storage
Independent scaling of warehouses and data — the single most important Snowflake idea
In a legacy warehouse (Teradata, classic Redshift), compute and storage are bolted together — you buy a "cluster" with both, and if you need more of either you have to buy both. Snowflake's defining decision is that compute (virtual warehouses) and storage (cloud object storage) scale independently: spin up a XLARGE warehouse for a one-hour backfill, then drop back to SMALL; add a petabyte of data without touching compute; never pay for capacity you are not using right now.
Pro tip: This is the single most-asked Snowflake interview question, in some form, across data-engineering loops. Memorise the one-line answer: "Compute lives in virtual warehouses that I size and suspend independently; storage lives once on object storage and every warehouse reads the same files."
How the scaling actually works
The scaling invariant: adding a node to a warehouse, resizing a warehouse, or creating a new warehouse never moves data; the new compute simply fetches the same micro-partitions from object storage. The implication is huge — you can resize compute in seconds (no rebalancing) and the data layer never blocks an operational change.
- Resize — ALTER WAREHOUSE WH_ETL SET WAREHOUSE_SIZE='LARGE' — takes seconds; no data motion.
- New warehouse — creates a new cluster pointing at the same storage; you can have N warehouses on the same data.
- Suspend — ALTER WAREHOUSE WH_ETL SUSPEND — stops the compute meter; data remains on object storage.
- Resume — ALTER WAREHOUSE WH_ETL RESUME — spins compute back up in seconds.
- Storage grows independently — adding 1 PB does not change warehouse sizing.
Worked example. A one-hour backfill at the end of the quarter:
| time | warehouse size | what's running |
|---|---|---|
| 00:00–08:00 | MEDIUM | normal nightly ETL |
| 08:00–09:00 | resized to XLARGE (4× the compute) | one-time quarterly backfill |
| 09:00 onwards | back to MEDIUM | resume normal work |
Step-by-step.
- The team has a 2 B-row backfill that would take roughly 4 hours on the regular MEDIUM warehouse.
- At 08:00 the operator runs ALTER WAREHOUSE WH_ETL SET WAREHOUSE_SIZE='XLARGE'.
- The warehouse scales from 4 to 16 nodes in seconds — no data moves.
- The backfill completes in ~1 hour because compute is 4× larger.
- At 09:00 the operator runs SET WAREHOUSE_SIZE='MEDIUM' and the credit bill goes back to the steady-state rate.
Worked-example solution. Temporary upsize for a backfill:
ALTER WAREHOUSE WH_ETL SET WAREHOUSE_SIZE = 'XLARGE';
-- run the backfill
INSERT INTO fact_orders_history
SELECT * FROM raw.orders WHERE order_date < '2026-01-01';
-- back to steady state
ALTER WAREHOUSE WH_ETL SET WAREHOUSE_SIZE = 'MEDIUM';
Rule of thumb: resize for the slow query, then resize back. Snowflake makes this a 5-minute operation, not a migration.
Why this breaks legacy assumptions
The legacy-comparison invariant: Teradata, classic Redshift, and on-prem MPP warehouses all couple compute and storage; adding storage means adding compute, and resizing compute requires a data rebalance. Snowflake's separation removes both constraints — the cost model and operational model are fundamentally different.
- Coupled (legacy) — pay for over-provisioned compute year-round to handle peak storage.
- Coupled — resize = rebalance = downtime.
- Decoupled (Snowflake) — pay for compute by the second, only when running.
- Decoupled — resize = seconds; new warehouse = seconds; no data motion.
Worked example. Same workload, two architectures:
| dimension | legacy MPP (Teradata / old Redshift) | Snowflake |
|---|---|---|
| add 1 TB of data | requires bigger cluster | no compute change needed |
| resize compute | hours of rebalance | seconds |
| dev/test workload | needs own cluster | new warehouse on same data |
| paying for peak | always | only during the peak |
Step-by-step.
- A legacy 8-node Teradata cluster sized for the year-end peak runs at 20% utilisation the other 51 weeks.
- The same workload on Snowflake uses a SMALL warehouse 51 weeks of the year, scales to XLARGE for one week, then back.
- Storage costs are roughly the same (both are object-store class).
- Compute costs drop ~70% because you only pay for the XLARGE during the week it is needed.
- New environments (dev, staging) are free — they are just new warehouse names pointing at the same storage (or a clone).
Worked-example solution. Dev environment using a zero-copy clone:
CREATE DATABASE PROD_DW_DEV CLONE PROD_DW;
-- new warehouse for the dev team
CREATE WAREHOUSE WH_DEV WITH WAREHOUSE_SIZE = 'XSMALL';
USE DATABASE PROD_DW_DEV;
USE WAREHOUSE WH_DEV;
Rule of thumb: if a Snowflake setup ever "feels like" a legacy MPP cluster — always-on, hard to resize, single-tenant — it is being run wrong.
Cost implications and credit economics
The credit invariant: storage cost is roughly constant (compressed columnar on cloud storage); compute cost is variable and dominated by how long warehouses run; suspending warehouses and right-sizing them is the single highest-leverage cost lever Snowflake gives you. The default account settings are not always cost-optimal; tuning them matters.
- Credits — Snowflake's compute currency; price varies by region and edition.
- Warehouse credits per hour — XS=1, S=2, M=4, L=8, XL=16 (doubles with each size step).
- AUTO_SUSPEND — defaults to 10 minutes; set it to 60 s for spiky workloads.
- Storage — flat $/TB/month; minor compared to compute for most workloads.
- Result cache — free; queries served from cache don't burn credits.
Worked example. Monthly bill comparison:
| design | warehouse | hours/month running | credits | cost |
|---|---|---|---|---|
| naive — always-on XL | XLARGE | 730 | 11,680 | $35,040 |
| auto-suspended XL | XLARGE | 80 | 1,280 | $3,840 |
| right-sized + auto-suspend | MEDIUM most, XL spike | 60 | 320 | $960 |
Step-by-step.
- Always-on XL: 730 hours × 16 credits/hr × $3/credit = $35 k/month.
- Same XL but with AUTO_SUSPEND = 60 s: only runs when queries are active; ~80 hours/month → $3,840.
- Right-sized — MEDIUM for the steady state, XL only for the weekly backfill: ~$960.
- The data is identical; only the compute schedule changes.
- Result cache further reduces this for repeated queries.
Worked-example solution. Cost-aware warehouse config:
CREATE WAREHOUSE WH_BI
WITH WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
INITIALLY_SUSPENDED = TRUE;
Rule of thumb: the first rule of Snowflake cost is suspend warehouses; the second rule is right-size warehouses; everything else is rounding error.
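A quick audit-and-fix pass for rule one; the warehouse name is illustrative:
SHOW WAREHOUSES;  -- the auto_suspend column shows each warehouse's idle timeout (NULL = never suspends)
ALTER WAREHOUSE WH_BI SET AUTO_SUSPEND = 60;   -- stop the meter 60 s after the last query
ALTER WAREHOUSE WH_BI SET AUTO_RESUME = TRUE;  -- wake automatically on the next query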
Common beginner mistakes
- Resizing a warehouse and waiting for "data to move" — it does not; the resize is metadata-only.
- Running XLARGE always-on for occasional queries — instead, pay for an XSMALL 23 hours a day and an XLARGE only for the one hour it is needed.
- Treating the result cache as a free pass for "fast" queries that are actually expensive on a cold cache.
- Ignoring AUTO_SUSPEND — the default of 10 minutes is wasteful for low-frequency workloads.
- Building a single shared warehouse for everyone — undoes the entire isolation benefit.
Snowflake Interview Question on Cost-Optimising a $50k Monthly Bill
The CFO points at a $50k/month Snowflake bill. Your single XLARGE warehouse has AUTO_SUSPEND = NULL (it never suspends). Average usage is 4 hours/day across two distinct workloads (BI in business hours, ETL at night). Cut the bill by at least 60% without losing performance.
Solution Using Workload Isolation + Auto-Suspend + Right-Sizing
Code solution.
-- Stop the always-on XL
ALTER WAREHOUSE WH_OLD SUSPEND;
DROP WAREHOUSE WH_OLD;
-- BI: business-hours, many small queries
CREATE WAREHOUSE WH_BI
WITH WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 60
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 3;
-- ETL: nightly, single big batch
CREATE WAREHOUSE WH_ETL
WITH WAREHOUSE_SIZE = 'LARGE'
AUTO_SUSPEND = 60;
Step-by-step trace.
| step | observation | monthly cost |
|---|---|---|
| 1 | original XLARGE always-on | 730 h × 16 credits × $3 ≈ $35k/month in compute alone |
| 2 | usage audit: 4 BI hours/day + 1 ETL hour/night | most of the bill was idle time |
| 3 | split into BI SMALL + ETL LARGE | both auto-suspend |
| 4 | BI: 4 h/day × 30 = 120 h × 2 credits = 240 credits | ~$720 |
| 5 | ETL: 1 h/night × 30 = 30 h × 8 credits = 240 credits | ~$720 |
| 6 | new total | $1,440 ≈ 97% reduction |
Output: monthly Snowflake spend drops from $50k to roughly $1.5k. BI users still see sub-second dashboards (multi-cluster scaling absorbs the morning spike). ETL still completes in its nightly window (LARGE is fast enough). Nothing breaks.
Why this works — concept by concept:
- Workload isolation — BI and ETL have different concurrency profiles; one warehouse cannot serve both well.
- AUTO_SUSPEND = 60 — the warehouse meter stops 60 s after the last query; idle time is no longer paid for.
- Right-sizing — BI gets SMALL with multi-cluster (concurrency); ETL gets LARGE (throughput). No need for XL.
- Same storage — no data motion; both warehouses read the same tables.
- Visible per-warehouse cost — separate warehouses surface per-team spend in WAREHOUSE_METERING_HISTORY.
- Cost — credit consumption is proportional to active query time, not wall-clock time.
Inline CTA: drill the ETL practice page for warehouse-sizing scenarios.
4. Loading and querying data
Stages, Snowflake COPY INTO, file formats, and the Snowflake SQL surface
The Snowflake COPY INTO command is the primary bulk-load mechanism — it reads files from a stage (an internal or external file location) and inserts them into a table in parallel. The file format is declared explicitly (CSV, JSON, Parquet, Avro, ORC). Once data is in, you query it with standard Snowflake SQL — SELECT / JOIN / GROUP BY look identical to the dialect you already know.
Pro tip: Interviewers love the COPY INTO question because it has clear right answers — file format, error handling, parallelism, and idempotency are all observable design choices. Practise saying "I stage the files, declare the format, and run COPY INTO with ON_ERROR = SKIP_FILE_AND_CONTINUE and a load-history check" in one sentence.
Stages — external and internal file locations
The stage invariant: a stage is a named file location Snowflake knows how to read from; internal stages live inside Snowflake (managed for you); external stages point at S3 / GCS / ADLS buckets you manage; both behave identically for COPY INTO. Stages are also reusable — one stage definition can be reused by many COPY INTO statements.
- Internal stage — @~/path (user), @%TABLE (table), @stage_name (named).
- External stage — points at s3://bucket/path/, gs://bucket/path/, azure://….
- Storage integration — security object that grants Snowflake permission to read the bucket.
- Listing — LIST @my_stage; shows the files visible in the stage.
Worked example. Define an external S3 stage:
| object | purpose |
|---|---|
| STORAGE INTEGRATION | IAM trust between Snowflake and AWS |
| FILE FORMAT | declares CSV / JSON / Parquet rules |
| EXTERNAL STAGE | named location pointing at the bucket |
| COPY INTO | the load command that uses the stage + format |
Step-by-step.
- Create a STORAGE INTEGRATION in Snowflake; this generates an IAM trust policy you paste into AWS.
- Create a FILE FORMAT describing the data — TYPE = PARQUET is the simplest; CSV needs more options.
- Create an EXTERNAL STAGE that combines the integration and the bucket path.
- List files in the stage to confirm permissions: LIST @prod_s3_stage.
- Run COPY INTO against the stage; Snowflake fetches files in parallel across warehouse nodes.
Worked-example solution. End-to-end stage setup:
CREATE STORAGE INTEGRATION s3_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/SnowflakeReadRole'
STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/snowflake/');
CREATE FILE FORMAT ff_parquet
TYPE = PARQUET;
CREATE STAGE prod_s3_stage
STORAGE_INTEGRATION = s3_int
URL = 's3://my-bucket/snowflake/'
FILE_FORMAT = ff_parquet;
Rule of thumb: one storage integration per AWS account, one stage per logical bucket path, one file format per file shape — keeps grants and schemas tidy.
COPY INTO — the bulk loader
The COPY invariant: COPY INTO table FROM @stage parallelises file ingestion across all nodes of the active warehouse; errors are handled by ON_ERROR policy; COPY INTO is idempotent on a per-file basis — re-running the command will skip files already loaded (tracked in LOAD_HISTORY). The same command works for any file format the stage's file format declared.
- Parallelism — file count × warehouse nodes; more files + bigger warehouse = faster load.
- ON_ERROR — CONTINUE (skip rows), SKIP_FILE_AND_CONTINUE, ABORT_STATEMENT (fail fast).
- PATTERN — regex filter on filenames; load only .parquet files, etc.
- Load history — the LOAD_HISTORY view records every file's commit; re-running skips loaded files.
- PURGE = TRUE — delete source files after a successful load.
Worked example. Daily Parquet drop into fact_orders:
| stage prefix | file | loaded? |
|---|---|---|
| s3://…/orders/dt=2026-05-10/ | part-0000.parquet | ✓ (yesterday's run) |
| s3://…/orders/dt=2026-05-11/ | part-0000.parquet | new |
| s3://…/orders/dt=2026-05-11/ | part-0001.parquet | new |
Step-by-step.
- The previous day's COPY INTO loaded dt=2026-05-10/part-0000.parquet; it appears in LOAD_HISTORY.
- Today's COPY INTO runs against the same stage path.
- Snowflake consults LOAD_HISTORY, sees dt=2026-05-10/part-0000.parquet was already loaded, and skips it.
- The two new files for dt=2026-05-11/ are loaded in parallel.
- Re-running tonight's command would skip every file because all three are now in LOAD_HISTORY — idempotency.
Worked-example solution. Daily idempotent load:
COPY INTO fact_orders
FROM @prod_s3_stage/orders/
FILE_FORMAT = (FORMAT_NAME = ff_parquet)
PATTERN = '.*[.]parquet'
ON_ERROR = 'SKIP_FILE_AND_CONTINUE';
-- inspect what loaded
SELECT * FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
TABLE_NAME => 'FACT_ORDERS',
START_TIME => DATEADD(hours, -1, CURRENT_TIMESTAMP)
));
Rule of thumb: always wrap COPY INTO with a post-load row-count assertion and an alert on LOAD_HISTORY errors — silent skips are how data drifts.
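One way to wire that check is to query the same COPY_HISTORY function after the load; any row this returns should fail the run. The one-hour window and table name carry over from the example above:
-- post-load check: surface any file that errored or only partially loaded
SELECT file_name, row_count, row_parsed, error_count, status
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
       TABLE_NAME => 'FACT_ORDERS',
       START_TIME => DATEADD(hours, -1, CURRENT_TIMESTAMP())))
WHERE error_count > 0
   OR status <> 'Loaded';
-- an empty result means the load is clean; any row here should alert the pipeline owner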
File formats — CSV vs JSON vs Parquet
The format invariant: Parquet and other columnar formats (ORC) load faster, compress better, and preserve types; CSV is the lowest-common-denominator and pays a real cost in load time and schema fidelity; JSON works with VARIANT columns and is fine for semi-structured payloads. Default to Parquet for anything you control.
- Parquet / ORC — columnar; preserves types; smallest on-disk size; fastest load.
- CSV — text; needs explicit schema + quote / escape rules; slowest.
- JSON — semi-structured; loads into a VARIANT column; query it with :key path notation.
- Avro — common in streaming; binary; schema embedded.
Worked example. Loading the same 1 M-row dataset:
| format | file size | COPY time on MEDIUM | notes |
|---|---|---|---|
| CSV (gzipped) | 220 MB | 90 s | requires explicit format definition |
| JSON (gzipped) | 280 MB | 70 s | landing into VARIANT column |
| Parquet (snappy) | 60 MB | 12 s | columnar; types preserved |
Step-by-step.
- CSV file is largest and slowest because every row is reparsed as text, type-coerced, and validated.
- JSON is similar but lands into a single VARIANT column — fast for sparse / nested data, awkward for analytics SQL.
- Parquet is columnar and binary; Snowflake reads only the columns it needs; load time drops ~7×.
- Storage costs follow the same ratio — Parquet files compress better.
- If your source format is a choice (ETL between systems you control), pick Parquet.
Worked-example solution. Three file-format definitions:
CREATE OR REPLACE FILE FORMAT ff_csv
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1
NULL_IF = ('','NULL','null')
FIELD_OPTIONALLY_ENCLOSED_BY = '"';
CREATE OR REPLACE FILE FORMAT ff_json
TYPE = JSON
STRIP_OUTER_ARRAY = TRUE;
CREATE OR REPLACE FILE FORMAT ff_parquet
TYPE = PARQUET;
Rule of thumb: between systems you control, Parquet. Across vendor boundaries you cannot change, CSV. For event streams, JSON or Avro.
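If JSON does land in Snowflake, the VARIANT path syntax mentioned above lets you query it without flattening first. A small sketch with hypothetical field names:
CREATE TABLE raw_events (payload VARIANT);
-- pull typed fields out of the semi-structured column
SELECT payload:event_type::STRING AS event_type,
       payload:user.id::NUMBER    AS user_id,
       payload:ts::TIMESTAMP_NTZ  AS event_ts
FROM raw_events
WHERE payload:event_type = 'page_view';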
Common beginner mistakes
- Loading raw CSVs without an explicit FILE FORMAT — fields with embedded commas break silently.
- Running COPY INTO without ON_ERROR — one bad row aborts the entire load.
- Forgetting LOAD_HISTORY and reloading files twice — your fact table doubles silently.
- Picking JSON for tabular data — wastes Snowflake's columnar strengths.
- Storing access keys in code instead of a STORAGE INTEGRATION — leaks the secret.
Snowflake Interview Question on a Daily S3 Drop That Sometimes Has Bad Rows
The team gets a daily 5 GB CSV drop from a partner into S3. About 0.1% of rows have malformed amounts. The current COPY INTO errors and the daily load fails. Design an ingestion pipeline that loads the good rows, captures the bad ones for review, and is idempotent on rerun.
Solution Using ON_ERROR = CONTINUE + a Rejected-Rows Table + LOAD_HISTORY Check
Code solution.
CREATE TABLE raw_orders (
order_id NUMBER,
customer_id NUMBER,
amount NUMBER(14,2),
placed_at TIMESTAMP_NTZ
);
-- VALIDATE() returns error metadata for rejected rows, not the table's own columns
CREATE TABLE rejected_orders (
  error           STRING,
  file            STRING,
  line            NUMBER,
  rejected_record STRING
);
-- main load: skip bad rows but keep going
COPY INTO raw_orders
FROM @prod_s3_stage/orders/
FILE_FORMAT = (FORMAT_NAME = ff_csv)
PATTERN = '.*[.]csv'
ON_ERROR = 'CONTINUE'
RETURN_FAILED_ONLY = FALSE;
-- capture the rejected rows for review
INSERT INTO rejected_orders
SELECT error, file, line, rejected_record
FROM TABLE(VALIDATE(raw_orders, JOB_ID => '_LAST'));
Step-by-step trace.
| step | action | result |
|---|---|---|
| 1 | partner drops orders_2026-05-11.csv | 5 M rows; ~5 k malformed |
| 2 | COPY INTO with ON_ERROR = CONTINUE | loads 4,995,000 good rows |
| 3 | LOAD_HISTORY shows file committed | won't reload on rerun |
| 4 | VALIDATE(... JOB_ID => '_LAST') returns 5 k bad rows | inserted into rejected_orders |
| 5 | partner sees rejection report, fixes upstream | next day cleaner |
Output: the daily load completes; good rows land in raw_orders; bad rows are captured in rejected_orders with their reasons for inspection; LOAD_HISTORY ensures the same file is never loaded twice on rerun.
Why this works — concept by concept:
- ON_ERROR = CONTINUE — partial loads succeed; one bad row doesn't kill the daily pipeline.
- VALIDATE(... JOB_ID => '_LAST') — captures rejected rows from the most recent load for forensic review.
- LOAD_HISTORY idempotency — reruns skip files already committed; safe to retry.
- Separate rejected_orders table — keeps the failure rate visible and reviewable, not silently lost.
- Pattern-based file selection — .*[.]csv ensures only the intended files load.
- Cost — load time is O(rows / warehouse size); the VALIDATE call is metadata-only.
Inline CTA: the canonical ingestion-design syllabus is in ETL System Design for Data Engineering Interviews.
5. Time Travel and zero-copy cloning
Recovery, audit, and instant dev environments
Snowflake's Time Travel lets you query a table as it existed at any point in the recent past (1 day by default, up to 90 days on Enterprise). Zero-copy cloning lets you create a new database, schema, or table that shares the same underlying micro-partitions as the source — no data is copied, the clone is free, and edits diverge from that moment onward. Snowflake Dynamic Tables layer on top of these primitives to give you declarative, automatically-refreshed materialised views for the ELT layer. Together, these features turn data recovery, dev-environment provisioning, and forensic debugging from multi-hour ordeals into single SQL statements.
Pro tip: When asked "what makes Snowflake different operationally?", the two-word answer is "Time Travel and cloning." Both are direct consequences of the immutable-micro-partition storage layer; legacy warehouses cannot offer them because their storage isn't shaped this way.
Time Travel — querying historical state
The Time-Travel invariant: for every table, Snowflake retains the micro-partitions that made up its state for a retention period (default 1 day on Standard, configurable up to 90 days on Enterprise); within that window, AT (TIMESTAMP => …) or BEFORE (STATEMENT => …) clauses return the table's historical state. The feature is the cheapest "we accidentally dropped a table" recovery on the market.
- AT (OFFSET => -3600) — table state 1 hour ago.
- AT (TIMESTAMP => '2026-05-11 14:00:00') — table state at that exact moment.
- BEFORE (STATEMENT => '<query_id>') — table state just before a specific query ran.
- DATA_RETENTION_TIME_IN_DAYS — settable at account, database, schema, or table level; default 1, max 90 (Enterprise).
- UNDROP TABLE / DATABASE — shortcut to restore a dropped object within retention.
Worked example. A junior accidentally truncates dim_customer:
| time | event | what Time Travel can do |
|---|---|---|
| 14:00:00 | dim_customer healthy | (normal) |
| 14:05:32 | TRUNCATE TABLE dim_customer | rows gone |
| 14:07:10 | data team notices | panic |
| 14:08:00 | run INSERT INTO dim_customer SELECT * FROM dim_customer AT (TIMESTAMP => '2026-05-11 14:05:00') | rows restored |
Step-by-step.
- The truncate runs; Snowflake marks the micro-partitions as expired but retains them for the retention window.
- The team finds the query in QUERY_HISTORY and notes its query_id.
- SELECT * FROM dim_customer BEFORE (STATEMENT => 'abc-123-def') returns the table as it existed just before the truncate.
- Wrapping the same query in an INSERT INTO dim_customer restores the data in seconds — no backup tape, no S3 restore.
- Future incidents within the retention window are recoverable the same way.
Worked-example solution. Full recovery script:
-- find the offending query
SELECT query_id, query_text, start_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE query_text ILIKE '%TRUNCATE%dim_customer%'
ORDER BY start_time DESC
LIMIT 1;
-- restore using BEFORE
INSERT INTO dim_customer
SELECT * FROM dim_customer BEFORE (STATEMENT => 'abc-123-def-456');
Rule of thumb: Time Travel saves your weekend the first time someone runs a destructive query in prod. Configure retention to match your "how long until someone notices" SLA.
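Short of a full restore, the OFFSET form is handy for quick before/after audits. A sketch against the same table:
-- compare the table now vs one hour ago
SELECT
  (SELECT COUNT(*) FROM dim_customer)                      AS rows_now,
  (SELECT COUNT(*) FROM dim_customer AT (OFFSET => -3600)) AS rows_one_hour_ago;
-- list rows that existed an hour ago but are gone now
SELECT * FROM dim_customer AT (OFFSET => -3600)
MINUS
SELECT * FROM dim_customer;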
Zero-copy cloning — instant dev / test environments
The cloning invariant: CREATE … CLONE produces a new object (database, schema, or table) that shares the source's underlying micro-partitions; until either side writes, the clone is free in storage; once a clone writes, only the diverged partitions cost extra. Cloning a 100 TB database for a feature branch takes seconds and costs near-zero.
- CREATE TABLE x_clone CLONE x — clone one table.
- CREATE SCHEMA s_dev CLONE s_prod — clone all tables in a schema.
- CREATE DATABASE db_dev CLONE db_prod — clone an entire database.
- Copy-on-write — clones diverge only on the rows that actually change.
- Clone-at-point-in-time — CLONE … AT (TIMESTAMP => …) combines cloning and Time Travel.
Worked example. Stand up a dev environment in 30 seconds:
| step | action | storage cost |
|---|---|---|
| 1 | prod DB_PROD has 100 TB of orders | $$$ |
| 2 | CREATE DATABASE DB_DEV CLONE DB_PROD | 0 (shared micro-partitions) |
| 3 | dev team runs INSERT and UPDATE over a few thousand rows | + a few MB diverged |
| 4 | prod queries still see prod state | isolated |
| 5 | dev queries see clone + diverged state | isolated |
Step-by-step.
- CREATE DATABASE DB_DEV CLONE DB_PROD returns in seconds.
- Snowflake records that every table in DB_DEV points at the same micro-partitions as DB_PROD.
- The dev team can run any DDL or DML against DB_DEV without touching prod data.
- Each write to DB_DEV writes new partitions; the unchanged partitions remain shared.
- When dev is done, DROP DATABASE DB_DEV removes only the diverged partitions; the shared ones stay with prod.
Worked-example solution. Feature-branch dev environment:
-- Snapshot prod at a clean moment
CREATE DATABASE DB_DEV_FEATURE_X
CLONE DB_PROD
AT (TIMESTAMP => CURRENT_TIMESTAMP());
GRANT USAGE ON DATABASE DB_DEV_FEATURE_X TO ROLE dev_role;
GRANT USAGE ON SCHEMA DB_DEV_FEATURE_X.public TO ROLE dev_role;
GRANT SELECT ON ALL TABLES IN SCHEMA DB_DEV_FEATURE_X.public TO ROLE dev_role;
Rule of thumb: every dev team gets their own clone. The cost of cloning is so low that "no shared dev environment" should be your default policy.
Retention windows and the cost of long Time Travel
The retention-cost invariant: retaining expired micro-partitions for the Time-Travel window costs storage, not compute; longer retention = more storage; the math is small for tables that change slowly, larger for high-churn tables. Pick the retention window per table based on (a) how long it takes to notice mistakes and (b) how much the table churns.
- Standard edition — max retention 1 day.
- Enterprise edition — max 90 days; default still 1.
- ALTER TABLE … SET DATA_RETENTION_TIME_IN_DAYS = N — per-table override.
- Fail-safe — additional 7-day retention beyond Time Travel; Snowflake-managed, not user-queryable.
- Cost driver — high-churn tables (frequent UPDATE / DELETE) accumulate many historical partitions.
Worked example. Retention cost per table:
| table | churn (rows/day) | retention (days) | extra storage |
|---|---|---|---|
| dim_customer | low (1 k changes) | 30 | tiny |
| fact_clicks (insert-only) | high (50 M rows/day) | 7 | ~350 M rows of retained history |
| dim_product (rarely changes) | almost zero | 90 | tiny |
| staging_* (volatile) | high | 1 | minimal |
Step-by-step.
- For low-churn dimensions, longer retention costs almost nothing — those tables rarely write new partitions.
- For high-churn facts, retention multiplies the storage cost proportionally.
- Staging tables don't need 7+ days — they're rebuilt daily; set retention to 1.
- Production facts that you'd want to recover from a logic bug deserve 7-14 days.
- Per-table tuning is more cost-effective than a single account-wide retention.
Worked-example solution. Right-sized retention:
ALTER TABLE dim_customer SET DATA_RETENTION_TIME_IN_DAYS = 30;
ALTER TABLE fact_clicks SET DATA_RETENTION_TIME_IN_DAYS = 7;
ALTER TABLE staging_orders SET DATA_RETENTION_TIME_IN_DAYS = 1;
Rule of thumb: the longer retention, the longer your safety net; the longer retention on high-churn tables, the higher the storage bill. Tune per table.
Common beginner mistakes
- Assuming Time Travel works forever — the default is 1 day; past that, you need Enterprise + per-table retention.
- Treating fail-safe as user-queryable — it is not; only Snowflake support can recover from fail-safe.
- Cloning to "back up" — clones share storage; if you DROP the source, the clone is unaffected, but a clone is not a true off-cluster backup.
- Forgetting that updates erode retention — a heavy UPDATE on a fact table can blow up storage costs if retention is long.
- Querying historical data with AT (OFFSET => -86400) when the table's retention is 0 — the query simply errors.
Snowflake Interview Question on Recovering a Mistakenly Dropped Production Table
A junior runs DROP TABLE dim_customer; in prod at 14:05:32 UTC. The team notices at 14:08:00. The account is on Enterprise edition; the dim_customer table has 90-day retention configured. Recover the table with zero data loss and zero downtime for downstream dashboards.
Solution Using UNDROP TABLE (and a fallback to CLONE … AT (TIMESTAMP => …))
Code solution.
-- fast path: UNDROP restores the table object and its data
UNDROP TABLE dim_customer;
-- or, if the table name has already been reused, clone the historical state
CREATE TABLE dim_customer_restored
CLONE dim_customer AT (TIMESTAMP => '2026-05-11 14:05:00'::TIMESTAMP);
-- verify row counts vs the source-of-truth replica
SELECT COUNT(*) FROM dim_customer;
Step-by-step trace.
| step | time | action | result |
|---|---|---|---|
| 1 | 14:05:32 | DROP TABLE dim_customer runs | table dropped; metadata moved to "dropped" |
| 2 | 14:08:00 | engineer notices | table still recoverable via Time Travel |
| 3 | 14:08:30 | UNDROP TABLE dim_customer | table restored with full data |
| 4 | 14:08:45 | SELECT COUNT(*) FROM dim_customer | matches the row count before the drop |
| 5 | 14:09:00 | downstream dashboards re-run | green |
Output: the table is back in place with every row intact; downstream queries that fired between 14:05:32 and 14:08:30 errored but those errors are transient and the next refresh succeeds; total recovery time ≈ 3 minutes.
Why this works — concept by concept:
- UNDROP TABLE — Snowflake's shortcut for restoring a dropped object within the retention window; one statement, instant.
- 90-day retention — available on Enterprise edition; absorbs the worst-case "we noticed a week later" recovery scenario.
- CLONE … AT (TIMESTAMP => …) — fallback if the table name was already reused; recreates the historical state as a new table.
- Zero data motion — the dropped table's micro-partitions never left storage; recovery is metadata-only.
- No external backup needed — Time Travel is the backup for the retention window.
- Cost — the restore is metadata-only; the retention storage cost was paid throughout the 90 days regardless of whether anyone used it.
Inline CTA: drill recovery and data-quality scenarios on the ETL practice page.
6. Performance optimization
Micro-partitions, query pruning, result caching, and clustering
Snowflake's performance story is built on three automatic layers — micro-partitions (the storage unit), query pruning (skip partitions whose stats prove they cannot match), and result caching (serve identical recent queries with no compute). On top of that, you can guide the optimiser with clustering keys for very large tables that need predictable partition layout. You will rarely tune indexes (there aren't any) — instead, you tune which partitions exist and which the planner can skip.
Pro tip: When a Snowflake query is slow, the right diagnostic is the query profile in the UI — look at partitions scanned vs total. If you're scanning 100% of partitions for a date-range query, the table either lacks a useful cluster key or the predicate isn't pruneable.
Micro-partitions and automatic clustering
The micro-partition invariant: Snowflake automatically chops every table into immutable compressed columnar files (each holding roughly 50–500 MB of uncompressed data), each carrying per-column min/max/distinct statistics; "clustering" is the optional act of giving Snowflake a hint about which column(s) should drive partition layout for predictable date-range or key-range pruning. Most tables don't need explicit clustering; the very large ones do.
- Automatic partitioning — every insert produces new partitions; no DDL needed.
- Statistics per partition — min/max/distinct for every column; powers pruning.
- CLUSTER BY (col) — a hint that Snowflake should keep partitions ordered on that column.
- Re-clustering — background service that recompacts partitions when clustering drifts.
- Pruning ratio — partitions scanned / total partitions; visible in the query profile.
Worked example. A fact_clicks table at 50 B rows:
| design | predicate WHERE click_date = '2026-05-10' | partitions scanned |
|---|---|---|
| no cluster key | natural append order | 100% (no pruning) |
| CLUSTER BY (click_date) | partitions sorted by date | 0.3% |
Step-by-step.
- Without clustering, Snowflake's per-partition min/max for click_date covers the whole date range — every partition might match.
- With CLUSTER BY (click_date), each partition's date range is tight; the planner can skip every partition outside the predicate.
- The query goes from a full table scan to a needle-in-a-haystack pull.
- Re-clustering runs in the background as new data arrives, keeping the layout tight.
- The cluster key should match the most common range predicate, not every predicate.
Worked-example solution. Cluster a high-volume fact:
ALTER TABLE fact_clicks CLUSTER BY (click_date, customer_id);
-- check clustering depth (1.0 = perfectly clustered, higher = worse)
SELECT SYSTEM$CLUSTERING_INFORMATION('fact_clicks', '(click_date)');
Rule of thumb: most tables (< 1 TB) don't need explicit clustering. The very large date-partitioned facts do, and the cluster key is almost always the date column.
Query pruning — skip partitions whose stats prove they cannot match
The pruning invariant: the optimiser uses per-partition column statistics to prove that some partitions cannot contain rows matching the WHERE predicate; those partitions are skipped — never read from storage — and the query reads only the relevant subset. Pruning is automatic and invisible until you check the query profile.
- Date predicates — WHERE date_col BETWEEN x AND y prunes by partition date range.
- Equality predicates — WHERE col = x prunes by per-partition min/max.
- IN (...) lists — pruned value by value.
- Function-wrapped columns — WHERE DATE(ts) = … may not prune; raw column comparisons do.
- Profile shows Partitions scanned: 12 / 4,567 — that ratio is the pruning signal.
Worked example. Same fact_clicks query, different predicates:
| predicate | partitions scanned |
|---|---|
| WHERE click_date = '2026-05-10' | 12 / 4,567 (0.3%) |
| WHERE customer_id = 4242 | 4,567 / 4,567 (no pruning unless clustered) |
| WHERE DATE(click_ts) = '2026-05-10' | 4,567 / 4,567 (function disables pruning) |
Step-by-step.
- Date predicate matches the cluster key; pruning is excellent.
- Customer-id predicate cannot prune because customer_ids are scattered across all partitions.
- Wrapping the date column in DATE(…) disables pruning because Snowflake cannot use min/max on the computed value.
- The query profile makes this visible — "Partitions scanned: X / Y" is the first line to read.
- The fix for the function-wrapped predicate is to compare the raw column: WHERE click_ts >= '2026-05-10' AND click_ts < '2026-05-11'.
Worked-example solution. Pruning-friendly date filter:
-- prunes
SELECT * FROM fact_clicks
WHERE click_ts >= '2026-05-10'
AND click_ts < '2026-05-11';
-- does NOT prune
SELECT * FROM fact_clicks
WHERE DATE(click_ts) = '2026-05-10';
Rule of thumb: keep predicates on the raw clustered column. Anything that wraps the column in a function disables the planner's ability to use partition statistics.
Result caching — free wins for repeated queries
The cache invariant: the cloud-services layer remembers the result of every query for 24 hours; an identical query against unchanged tables returns the cached result instantly, with zero warehouse compute. The cache is account-wide — different users running the same SQL share the same cached result.
- Cache lifetime — 24 hours of inactivity; extends with re-use up to 31 days.
- Cache key — exact SQL text + same underlying data state.
- Invalidation — any change to a referenced table, or any non-deterministic function in the query (e.g. CURRENT_TIMESTAMP).
- No warehouse needed — the cache responds even when the warehouse is suspended.
- Cache misses — the warehouse runs the query and the result is cached for the next user.
Worked example. Three runs of the same dashboard query:
| run | warehouse compute | latency | credit cost |
|---|---|---|---|
| 1 | warehouse runs | 2,400 ms | small |
| 2 (cache hit) | none | 80 ms | 0 |
| 3 (after data change) | warehouse runs | 2,400 ms | small |
Step-by-step.
- First analyst clicks the dashboard tile; the warehouse runs the query; result cached.
- Second analyst clicks the same tile minutes later; cache hit; warehouse is suspended the entire time.
- ETL adds new rows to the underlying table; cache invalidates.
- Third analyst clicks; cache miss; warehouse spins back up to recompute.
- The pattern dominates BI workloads — most dashboard refreshes hit the cache because the data only changes once a day.
Worked-example solution. Verify cache behaviour:
SHOW PARAMETERS LIKE 'USE_CACHED_RESULT'; -- TRUE by default
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
SELECT COUNT(*) FROM fact_orders; -- compute
SELECT COUNT(*) FROM fact_orders; -- cache hit (~80 ms)
Rule of thumb: never benchmark Snowflake without disabling result cache for the test. Production benefits hugely from the cache; benchmarks lie when you don't account for it.
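One way to apply that rule — a small benchmarking sketch that switches the result cache off for the session, runs the timings, and restores the default:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;  -- every run below hits the warehouse
SELECT COUNT(*) FROM fact_orders;             -- timed run 1
SELECT COUNT(*) FROM fact_orders;             -- timed run 2 (no result-cache hit; the warehouse's
                                              -- local data cache may still warm it — suspend and
                                              -- resume the warehouse for a fully cold run)
ALTER SESSION SET USE_CACHED_RESULT = TRUE;   -- restore the default once the benchmark is done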
Common beginner mistakes
- Trying to create B-tree indexes — Snowflake has none; tuning happens via clustering and partition pruning.
- Clustering small tables — overhead exceeds benefit until you cross ~1 TB.
- Wrapping clustered columns in functions in WHERE — disables pruning silently.
- Believing the result cache is "always on" — any underlying-data change invalidates it.
- Benchmarking with cache enabled — produces misleadingly low numbers; disable cache for honest tests.
Snowflake Interview Question on Speeding Up a 60-Second Daily Report
The daily revenue report on a 5 B-row fact_orders table takes 60 seconds. The query is SELECT customer_id, SUM(amount) FROM fact_orders WHERE order_date = CURRENT_DATE GROUP BY 1. Get it under 5 seconds without buying a bigger warehouse.
Solution Using Clustering on order_date + a Raw-Column Predicate + Materialised View
Code solution.
-- 1. cluster the table by the most common range predicate
ALTER TABLE fact_orders CLUSTER BY (order_date);
-- 2. rewrite the predicate so it can prune
SELECT customer_id, SUM(amount)
FROM fact_orders
WHERE order_date = CURRENT_DATE -- raw column, prunes
GROUP BY customer_id;
-- 3. for repeated daily access, create a materialised view
CREATE MATERIALIZED VIEW mv_daily_customer_revenue AS
SELECT order_date, customer_id, SUM(amount) AS revenue
FROM fact_orders
GROUP BY 1, 2;
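One possible shape for the daily read once the view exists (note that materialised views require Enterprise Edition or above; the column names come from the definition above):
-- the daily report now reads a small pre-aggregated slice instead of the 5 B-row fact
SELECT customer_id, revenue
FROM   mv_daily_customer_revenue
WHERE  order_date = CURRENT_DATE
ORDER  BY revenue DESC;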
Step-by-step trace.
| step | action | result |
|---|---|---|
| 1 | baseline scan of entire 5 B rows | 60 s |
| 2 |
CLUSTER BY (order_date) (background re-cluster runs) |
tighter date min/max per partition |
| 3 | rerun query — pruning kicks in | 4 s |
| 4 | result cache hit on repeated daily reads | < 100 ms |
| 5 | optional MV for sub-second pre-aggregated rollup | < 50 ms |
Output: the daily report goes from 60 s to 4 s on the first run of the day (clustered scan), then milliseconds for repeat hits (result cache). The materialised view turns even cold runs into milliseconds for the pre-aggregated rollup.
Why this works — concept by concept:
- CLUSTER BY (order_date) — partitions co-locate by date, so the predicate prunes ~99.5% of them.
- Raw-column predicate — WHERE order_date = CURRENT_DATE is pruneable; wrapping the column in a function would not be.
- Result cache for repeats — second and subsequent analysts to view the dashboard get sub-100 ms without warehouse compute.
- Materialised view — pre-aggregates the rollup; daily query becomes a tiny aggregate read.
- No bigger warehouse needed — performance gains come from scanning less data, not throwing more compute at the same scan.
- Cost — clustering adds background re-cluster cost; the MV adds maintenance cost; both are far smaller than running an XLARGE warehouse for 60 s × N analysts.
Inline CTA: sharpen pruning-aware SQL on the SQL practice page and the aggregation topic.
7. Snowflake vs Redshift vs BigQuery
How the big warehouses differ — and how to pick one
Three cloud data warehouses dominate modern data-engineering interviews: Snowflake, Amazon Redshift, and Google BigQuery. Snowflake vs Databricks is the next-most-asked comparison, and Azure shops often add Synapse to the shortlist — so be ready for any pairing in the wider Snowflake / Databricks / BigQuery / Synapse decision. All four serve analytical workloads at scale; they differ in operational model, pricing, cloud lock-in, and how they pair with transformation tools like dbt (the dbt Snowflake integration is the de-facto modeling layer for most Snowflake teams). Knowing the trade-offs in one or two sentences each is enough to handle the "why this one?" interview follow-up.
Pro tip: Never say "X is best." Always frame the answer as which tool fits which workload. Interviewers test whether you understand the trade-offs, not whether you can pick a winner.
Snowflake vs Redshift — compute/storage coupling and cloud lock-in
The Redshift comparison invariant: classic Redshift (provisioned) tightly couples compute and storage; modern Redshift Serverless decouples them and looks more like Snowflake; Snowflake is multi-cloud while Redshift is AWS-only; Snowflake's ease-of-use is consistently rated higher but Redshift can be cheaper at steady-state on AWS. Pick Redshift if you are deep in AWS and want the cheapest steady-state bill; pick Snowflake if you need multi-cloud, easier ops, or per-team isolation.
- Cloud support — Snowflake: AWS / GCP / Azure. Redshift: AWS only.
- Compute/storage — Snowflake: fully separated. Redshift: provisioned = coupled; Serverless = separated.
- Maintenance — Snowflake: near-zero. Redshift: some tuning (VACUUM, ANALYZE, distkey, sortkey).
- Scaling — Snowflake: easier; resize in seconds. Redshift: provisioned resize involves data redistribution.
- Semi-structured — Snowflake: excellent native VARIANT. Redshift: good SUPER type.
Worked example. Same workload on both:
| dimension | Snowflake | Redshift (provisioned) |
|---|---|---|
| spin up a warehouse | 5 s | minutes |
| resize compute | seconds, no data motion | minutes, data redistribution |
| add a TB of data | no compute change | may need to resize cluster |
| 10 concurrent dashboards | multi-cluster auto-scales | needs the Concurrency Scaling add-on |
| credit billing | per-second | per-hour (provisioned) / per-second (Serverless) |
Step-by-step.
- Both can serve the same analytical SQL workload at scale.
- Snowflake's operational ergonomics are simpler — no VACUUM, no manual sort/dist keys, easier resize.
- Redshift on AWS is often cheaper at steady-state because AWS sells it at a discount within its own ecosystem.
- Multi-cloud needs (data in GCP + analytics in AWS) lean strongly Snowflake.
- The choice is rarely about features — it is about who manages the warehouse and how aggressive the cost target is.
Worked-example solution. Quick comparison line for an interview:
Snowflake : multi-cloud, separated compute/storage, near-zero maintenance,
per-second billing, excellent VARIANT.
Redshift : AWS-only, provisioned = coupled / Serverless = separated,
some tuning required, cheaper at steady-state on AWS.
Rule of thumb: deep AWS shop with steady utilisation → Redshift. Multi-cloud, spiky workload, small ops team → Snowflake.
Snowflake vs BigQuery — warehouse compute vs serverless
The BigQuery comparison invariant: BigQuery is serverless — no warehouses, just a query that scans bytes and is billed per byte scanned; Snowflake bills per second of warehouse uptime; BigQuery is GCP-only; both have excellent semi-structured support; the cost model fundamentally differs. Pick BigQuery if you're on GCP and want zero compute management; pick Snowflake if you want multi-cloud or predictable monthly compute spend.
- Cloud — Snowflake: AWS / GCP / Azure. BigQuery: GCP only.
- Compute model — Snowflake: virtual warehouses. BigQuery: serverless slots.
- Pricing — Snowflake: warehouse credits per second. BigQuery: per TB scanned (on-demand) or flat-rate slots.
- Concurrency — Snowflake: explicit warehouse choice. BigQuery: implicit; slots dynamically assigned.
- Cost predictability — Snowflake: more predictable (you choose the warehouse). BigQuery: depends entirely on query patterns.
Worked example. Cost shape for one query:
| query | Snowflake (SMALL warehouse) | BigQuery (on-demand) |
|---|---|---|
| 1 TB scan | billed for warehouse uptime — a SMALL runs at 2 credits/hr (roughly $2–4 per credit, by edition) | 1 TB × $5/TB = $5 |
| 100 different queries over the same data | warehouse keeps running — cost scales with uptime, not bytes | up to 100 TB × $5 = $500 |
| identical query repeated (cache hit) | free — result cache, no warehouse compute | free for 24 h — BigQuery query cache |
Step-by-step.
- For one-off queries, costs are similar.
- For repeated identical queries, both have result caches that make subsequent runs free.
- For repeated different queries on the same data, BigQuery scales with bytes scanned per query; Snowflake scales with warehouse uptime.
- Heavy ad-hoc exploration may be cheaper on Snowflake (one warehouse, many queries) than BigQuery (each query bills bytes).
- Heavy variable workloads with idle gaps may be cheaper on BigQuery (no warehouse to suspend).
Worked-example solution. Quick interview-shape comparison:
Snowflake : warehouses, per-second compute billing, multi-cloud,
predictable cost if warehouses are right-sized.
BigQuery : serverless, per-byte-scanned billing, GCP-only,
cost scales with bytes per query — partition / cluster
tables aggressively to keep bytes small.
Rule of thumb: GCP shop with many small queries → BigQuery. Multi-cloud or heavy ad-hoc analyst workload → Snowflake.
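For the BigQuery side of that advice ("partition / cluster tables aggressively to keep bytes small"), a minimal sketch in BigQuery's standard SQL — the dataset and table names are illustrative:
-- BigQuery: partition by day and cluster by customer so on-demand queries scan fewer bytes
CREATE TABLE analytics.fact_clicks
PARTITION BY DATE(click_ts)
CLUSTER BY customer_id AS
SELECT * FROM analytics.fact_clicks_raw;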
When to pick what — the one-paragraph decision
The selection invariant: pick the warehouse that matches (a) the cloud your data already lives in, (b) the workload shape (steady vs spiky, dashboards vs ad-hoc), and (c) the size of your data-engineering team; do not over-rotate on features — all three serve the analytical workload well. The bigger questions are operational and economic, not technical.
- Cloud-first — match the warehouse to where the data lives (cross-cloud egress is real money).
- Workload-first — bursty / spiky → Snowflake or BigQuery on-demand; steady → Redshift Serverless or BigQuery flat-rate.
- Team-size-first — small team needs near-zero maintenance → Snowflake or BigQuery; bigger team can afford Redshift tuning.
- Cost-first — steady AWS workload → Redshift; multi-cloud or bursty → Snowflake.
Worked example. Decision matrix for three scenarios:
| scenario | best fit | reason |
|---|---|---|
| 5 TB, AWS-only, steady analyst workload, small team | Snowflake or Redshift Serverless | both work; Snowflake is easier |
| 500 GB, GCP-only, dashboards | BigQuery | native fit; no warehouse to size |
| 50 TB, multi-cloud, weekly backfills | Snowflake | only one that's multi-cloud + handles spikes |
Step-by-step.
- Start with cloud — if you're locked to AWS or GCP, you've narrowed the options.
- Then workload shape — steady-state vs bursty changes whether warehouse-based billing or per-byte billing is cheaper.
- Then team size — smaller teams pay for managed-service simplicity in implicit hours saved.
- Cost is the last check — all three are within 2× of each other for most workloads.
- The "right" answer is rarely a technical one; it's the one your team can operate without burning out.
Worked-example solution. Decision flowchart in text:
Q: which cloud is the source data on?
AWS only → Redshift or Snowflake (Snowflake if multi-cloud likely)
GCP only → BigQuery or Snowflake
Azure only → Snowflake (BigQuery is GCP-only)
Multi-cloud → Snowflake
Q: workload shape?
Steady, predictable → flat-rate (Redshift Serverless / BigQuery flat-rate)
Bursty, mostly idle → on-demand (Snowflake auto-suspend / BigQuery on-demand)
Q: team size + ops appetite?
Small + want easy → Snowflake or BigQuery
Big + want control → Redshift provisioned
Rule of thumb: the right answer is the one your team can operate at 3 AM without paging an expert.
Common beginner mistakes
- Declaring one warehouse "best" — the correct answer is always conditional on the workload.
- Comparing on-demand BigQuery to provisioned Redshift — different cost models entirely.
- Forgetting cross-cloud egress charges when picking a warehouse on a different cloud than your source.
- Overestimating Snowflake's premium over Redshift — at steady state, the gap is often smaller than the operational savings.
- Underestimating ease-of-use — engineering hours saved by a managed warehouse are real money.
Snowflake Interview Question on Choosing a Warehouse for a Specific Scenario
You're advising a startup: their product runs on AWS, they have ~10 TB of analytical data growing 1 TB/month, three full-time analysts, and a small data-engineering team. They want sub-second BI dashboards and a daily ETL. Budget is "reasonable, not unlimited." Pick one warehouse and defend the choice in two paragraphs.
Solution Using Snowflake on AWS with Per-Team Warehouses + Auto-Suspend
Code solution.
-- ETL warehouse
CREATE WAREHOUSE WH_ETL
WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
-- BI warehouse with multi-cluster for concurrency
CREATE WAREHOUSE WH_BI
WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE
MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 3;
-- analyst warehouse for ad-hoc
CREATE WAREHOUSE WH_ANALYSTS
WITH WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
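The budget constraint ("reasonable, not unlimited") can optionally be enforced with a resource monitor; the quota below is illustrative, and creating monitors requires the ACCOUNTADMIN role:
-- optional guard-rail: monthly credit cap shared by the three warehouses (quota illustrative)
CREATE RESOURCE MONITOR RM_MONTHLY
  WITH CREDIT_QUOTA = 200
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80  PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE WH_ETL      SET RESOURCE_MONITOR = RM_MONTHLY;
ALTER WAREHOUSE WH_BI       SET RESOURCE_MONITOR = RM_MONTHLY;
ALTER WAREHOUSE WH_ANALYSTS SET RESOURCE_MONITOR = RM_MONTHLY;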
Step-by-step trace of the decision:
| consideration | answer |
|---|---|
| source-data cloud | AWS — both Snowflake and Redshift fit |
| workload | mixed: nightly ETL (bursty) + BI (concurrent) + analyst (spiky) |
| team size | small data-eng team |
| size today | 10 TB, growing 1 TB/month |
| operational burden tolerance | low |
| result | Snowflake on AWS |
Output: the startup gets sub-second BI (result cache + multi-cluster WH_BI), idle warehouses auto-suspend (cost), the data-engineering team doesn't spend Friday afternoons on VACUUM and distkey tuning (no Redshift-style maintenance), and a future move off AWS doesn't force a warehouse migration.
Why this works — concept by concept:
- Snowflake on AWS — keeps the data on the same cloud as the product; same-cloud egress is minimal.
- Three warehouses — isolates ETL from BI from analysts; one slow query never blocks another.
- AUTO_SUSPEND = 60 (seconds) — idle compute is not billed; the warehouses run only when there's work.
- Multi-cluster WH_BI — handles the 9 AM dashboard concurrency spike without a bigger warehouse.
- Near-zero maintenance — no VACUUM, no distkey, no manual partition tuning; a small team can run it.
- Cost — proportional to actual usage; the always-on cost of a Redshift provisioned cluster is the worst fit for a startup's spiky workload.
Inline CTA: see the full ETL-and-warehouse playbook in ETL System Design for Data Engineering Interviews.
8. Choosing Snowflake (checklist)
| If your workload looks like… | Snowflake is a good fit because… | Watch out for… |
|---|---|---|
| Analytical SQL over 10 GB–10 PB | Columnar storage + parallel compute | Tiny datasets are cheaper on DuckDB / SQLite |
| Multi-team concurrent dashboards | Per-team warehouses + multi-cluster | Forgetting AUTO_SUSPEND |
| Multi-cloud or cloud-agnostic | Runs on AWS, GCP, Azure | Cross-region egress costs |
| Spiky workloads with idle gaps | Per-second billing + auto-suspend | Always-on warehouses are wasteful |
| Need Time Travel + dev clones | Built-in, zero-cost cloning | Long retention on high-churn tables is expensive |
| Semi-structured JSON / Parquet | First-class VARIANT type + COPY INTO JSON | Storing tabular data as JSON wastes the columnar benefits |
Pro tip: When you propose Snowflake in a system-design round, immediately name the three layers and the per-team warehouse split. Those two sentences turn a generic answer into one that signals you've actually run Snowflake in production.
9. Frequently asked questions
What is Snowflake?
Snowflake is a cloud-native data warehouse / data platform built on three independent layers — storage, compute (virtual warehouses), and cloud services. It runs as a managed service on AWS, GCP, and Azure, and is optimised for analytical SQL workloads over very large datasets.
What is a virtual warehouse?
A virtual warehouse is a named, sized, isolated compute cluster that runs your queries. Warehouses can be created, suspended, resumed, and resized independently of one another and independently of the data they read. The pricing model is per-second of warehouse uptime.
What is separation of compute and storage?
Snowflake stores every table on cloud object storage that is decoupled from any compute cluster. Compute (virtual warehouses) and storage scale independently — you can resize compute in seconds with no data motion and add petabytes of storage without changing compute sizing.
What is Time Travel?
Time Travel is the ability to query a table's historical state via AT (TIMESTAMP => …) or BEFORE (STATEMENT => …) clauses, within the table's retention window (1–90 days). It powers UNDROP TABLE and accidental-write recovery without external backups.
What is zero-copy cloning?
Zero-copy cloning uses CREATE … CLONE to produce a new database, schema, or table that shares the source's underlying micro-partitions. No data is copied at clone time; the clone diverges only when either side writes. Ideal for instant dev / test environments.
How does Snowflake compare to Redshift and BigQuery?
Snowflake runs on AWS, GCP, and Azure with fully separated compute and storage. Redshift is AWS-only and was historically coupled (Redshift Serverless changes that). BigQuery is serverless and GCP-only, billing per byte scanned. Pick by cloud, workload shape, and team size, not by feature count.
How long does it take to learn Snowflake?
If your SQL fluency is solid, the core ideas (warehouses, separation of compute and storage, COPY INTO, Time Travel, cloning) take 1–2 weeks of focused practice. Advanced topics (clustering, materialised views, streams + tasks, multi-cluster tuning) take another 2–4 weeks of real-world use.
10. Practice on PipeCode
PipeCode ships 450+ data engineering practice problems — SQL uses the PostgreSQL dialect, with editorials and topics aligned to the same patterns Snowflake interviewers ask. Start from Explore practice →, open SQL practice →, filter by ETL → or aggregations →, and see plans → when you want the full library.