SQL Course Notes: From First SELECT to Production-Ready Queries
Structured Query Language (SQL) is how applications, analysts, and data engineers talk to relational databases. This post is a detailed course-style reference: the relational model, DDL and DML, joins and aggregations, subqueries and window functions, indexes and transactions, normalization, and how SQL fits cloud data work—written so you can study once and return when debugging slow queries or designing schemas.
In short
Tables hold rows; keys link tables; SQL reads and writes through declarative statements. Master SELECT + JOIN + GROUP BY first, then indexes and transactions, then dialect-specific features (PostgreSQL, MySQL) and how SQL connects to pipelines and ML feature stores.
Why SQL still matters in 2026
Applications store state somewhere. For decades, the default has been the relational database: rows in tables, relationships enforced by keys, and a shared language—SQL—to define structure and ask questions. NoSQL, data lakes, and vector stores expanded the toolbox, but SQL remains the lingua franca for transactional systems, reporting, feature engineering, and increasingly for analytics warehouses (Snowflake, BigQuery, Redshift all speak SQL dialects).
A solid SQL course teaches three layers at once:
- Conceptual — entities, relationships, integrity, and when normalization helps or hurts.
- Language — statements you type in a client or ORM-generated query.
- Operational — indexes, plans, locks, backups, and how cloud managed databases change defaults.
This post walks all three. Examples use ANSI-style SQL with notes where PostgreSQL and MySQL diverge—matching the stacks I use in platform and application work.
The relational model in one picture
Think in tables (relations), not spreadsheets with formulas. Each table has a fixed set of columns (attributes) and zero or more rows (tuples). Relationships between tables are expressed by storing keys, not by drawing arrows in application memory.
| Concept | Meaning | Example |
|---|---|---|
| Primary key (PK) | Uniquely identifies a row in its table | customer_id |
| Foreign key (FK) | References a PK in another table; enforces referential integrity | orders.customer_id → customers.customer_id |
| Candidate key | Any column set that could be PK (email, national ID) | UNIQUE (email) |
| NULL | “Unknown / not applicable,” not zero or empty string | Optional middle name |
Sample schema for a small e-commerce domain:
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL UNIQUE,
full_name VARCHAR(200) NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INT NOT NULL REFERENCES customers(customer_id),
order_date DATE NOT NULL DEFAULT CURRENT_DATE,
status VARCHAR(20) NOT NULL DEFAULT 'pending'
);
CREATE TABLE order_items (
order_id INT NOT NULL REFERENCES orders(order_id) ON DELETE CASCADE,
product_id INT NOT NULL,
quantity INT NOT NULL CHECK (quantity > 0),
unit_price NUMERIC(10,2) NOT NULL,
PRIMARY KEY (order_id, product_id)
);
SQL statement families (DDL, DML, DCL, TCL)
| Family | Purpose | Common statements |
|---|---|---|
| DDL (Data Definition) | Shape of the database | CREATE, ALTER, DROP, TRUNCATE |
| DML (Data Manipulation) | Rows in tables | SELECT, INSERT, UPDATE, DELETE, MERGE |
| DCL (Data Control) | Permissions | GRANT, REVOKE |
| TCL (Transaction Control) | Unit of work boundaries | BEGIN, COMMIT, ROLLBACK, SAVEPOINT |
Course labs usually spend most time on SELECT and JOIN; production work adds DDL migrations (Flyway, Liquibase, Alembic) and transaction discipline.
Reading data: SELECT and filtering
SELECT is declarative: you describe the result shape; the optimizer chooses how to fetch it.
SELECT c.full_name, o.order_id, o.order_date, o.status
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2026-01-01'
AND o.status IN ('shipped', 'delivered')
ORDER BY o.order_date DESC, o.order_id
LIMIT 50;
Clause order of evaluation (logical, not always physical): FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT.
Filtering tools to know cold:
- Comparisons —
=,<>,<,BETWEEN,IN,LIKE(patterns; preferILIKEin Postgres for case-insensitive). - NULL logic —
IS NULL/IS NOT NULL; rememberNULL = NULLis unknown, not true. - Boolean logic —
AND,OR,NOT; use parentheses when mixing conditions. - Sets —
UNION(dedupe) vsUNION ALL(keep duplicates, often faster).
Joins: how tables combine
Joins are the heart of relational SQL. You match rows on a predicate—usually equality of keys.
| Join | Result rows | When to use |
|---|---|---|
| INNER JOIN | Only matches on both sides | Default: “customers who have orders” |
| LEFT JOIN | All left rows; NULLs where no right match | “All customers, including those with zero orders” |
| RIGHT JOIN | Mirror of LEFT | Rare; swap tables and use LEFT instead |
| FULL OUTER | All rows from both; NULLs where missing | Reconciliation, data quality diffs |
| CROSS JOIN | Cartesian product | Calendar generation, combinatorics—use carefully |
Count orders per customer (aggregation + join):
SELECT c.customer_id,
c.full_name,
COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.full_name
ORDER BY order_count DESC;
Anti-join pattern (customers with no orders):
SELECT c.*
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_id IS NULL;
Or with NOT EXISTS, which often optimizes well:
SELECT c.*
FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);
Aggregations and GROUP BY
Aggregate functions collapse many rows into one value per group: COUNT, SUM, AVG, MIN, MAX, and in PostgreSQL STRING_AGG, ARRAY_AGG, JSON_AGG.
Rule: every non-aggregated column in SELECT must appear in GROUP BY (or be functionally dependent on the group key in engines that allow it).
SELECT DATE_TRUNC('month', o.order_date) AS month,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'delivered'
GROUP BY 1
HAVING SUM(oi.quantity * oi.unit_price) > 10000
ORDER BY month;
WHERE filters rows before grouping; HAVING filters after grouping. Putting row filters in HAVING is a common course mistake—it works but confuses readers and can block index use.
Subqueries, CTEs, and readable SQL
A subquery nests inside another statement. A Common Table Expression (CTE) names a subquery with WITH for clarity and reuse:
WITH monthly_revenue AS (
SELECT DATE_TRUNC('month', o.order_date) AS month,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'delivered'
GROUP BY 1
)
SELECT month, revenue,
revenue - LAG(revenue) OVER (ORDER BY month) AS month_over_month
FROM monthly_revenue
ORDER BY month;
When to prefer each:
- CTE — Multi-step reports, readable pipelines, same subquery referenced twice.
- Scalar subquery — Single value in SELECT or WHERE (
WHERE total > (SELECT AVG(total) FROM orders)). - EXISTS — Semi-joins (“has at least one related row”) without duplicating parent rows.
- IN — Small static lists; for large sets,
JOINorEXISTSoften plans better.
PostgreSQL also supports recursive CTEs for hierarchies (org charts, bill of materials)—worth a second course module.
Window functions (analytics without collapsing rows)
Aggregates return one row per group. Window functions add a column while keeping row granularity, using OVER (PARTITION BY … ORDER BY …).
SELECT order_id,
customer_id,
order_date,
SUM(quantity * unit_price) OVER (PARTITION BY order_id) AS order_total,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS nth_order
FROM orders o
JOIN order_items oi USING (order_id);
Functions to learn in order:
- Ranking —
ROW_NUMBER,RANK,DENSE_RANK - Offsets —
LAG,LEADfor period-over-period metrics - Frames —
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWfor running totals
Writing and changing data
INSERT INTO customers (email, full_name)
VALUES ('[email protected]', 'Ada Lovelace')
RETURNING customer_id;
UPDATE orders
SET status = 'cancelled'
WHERE order_id = 42 AND status = 'pending';
DELETE FROM order_items
WHERE order_id = 42;
Upsert (insert or update on conflict) is dialect-specific but essential in APIs:
-- PostgreSQL
INSERT INTO customers (email, full_name)
VALUES ('[email protected]', 'Ada Lovelace')
ON CONFLICT (email) DO UPDATE
SET full_name = EXCLUDED.full_name;
Always scope UPDATE and DELETE with WHERE. An unscoped UPDATE in production is a résumé-generating event.
Constraints and data integrity
Integrity belongs in the database when multiple apps or batch jobs touch the same tables:
- NOT NULL — Column must have a value.
- UNIQUE — No duplicate values (NULLs usually allowed once per SQL standard).
- PRIMARY KEY — NOT NULL + unique row identity.
- FOREIGN KEY — Parent row must exist; choose
ON DELETE CASCADE,SET NULL, orRESTRICTdeliberately. - CHECK — Row-level rules (
quantity > 0, allowed status enum). - DEFAULT — Sensible values at insert time (
created_at, status).
Application validation is still required for UX; constraints are the last line of defense against bugs and ad-hoc SQL.
Normalization (design course module)
Normalization reduces redundancy and update anomalies. A typical course covers:
- 1NF — Atomic columns; no repeating groups (no “phone1, phone2” columns—use a child table).
- 2NF — No partial dependency on a composite key.
- 3NF — No non-key attribute depends on another non-key attribute.
In practice you also denormalize on purpose for read performance (materialized summaries, JSON blobs for display-only fields)—but you should know what you are trading away.
Indexes and why queries get slow
An index is a sorted structure (commonly B-tree) that lets the engine find rows without scanning the whole table. Create indexes for columns that appear in WHERE, JOIN, and ORDER BY on large tables—but every index slows writes and uses disk.
CREATE INDEX idx_orders_customer_date
ON orders (customer_id, order_date DESC);
-- PostgreSQL: partial index for hot paths
CREATE INDEX idx_orders_pending
ON orders (order_date)
WHERE status = 'pending';
Learn to read a plan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 7 AND status = 'pending';
Red flags in plans: Seq Scan on huge tables, nested loops with massive row counts, sort steps that spill to disk. Fixes: better indexes, updated statistics (ANALYZE), rewrite joins, or—last resort—schema change.
Composite index column order matters: leading columns must match how you filter ((customer_id, order_date) helps WHERE customer_id = ?; it may not help WHERE order_date = ? alone).
Transactions and ACID
A transaction groups statements into one unit: all succeed (COMMIT) or none apply (ROLLBACK).
- Atomicity — All or nothing.
- Consistency — Constraints hold before and after.
- Isolation — Concurrent sessions do not see each other’s half-finished work (levels: Read Committed, Repeatable Read, Serializable).
- Durability — Committed data survives crash after WAL/fsync rules are met.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
Deadlocks happen when two transactions lock rows in opposite order—retry with backoff in application code. Long transactions block vacuum/maintenance and hold connections—keep units of work short.
Views, procedures, and who owns business logic
A view is a stored SELECT; use it to simplify reporting and enforce column subsets. Materialized views store results (refresh on schedule) for heavy dashboards.
Stored procedures push logic into the database—good for batch ETL close to data; risky when product logic becomes hard to test in CI. Most web teams keep business rules in application code and use SQL for data access plus migrations.
Dialects: PostgreSQL vs MySQL (and cloud warehouses)
| Topic | PostgreSQL | MySQL (InnoDB) |
|---|---|---|
| Auto-increment | SERIAL / GENERATED … IDENTITY | AUTO_INCREMENT |
| Boolean | Native BOOLEAN | TINYINT(1) convention |
| Upsert | ON CONFLICT | ON DUPLICATE KEY UPDATE |
| JSON | JSONB + rich operators | JSON type + functions |
| Full-text | tsvector / GIN indexes | FULLTEXT indexes |
| Vectors (AI) | pgvector extension | Varies; often external store |
On AWS, managed SQL maps to RDS / Aurora, Cloud SQL, Azure Database—see that mapping post when you place workloads on a hyperscaler.
SQL in data engineering and ML
Downstream of transactional databases, SQL reappears in:
- Warehouses — Transform raw events into dimensions and facts (star schema).
- dbt-style pipelines — Versioned SQL models with tests.
- Feature stores — Point-in-time correct joins so training data does not leak future information.
- RAG — Metadata filters plus vector search; PostgreSQL with pgvector is a common pattern (see How to become an AI developer).
For pipeline mindset and quality, pair this course with Data Engineering on AWS.
Course lab progression (suggested)
- Install PostgreSQL locally (or use Docker). Run
psqland create the sample schema above. - Write 20 SELECTs: filters, sorting, limits, simple aggregates.
- Add JOIN exercises: inner, left, anti-join with
NOT EXISTS. - Introduce CTEs; rewrite nested subqueries as
WITHchains. - One window-function worksheet (running totals, top-N per group with
ROW_NUMBER). - Transaction lab: transfer balance; force rollback; observe isolation with two sessions.
- Index lab:
EXPLAIN ANALYZEbefore and after index on a 100k-row synthetic table. - Capstone: design schema for a domain you know (inventory, tickets, courses); document PK/FK and three reporting queries.
Common pitfalls (exam and production)
- SELECT * in application code — Breaks when columns are added; hides I/O cost.
- Implicit joins — Comma-FROM without WHERE; use explicit
JOINsyntax. - NULL in aggregates —
COUNT(*)vsCOUNT(column);AVGignores NULLs. - N+1 queries — ORM loads parent then one query per child; fix with join or batch load.
- Over-indexing — Every index taxes writes; measure before adding five indexes “just in case.”
- Logic only in the app — Missing FK/CHECK lets orphan rows into analytics forever.
What I took from the course
SQL is not a legacy checkbox—it is the contract between services and truth. The course worth finishing leaves you able to model data without obvious redundancy, query it clearly with joins and windows, and operate it with transactions and plans. Everything else—ORMs, warehouses, vector search—is built on that foundation.
Further reading
- PostgreSQL documentation — tutorials, EXPLAIN, and JSONB
- Use The Index, Luke — index and tuning mental models
- SQL style guides (e.g. dbt, Mozilla) for team consistency
- Mode Analytics SQL tutorial — window functions with exercises
Blog index · Data Engineering on AWS · How to become an AI developer