SQL Course Notes: From First SELECT to Production-Ready Queries

Structured Query Language (SQL) is how applications, analysts, and data engineers talk to relational databases. This post is a detailed course-style reference: the relational model, DDL and DML, joins and aggregations, subqueries and window functions, indexes and transactions, normalization, and how SQL fits cloud data work—written so you can study once and return when debugging slow queries or designing schemas.

In short

Tables hold rows; keys link tables; SQL reads and writes through declarative statements. Master SELECT + JOIN + GROUP BY first, then indexes and transactions, then dialect-specific features (PostgreSQL, MySQL) and how SQL connects to pipelines and ML feature stores.

Why SQL still matters in 2026

Applications store state somewhere. For decades, the default has been the relational database: rows in tables, relationships enforced by keys, and a shared language—SQL—to define structure and ask questions. NoSQL, data lakes, and vector stores expanded the toolbox, but SQL remains the lingua franca for transactional systems, reporting, feature engineering, and increasingly for analytics warehouses (Snowflake, BigQuery, Redshift all speak SQL dialects).

A solid SQL course teaches three layers at once:

  1. Conceptual — entities, relationships, integrity, and when normalization helps or hurts.
  2. Language — statements you type in a client or ORM-generated query.
  3. Operational — indexes, plans, locks, backups, and how cloud managed databases change defaults.

This post walks all three. Examples use ANSI-style SQL with notes where PostgreSQL and MySQL diverge—matching the stacks I use in platform and application work.

The relational model in one picture

Think in tables (relations), not spreadsheets with formulas. Each table has a fixed set of columns (attributes) and zero or more rows (tuples). Relationships between tables are expressed by storing keys, not by drawing arrows in application memory.

ConceptMeaningExample
Primary key (PK)Uniquely identifies a row in its tablecustomer_id
Foreign key (FK)References a PK in another table; enforces referential integrityorders.customer_id → customers.customer_id
Candidate keyAny column set that could be PK (email, national ID)UNIQUE (email)
NULL“Unknown / not applicable,” not zero or empty stringOptional middle name

Sample schema for a small e-commerce domain:

CREATE TABLE customers (
  customer_id   SERIAL PRIMARY KEY,
  email         VARCHAR(255) NOT NULL UNIQUE,
  full_name     VARCHAR(200) NOT NULL,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE orders (
  order_id      SERIAL PRIMARY KEY,
  customer_id   INT NOT NULL REFERENCES customers(customer_id),
  order_date    DATE NOT NULL DEFAULT CURRENT_DATE,
  status        VARCHAR(20) NOT NULL DEFAULT 'pending'
);

CREATE TABLE order_items (
  order_id      INT NOT NULL REFERENCES orders(order_id) ON DELETE CASCADE,
  product_id    INT NOT NULL,
  quantity      INT NOT NULL CHECK (quantity > 0),
  unit_price    NUMERIC(10,2) NOT NULL,
  PRIMARY KEY (order_id, product_id)
);

SQL statement families (DDL, DML, DCL, TCL)

FamilyPurposeCommon statements
DDL (Data Definition)Shape of the databaseCREATE, ALTER, DROP, TRUNCATE
DML (Data Manipulation)Rows in tablesSELECT, INSERT, UPDATE, DELETE, MERGE
DCL (Data Control)PermissionsGRANT, REVOKE
TCL (Transaction Control)Unit of work boundariesBEGIN, COMMIT, ROLLBACK, SAVEPOINT

Course labs usually spend most time on SELECT and JOIN; production work adds DDL migrations (Flyway, Liquibase, Alembic) and transaction discipline.

Reading data: SELECT and filtering

SELECT is declarative: you describe the result shape; the optimizer chooses how to fetch it.

SELECT c.full_name, o.order_id, o.order_date, o.status
FROM   customers AS c
JOIN   orders AS o ON o.customer_id = c.customer_id
WHERE  o.order_date >= DATE '2026-01-01'
  AND  o.status IN ('shipped', 'delivered')
ORDER  BY o.order_date DESC, o.order_id
LIMIT  50;

Clause order of evaluation (logical, not always physical): FROMWHEREGROUP BYHAVINGSELECTORDER BYLIMIT.

Filtering tools to know cold:

  • Comparisons=, <>, <, BETWEEN, IN, LIKE (patterns; prefer ILIKE in Postgres for case-insensitive).
  • NULL logicIS NULL / IS NOT NULL; remember NULL = NULL is unknown, not true.
  • Boolean logicAND, OR, NOT; use parentheses when mixing conditions.
  • SetsUNION (dedupe) vs UNION ALL (keep duplicates, often faster).

Joins: how tables combine

Joins are the heart of relational SQL. You match rows on a predicate—usually equality of keys.

JoinResult rowsWhen to use
INNER JOINOnly matches on both sidesDefault: “customers who have orders”
LEFT JOINAll left rows; NULLs where no right match“All customers, including those with zero orders”
RIGHT JOINMirror of LEFTRare; swap tables and use LEFT instead
FULL OUTERAll rows from both; NULLs where missingReconciliation, data quality diffs
CROSS JOINCartesian productCalendar generation, combinatorics—use carefully

Count orders per customer (aggregation + join):

SELECT c.customer_id,
       c.full_name,
       COUNT(o.order_id) AS order_count
FROM   customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.full_name
ORDER BY order_count DESC;

Anti-join pattern (customers with no orders):

SELECT c.*
FROM   customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE  o.order_id IS NULL;

Or with NOT EXISTS, which often optimizes well:

SELECT c.*
FROM   customers c
WHERE  NOT EXISTS (
  SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);

Aggregations and GROUP BY

Aggregate functions collapse many rows into one value per group: COUNT, SUM, AVG, MIN, MAX, and in PostgreSQL STRING_AGG, ARRAY_AGG, JSON_AGG.

Rule: every non-aggregated column in SELECT must appear in GROUP BY (or be functionally dependent on the group key in engines that allow it).

SELECT DATE_TRUNC('month', o.order_date) AS month,
       SUM(oi.quantity * oi.unit_price) AS revenue
FROM   orders o
JOIN   order_items oi ON oi.order_id = o.order_id
WHERE  o.status = 'delivered'
GROUP BY 1
HAVING SUM(oi.quantity * oi.unit_price) > 10000
ORDER BY month;

WHERE filters rows before grouping; HAVING filters after grouping. Putting row filters in HAVING is a common course mistake—it works but confuses readers and can block index use.

Subqueries, CTEs, and readable SQL

A subquery nests inside another statement. A Common Table Expression (CTE) names a subquery with WITH for clarity and reuse:

WITH monthly_revenue AS (
  SELECT DATE_TRUNC('month', o.order_date) AS month,
         SUM(oi.quantity * oi.unit_price) AS revenue
  FROM   orders o
  JOIN   order_items oi ON oi.order_id = o.order_id
  WHERE  o.status = 'delivered'
  GROUP BY 1
)
SELECT month, revenue,
       revenue - LAG(revenue) OVER (ORDER BY month) AS month_over_month
FROM   monthly_revenue
ORDER BY month;

When to prefer each:

  • CTE — Multi-step reports, readable pipelines, same subquery referenced twice.
  • Scalar subquery — Single value in SELECT or WHERE (WHERE total > (SELECT AVG(total) FROM orders)).
  • EXISTS — Semi-joins (“has at least one related row”) without duplicating parent rows.
  • IN — Small static lists; for large sets, JOIN or EXISTS often plans better.

PostgreSQL also supports recursive CTEs for hierarchies (org charts, bill of materials)—worth a second course module.

Window functions (analytics without collapsing rows)

Aggregates return one row per group. Window functions add a column while keeping row granularity, using OVER (PARTITION BY … ORDER BY …).

SELECT order_id,
       customer_id,
       order_date,
       SUM(quantity * unit_price) OVER (PARTITION BY order_id) AS order_total,
       ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS nth_order
FROM   orders o
JOIN   order_items oi USING (order_id);

Functions to learn in order:

  • RankingROW_NUMBER, RANK, DENSE_RANK
  • OffsetsLAG, LEAD for period-over-period metrics
  • FramesROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW for running totals

Writing and changing data

INSERT INTO customers (email, full_name)
VALUES ('[email protected]', 'Ada Lovelace')
RETURNING customer_id;

UPDATE orders
SET    status = 'cancelled'
WHERE  order_id = 42 AND status = 'pending';

DELETE FROM order_items
WHERE  order_id = 42;

Upsert (insert or update on conflict) is dialect-specific but essential in APIs:

-- PostgreSQL
INSERT INTO customers (email, full_name)
VALUES ('[email protected]', 'Ada Lovelace')
ON CONFLICT (email) DO UPDATE
SET full_name = EXCLUDED.full_name;

Always scope UPDATE and DELETE with WHERE. An unscoped UPDATE in production is a résumé-generating event.

Constraints and data integrity

Integrity belongs in the database when multiple apps or batch jobs touch the same tables:

  • NOT NULL — Column must have a value.
  • UNIQUE — No duplicate values (NULLs usually allowed once per SQL standard).
  • PRIMARY KEY — NOT NULL + unique row identity.
  • FOREIGN KEY — Parent row must exist; choose ON DELETE CASCADE, SET NULL, or RESTRICT deliberately.
  • CHECK — Row-level rules (quantity > 0, allowed status enum).
  • DEFAULT — Sensible values at insert time (created_at, status).

Application validation is still required for UX; constraints are the last line of defense against bugs and ad-hoc SQL.

Normalization (design course module)

Normalization reduces redundancy and update anomalies. A typical course covers:

  • 1NF — Atomic columns; no repeating groups (no “phone1, phone2” columns—use a child table).
  • 2NF — No partial dependency on a composite key.
  • 3NF — No non-key attribute depends on another non-key attribute.

In practice you also denormalize on purpose for read performance (materialized summaries, JSON blobs for display-only fields)—but you should know what you are trading away.

Indexes and why queries get slow

An index is a sorted structure (commonly B-tree) that lets the engine find rows without scanning the whole table. Create indexes for columns that appear in WHERE, JOIN, and ORDER BY on large tables—but every index slows writes and uses disk.

CREATE INDEX idx_orders_customer_date
ON orders (customer_id, order_date DESC);

-- PostgreSQL: partial index for hot paths
CREATE INDEX idx_orders_pending
ON orders (order_date)
WHERE status = 'pending';

Learn to read a plan:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 7 AND status = 'pending';

Red flags in plans: Seq Scan on huge tables, nested loops with massive row counts, sort steps that spill to disk. Fixes: better indexes, updated statistics (ANALYZE), rewrite joins, or—last resort—schema change.

Composite index column order matters: leading columns must match how you filter ((customer_id, order_date) helps WHERE customer_id = ?; it may not help WHERE order_date = ? alone).

Transactions and ACID

A transaction groups statements into one unit: all succeed (COMMIT) or none apply (ROLLBACK).

  • Atomicity — All or nothing.
  • Consistency — Constraints hold before and after.
  • Isolation — Concurrent sessions do not see each other’s half-finished work (levels: Read Committed, Repeatable Read, Serializable).
  • Durability — Committed data survives crash after WAL/fsync rules are met.
BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;
  UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

Deadlocks happen when two transactions lock rows in opposite order—retry with backoff in application code. Long transactions block vacuum/maintenance and hold connections—keep units of work short.

Views, procedures, and who owns business logic

A view is a stored SELECT; use it to simplify reporting and enforce column subsets. Materialized views store results (refresh on schedule) for heavy dashboards.

Stored procedures push logic into the database—good for batch ETL close to data; risky when product logic becomes hard to test in CI. Most web teams keep business rules in application code and use SQL for data access plus migrations.

Dialects: PostgreSQL vs MySQL (and cloud warehouses)

TopicPostgreSQLMySQL (InnoDB)
Auto-incrementSERIAL / GENERATED … IDENTITYAUTO_INCREMENT
BooleanNative BOOLEANTINYINT(1) convention
UpsertON CONFLICTON DUPLICATE KEY UPDATE
JSONJSONB + rich operatorsJSON type + functions
Full-texttsvector / GIN indexesFULLTEXT indexes
Vectors (AI)pgvector extensionVaries; often external store

On AWS, managed SQL maps to RDS / Aurora, Cloud SQL, Azure Database—see that mapping post when you place workloads on a hyperscaler.

SQL in data engineering and ML

Downstream of transactional databases, SQL reappears in:

  • Warehouses — Transform raw events into dimensions and facts (star schema).
  • dbt-style pipelines — Versioned SQL models with tests.
  • Feature stores — Point-in-time correct joins so training data does not leak future information.
  • RAG — Metadata filters plus vector search; PostgreSQL with pgvector is a common pattern (see How to become an AI developer).

For pipeline mindset and quality, pair this course with Data Engineering on AWS.

Course lab progression (suggested)

  1. Install PostgreSQL locally (or use Docker). Run psql and create the sample schema above.
  2. Write 20 SELECTs: filters, sorting, limits, simple aggregates.
  3. Add JOIN exercises: inner, left, anti-join with NOT EXISTS.
  4. Introduce CTEs; rewrite nested subqueries as WITH chains.
  5. One window-function worksheet (running totals, top-N per group with ROW_NUMBER).
  6. Transaction lab: transfer balance; force rollback; observe isolation with two sessions.
  7. Index lab: EXPLAIN ANALYZE before and after index on a 100k-row synthetic table.
  8. Capstone: design schema for a domain you know (inventory, tickets, courses); document PK/FK and three reporting queries.

Common pitfalls (exam and production)

  • SELECT * in application code — Breaks when columns are added; hides I/O cost.
  • Implicit joins — Comma-FROM without WHERE; use explicit JOIN syntax.
  • NULL in aggregatesCOUNT(*) vs COUNT(column); AVG ignores NULLs.
  • N+1 queries — ORM loads parent then one query per child; fix with join or batch load.
  • Over-indexing — Every index taxes writes; measure before adding five indexes “just in case.”
  • Logic only in the app — Missing FK/CHECK lets orphan rows into analytics forever.

What I took from the course

SQL is not a legacy checkbox—it is the contract between services and truth. The course worth finishing leaves you able to model data without obvious redundancy, query it clearly with joins and windows, and operate it with transactions and plans. Everything else—ORMs, warehouses, vector search—is built on that foundation.

Further reading

  • PostgreSQL documentation — tutorials, EXPLAIN, and JSONB
  • Use The Index, Luke — index and tuning mental models
  • SQL style guides (e.g. dbt, Mozilla) for team consistency
  • Mode Analytics SQL tutorial — window functions with exercises

Blog index · Data Engineering on AWS · How to become an AI developer

Back to blog list