Queries inside queries! Like Russian nesting dolls for data. Then CTEs โ the sticky notes that make complex queries readable.
Before we dive into subqueries and CTEs, let's set up some tables to work with. We'll use an employees table, a departments table, and an orders table โ the kind of data you'd find at any real company.
-- Create departments CREATE TABLE departments ( id SERIAL PRIMARY KEY, name VARCHAR(50) ); INSERT INTO departments (name) VALUES ('Engineering'), ('Marketing'), ('Sales'), ('HR'); -- Create employees CREATE TABLE employees ( id SERIAL PRIMARY KEY, name VARCHAR(100), department_id INTEGER REFERENCES departments(id), salary DECIMAL(10,2), manager_id INTEGER REFERENCES employees(id), hire_date DATE ); INSERT INTO employees (name, department_id, salary, manager_id, hire_date) VALUES ('Alice Chen', 1, 120000, NULL, '2019-03-15'), ('Bob Martinez', 1, 95000, 1, '2020-07-01'), ('Carol White', 1, 105000, 1, '2020-01-10'), ('Diana Lopez', 2, 88000, NULL, '2018-11-20'), ('Eve Johnson', 2, 72000, 4, '2021-06-05'), ('Frank Kim', 3, 110000, NULL, '2017-09-12'), ('Grace Patel', 3, 68000, 6, '2022-02-28'), ('Henry Davis', 3, 75000, 6, '2021-08-15'), ('Irene Nguyen', 4, 92000, NULL, '2019-05-01'), ('Jack Robinson', 4, 65000, 9, '2023-01-15'); -- Create orders CREATE TABLE orders ( id SERIAL PRIMARY KEY, employee_id INTEGER REFERENCES employees(id), amount DECIMAL(10,2), category VARCHAR(50), order_date DATE ); INSERT INTO orders (employee_id, amount, category, order_date) VALUES (6, 5000, 'Software', '2024-01-15'), (6, 12000, 'Hardware', '2024-02-20'), (7, 3500, 'Software', '2024-01-25'), (8, 8000, 'Hardware', '2024-03-10'), (7, 2500, 'Services', '2024-02-14'), (6, 15000, 'Software', '2024-03-01'), (8, 4500, 'Services', '2024-03-22'), (2, 9500, 'Hardware', '2024-01-30');
Here's a quick look at our employees data:
| id | name | department_id | salary | manager_id |
|---|---|---|---|---|
| 1 | Alice Chen | 1 (Engineering) | 120,000 | NULL |
| 2 | Bob Martinez | 1 (Engineering) | 95,000 | 1 |
| 3 | Carol White | 1 (Engineering) | 105,000 | 1 |
| 4 | Diana Lopez | 2 (Marketing) | 88,000 | NULL |
| 5 | Eve Johnson | 2 (Marketing) | 72,000 | 4 |
| 6 | Frank Kim | 3 (Sales) | 110,000 | NULL |
| 7 | Grace Patel | 3 (Sales) | 68,000 | 6 |
| 8 | Henry Davis | 3 (Sales) | 75,000 | 6 |
| 9 | Irene Nguyen | 4 (HR) | 92,000 | NULL |
| 10 | Jack Robinson | 4 (HR) | 65,000 | 9 |
See how manager_id points back to the same employees table? That's called a self-referencing foreign key. Employees with NULL manager_id are the top-level managers (department heads). This structure will be perfect for our recursive CTE example later!
Imagine your teacher asks: "Who scored higher than the class average?" To answer that, you need to do two things: first figure out what the class average is, then find everyone who beat it. A subquery is exactly that โ a question (query) inside another question (query). The inner question runs first and feeds its answer into the outer question.
A subquery is a SELECT statement nested inside another SQL statement. It's wrapped in parentheses and can appear in WHERE, FROM, SELECT, or even HAVING clauses.
Think of it like ordering food at a restaurant. You say: "I'll have whatever the most popular dish is." The waiter first has to go check what the most popular dish is (inner query), then bring it to you (outer query). You asked one question, but the waiter had to answer a hidden question first to fulfill your request.
Here's the simplest possible subquery โ finding employees who earn more than the company average:
SELECT name, salary FROM employees WHERE salary > ( SELECT AVG(salary) FROM employees );
How PostgreSQL runs this:
Result:
| name | salary |
|---|---|
| Alice Chen | 120,000 |
| Bob Martinez | 95,000 |
| Carol White | 105,000 |
| Frank Kim | 110,000 |
| Irene Nguyen | 92,000 |
The most common place for subqueries is inside the WHERE clause. There are two flavors:
A scalar subquery returns a single number or value โ like asking "What's the highest score?" The answer is just one number, say 98. Then you can use that number in a comparison like WHERE score > 98. If a subquery returns more than one value when you expected one, PostgreSQL will throw an error!
-- Find the employee(s) with the highest salary SELECT name, salary FROM employees WHERE salary = ( SELECT MAX(salary) FROM employees );
Result:
| name | salary |
|---|---|
| Alice Chen | 120,000 |
-- Find employees in the same department as Alice Chen SELECT name, department_id FROM employees WHERE department_id = ( SELECT department_id FROM employees WHERE name = 'Alice Chen' ) AND name <> 'Alice Chen';
Result:
| name | department_id |
|---|---|
| Bob Martinez | 1 |
| Carol White | 1 |
Instead of one answer, imagine the inner question returns a list. Like: "Which departments have more than 2 employees?" The answer is a list: (1, 3). Then the outer query says: "Give me all employees in departments (1, 3)." The IN keyword checks if a value is anywhere in that list.
-- Find employees in departments that have 3 or more people SELECT name, department_id FROM employees WHERE department_id IN ( SELECT department_id FROM employees GROUP BY department_id HAVING COUNT(*) >= 3 );
Result:
| name | department_id |
|---|---|
| Alice Chen | 1 |
| Bob Martinez | 1 |
| Carol White | 1 |
| Frank Kim | 3 |
| Grace Patel | 3 |
| Henry Davis | 3 |
-- Find employees who have placed at least one order SELECT name FROM employees WHERE id IN ( SELECT DISTINCT employee_id FROM orders );
Result:
| name |
|---|
| Bob Martinez |
| Frank Kim |
| Grace Patel |
| Henry Davis |
If you use = with a subquery that returns multiple rows, PostgreSQL throws an error: "more than one row returned by a subquery used as an expression." Use = only when the subquery returns exactly ONE value. Use IN when it might return multiple values.
Imagine you do some math homework and write the answers on a sticky note. Now you use that sticky note as if it were a page in your textbook โ you look things up from it, compare numbers on it, etc. A derived table is exactly that: you write a query that creates a temporary result, and then you treat that result like it's a real table you can query from.
When you put a subquery in the FROM clause, the result becomes a temporary table. You must give it an alias (a name).
-- Find the top spender in each order category SELECT category_totals.category, e.name, category_totals.total_spent FROM ( -- This subquery becomes a temporary table called "category_totals" SELECT employee_id, category, SUM(amount) AS total_spent FROM orders GROUP BY employee_id, category ) AS category_totals JOIN employees e ON e.id = category_totals.employee_id ORDER BY category_totals.category, category_totals.total_spent DESC;
Result:
| category | name | total_spent |
|---|---|---|
| Hardware | Frank Kim | 12,000 |
| Hardware | Bob Martinez | 9,500 |
| Hardware | Henry Davis | 8,000 |
| Services | Henry Davis | 4,500 |
| Services | Grace Patel | 2,500 |
| Software | Frank Kim | 20,000 |
| Software | Grace Patel | 3,500 |
-- Compare each employee's salary to their department average SELECT e.name, e.salary, dept_avg.avg_salary, ROUND(e.salary - dept_avg.avg_salary, 2) AS diff_from_avg FROM employees e JOIN ( SELECT department_id, ROUND(AVG(salary), 2) AS avg_salary FROM employees GROUP BY department_id ) AS dept_avg ON e.department_id = dept_avg.department_id ORDER BY e.department_id, e.salary DESC;
PostgreSQL requires an alias for subqueries in FROM. Without one you'll get: "subquery in FROM must have an alias." Just add AS some_name after the closing parenthesis.
In a normal subquery, the inner question runs once and hands its answer to the outer query. But a correlated subquery is like a clingy little sibling โ it asks the outer query for help on every single row. Imagine going through a class roster and for each student asking: "Is THIS student's score higher than THEIR class average?" The "their class" part changes per student, so you have to recalculate each time.
A correlated subquery references columns from the outer query. This means it must run once for every row the outer query processes.
-- For each employee, compare salary to their OWN department average SELECT e.name, e.salary, e.department_id, ( SELECT ROUND(AVG(e2.salary), 2) FROM employees e2 WHERE e2.department_id = e.department_id -- references outer query! ) AS dept_avg_salary FROM employees e ORDER BY e.department_id;
Result:
| name | salary | department_id | dept_avg_salary |
|---|---|---|---|
| Alice Chen | 120,000 | 1 | 106,666.67 |
| Bob Martinez | 95,000 | 1 | 106,666.67 |
| Carol White | 105,000 | 1 | 106,666.67 |
| Diana Lopez | 88,000 | 2 | 80,000.00 |
| Eve Johnson | 72,000 | 2 | 80,000.00 |
| Frank Kim | 110,000 | 3 | 84,333.33 |
| Grace Patel | 68,000 | 3 | 84,333.33 |
| Henry Davis | 75,000 | 3 | 84,333.33 |
| Irene Nguyen | 92,000 | 4 | 78,500.00 |
| Jack Robinson | 65,000 | 4 | 78,500.00 |
PostgreSQL processes it row by row:
Inner query calculates AVG for department_id = 1 โ 106,666.67
Inner query calculates AVG for department_id = 1 โ 106,666.67 (same dept, same result)
Inner query recalculates AVG for department_id = 2 โ 80,000.00 (different dept!)
Because the inner query runs once per row, correlated subqueries can be painfully slow on large tables. If you have 1 million rows, the inner query runs 1 million times! In many cases, you can rewrite a correlated subquery as a JOIN with a derived table (like we did in the previous section) for much better performance. PostgreSQL's optimizer sometimes does this automatically, but don't count on it.
Imagine you're a teacher checking homework. For each student, you peek into the homework box and ask: "Is there ANY homework from this student in here?" You don't care what the homework says โ you just want a YES or NO. That's EXISTS. It doesn't return data; it returns true/false โ "does at least one matching row exist?"
-- Find employees who have placed at least one order SELECT e.name FROM employees e WHERE EXISTS ( SELECT 1 FROM orders o WHERE o.employee_id = e.id );
Result:
| name |
|---|
| Bob Martinez |
| Frank Kim |
| Grace Patel |
| Henry Davis |
Inside EXISTS, it doesn't matter what you SELECT โ PostgreSQL only checks if any row is returned, not what columns it has. Writing SELECT 1 or SELECT * or even SELECT 'banana' all work identically. Convention is SELECT 1 to signal "I only care about existence."
-- Find employees who have NEVER placed an order SELECT e.name FROM employees e WHERE NOT EXISTS ( SELECT 1 FROM orders o WHERE o.employee_id = e.id );
Result:
| name |
|---|
| Alice Chen |
| Carol White |
| Diana Lopez |
| Eve Johnson |
| Irene Nguyen |
| Jack Robinson |
| Feature | IN (subquery) | EXISTS |
|---|---|---|
| How it works | Runs inner query first, builds a list, then checks each outer row against the list | For each outer row, checks if at least one inner row matches |
| NULL safety | NOT IN breaks with NULLs! | NOT EXISTS handles NULLs correctly |
| Performance (large inner result) | Slower โ must materialize entire list | Faster โ can stop at first match |
| Performance (small inner result) | Usually similar | Usually similar |
| Readability | Simpler syntax | Slightly more verbose |
IN is like getting a full guest list and checking if your name is on it. EXISTS is like walking up to the bouncer who checks one name at a time. For a short guest list, both are fast. For a list with 100,000 names, the bouncer (EXISTS) can be faster because he stops the moment he finds your name โ he doesn't need to read the whole list first.
Imagine you're solving a big math problem. Instead of writing one enormous equation, you grab some sticky notes. On the first note you write: "Step A answer = 42." On the second note: "Step B answer = 15." Then for the final answer, you just look at your sticky notes: "Final = Step A + Step B = 57." CTEs are those sticky notes! You name each piece, calculate it, and then use those names in your final query. It's way cleaner than stuffing everything into one giant nested mess.
A CTE is defined using the WITH keyword. It creates a named temporary result that you can reference in the main query.
WITH cte_name AS ( -- Your query here (the "sticky note") SELECT ... ) -- Main query uses the CTE like a regular table SELECT ... FROM cte_name;
WITH avg_salary AS ( SELECT ROUND(AVG(salary), 2) AS company_avg FROM employees ) SELECT e.name, e.salary, a.company_avg, e.salary - a.company_avg AS above_avg_by FROM employees e, avg_salary a WHERE e.salary > a.company_avg ORDER BY above_avg_by DESC;
Result:
| name | salary | company_avg | above_avg_by |
|---|---|---|---|
| Alice Chen | 120,000 | 89,000 | 31,000 |
| Frank Kim | 110,000 | 89,000 | 21,000 |
| Carol White | 105,000 | 89,000 | 16,000 |
| Bob Martinez | 95,000 | 89,000 | 6,000 |
| Irene Nguyen | 92,000 | 89,000 | 3,000 |
You can chain multiple CTEs together separated by commas:
WITH dept_stats AS ( -- Sticky note 1: department salary stats SELECT department_id, COUNT(*) AS emp_count, ROUND(AVG(salary), 2) AS avg_salary, MAX(salary) AS max_salary FROM employees GROUP BY department_id ), order_stats AS ( -- Sticky note 2: total orders per department SELECT e.department_id, COUNT(o.id) AS total_orders, COALESCE(SUM(o.amount), 0) AS total_revenue FROM employees e LEFT JOIN orders o ON o.employee_id = e.id GROUP BY e.department_id ) -- Main query: combine both sticky notes SELECT d.name AS department, ds.emp_count, ds.avg_salary, ds.max_salary, os.total_orders, os.total_revenue FROM dept_stats ds JOIN departments d ON d.id = ds.department_id JOIN order_stats os ON os.department_id = ds.department_id ORDER BY os.total_revenue DESC;
Result:
| department | emp_count | avg_salary | max_salary | total_orders | total_revenue |
|---|---|---|---|---|---|
| Sales | 3 | 84,333.33 | 110,000 | 7 | 50,500 |
| Engineering | 3 | 106,666.67 | 120,000 | 1 | 9,500 |
| Marketing | 2 | 80,000.00 | 88,000 | 0 | 0 |
| HR | 2 | 78,500.00 | 92,000 | 0 | 0 |
In a chain of CTEs, each one can reference any CTE defined before it. So order_stats could use dept_stats if needed. This lets you build up complex logic step by step, like a recipe where each step builds on the previous one.
Imagine you want to draw your family tree. You start with yourself. Then you ask: "Who are my parents?" For each parent, you ask: "Who are THEIR parents?" And you keep asking until you run out of ancestors. A recursive CTE does the same thing โ it starts with a base case (you), then keeps expanding by finding related rows, until there's nothing more to find. It's perfect for any tree-like or hierarchical data: org charts, category trees, file systems, etc.
WITH RECURSIVE cte_name AS ( -- BASE CASE: the starting point (the "root") SELECT ... FROM table WHERE ... -- condition to find the root(s) UNION ALL -- RECURSIVE CASE: find children of the current level SELECT ... FROM table JOIN cte_name ON ... -- join back to itself! ) SELECT * FROM cte_name;
WITH RECURSIVE org_chart AS ( -- BASE CASE: start with top-level managers (no manager) SELECT id, name, manager_id, 1 AS level, name::TEXT AS path FROM employees WHERE manager_id IS NULL UNION ALL -- RECURSIVE CASE: find employees managed by the current level SELECT e.id, e.name, e.manager_id, oc.level + 1, oc.path || ' โ ' || e.name FROM employees e JOIN org_chart oc ON e.manager_id = oc.id ) SELECT REPEAT(' ', level - 1) || name AS org_tree, level, path FROM org_chart ORDER BY path;
Result:
| org_tree | level | path |
|---|---|---|
| Alice Chen | 1 | Alice Chen |
| Bob Martinez | 2 | Alice Chen โ Bob Martinez |
| Carol White | 2 | Alice Chen โ Carol White |
| Diana Lopez | 1 | Diana Lopez |
| Eve Johnson | 2 | Diana Lopez โ Eve Johnson |
| Frank Kim | 1 | Frank Kim |
| Grace Patel | 2 | Frank Kim โ Grace Patel |
| Henry Davis | 2 | Frank Kim โ Henry Davis |
| Irene Nguyen | 1 | Irene Nguyen |
| Jack Robinson | 2 | Irene Nguyen โ Jack Robinson |
Finds 4 rows: Alice, Diana, Frank, Irene (manager_id IS NULL). These are level 1.
Finds employees whose manager_id matches the ids from iteration 0. Bob & Carol (under Alice), Eve (under Diana), Grace & Henry (under Frank), Jack (under Irene). These are level 2.
Looks for employees managed by level 2 employees. Finds NOTHING โ no one reports to Bob, Carol, Eve, etc. Recursion stops!
Recursive CTEs can generate data from nothing โ no table needed!
-- Generate numbers 1 to 10 WITH RECURSIVE nums AS ( SELECT 1 AS n -- Base case: start at 1 UNION ALL SELECT n + 1 -- Recursive case: add 1 FROM nums WHERE n < 10 -- Stop condition! ) SELECT n FROM nums;
Without a WHERE clause to stop the recursion, your query will run forever (until PostgreSQL's max_recursion limit kicks in, defaulting to about 100 iterations, and throws an error). Always make sure the recursive case has a condition that will eventually return zero rows.
Both CTEs and subqueries can solve the same problems. So when should you use each?
| Criteria | Subquery | CTE (WITH) |
|---|---|---|
| Readability | Gets messy with nesting | Clean, top-down, named blocks |
| Reuse | Must duplicate the subquery | Define once, reference multiple times |
| Debugging | Hard to test inner queries alone | Easy โ comment out the main query and run the CTE |
| Recursion | Not possible | Only CTEs support recursion |
| Performance (PG 12+) | Always inlined | Inlined by default (same as subquery) |
| Force materialization | Not possible | WITH ... AS MATERIALIZED |
Subqueries: Great for simple, one-off "questions inside questions." If it's a single line or small enough to read at a glance, a subquery is fine.
CTEs: Use when the query is complex, when you need to reference the same result more than once, or when you need recursion. Think of it as: "If I need a sticky note to keep track of things, I should use a CTE."
-- SUBQUERY VERSION: nested and harder to read SELECT e.name, e.salary FROM employees e WHERE e.salary > ( SELECT AVG(salary) FROM employees ) AND e.department_id IN ( SELECT department_id FROM employees GROUP BY department_id HAVING COUNT(*) >= 3 ); -- CTE VERSION: clean and readable WITH company_avg AS ( SELECT AVG(salary) AS avg_sal FROM employees ), big_departments AS ( SELECT department_id FROM employees GROUP BY department_id HAVING COUNT(*) >= 3 ) SELECT e.name, e.salary FROM employees e, company_avg c WHERE e.salary > c.avg_sal AND e.department_id IN (SELECT department_id FROM big_departments);
In professional codebases, CTEs are the preferred choice for anything beyond trivial subqueries. They make SQL code self-documenting because each CTE has a meaningful name. When you come back to a query 6 months later, WITH monthly_revenue AS (...) is infinitely clearer than a mysterious nested subquery buried three levels deep.
Subqueries are like whispering an instruction mid-sentence: "Give me employees who earn more than (quick โ what's the average salary?) the average." CTEs are like writing notes on a whiteboard first: "Step 1: Average salary = $89K. Step 2: Find employees above $89K." Both get you the same answer, but the whiteboard version is what you'd use in a presentation.
Time to put it all together! Try these yourself before revealing the solutions.
Using a subquery, find all employees who earn more than the average salary of the Sales department (department_id = 3). Show their name, salary, and department_id.
SELECT name, salary, department_id FROM employees WHERE salary > ( SELECT AVG(salary) FROM employees WHERE department_id = 3 );
Using EXISTS, find departments that have at least one employee who has placed an order. Show the department name.
SELECT d.name FROM departments d WHERE EXISTS ( SELECT 1 FROM employees e JOIN orders o ON o.employee_id = e.id WHERE e.department_id = d.id );
Returns: Engineering, Sales
Write a query using two CTEs: one for department headcount and one for department order totals. Join them to show department name, headcount, and total order amount.
WITH headcount AS ( SELECT department_id, COUNT(*) AS emp_count FROM employees GROUP BY department_id ), dept_orders AS ( SELECT e.department_id, COALESCE(SUM(o.amount), 0) AS total_amount FROM employees e LEFT JOIN orders o ON o.employee_id = e.id GROUP BY e.department_id ) SELECT d.name, h.emp_count, do.total_amount FROM departments d JOIN headcount h ON h.department_id = d.id JOIN dept_orders do ON do.department_id = d.id ORDER BY do.total_amount DESC;
Using a recursive CTE, find the full management chain for Grace Patel (id = 7). Start from Grace and walk UP to her manager, then her manager's manager, etc.
WITH RECURSIVE chain AS ( -- Start with Grace SELECT id, name, manager_id, 1 AS level FROM employees WHERE id = 7 UNION ALL -- Walk UP to the manager SELECT e.id, e.name, e.manager_id, c.level + 1 FROM employees e JOIN chain c ON e.id = c.manager_id ) SELECT name, level FROM chain;
Returns: Grace Patel (1), Frank Kim (2)
You now know how to write queries inside queries, name them with CTEs for clarity, and even walk hierarchical data with recursive CTEs. Here's a summary:
| Concept | What It Does | Key Syntax |
|---|---|---|
| Scalar Subquery | Returns one value | WHERE col = (SELECT ...) |
| IN Subquery | Returns a list of values | WHERE col IN (SELECT ...) |
| Derived Table | Subquery as a table in FROM | FROM (SELECT ...) AS alias |
| Correlated Subquery | References outer query per row | WHERE col = (SELECT ... WHERE inner.x = outer.x) |
| EXISTS / NOT EXISTS | Checks for existence of rows | WHERE EXISTS (SELECT 1 ...) |
| CTE | Named temporary result | WITH name AS (SELECT ...) |
| Recursive CTE | Walks hierarchies | WITH RECURSIVE name AS (...) |
You've learned to write powerful, complex queries. But how do you make them fast? In the next module, you'll learn about indexes โ the secret weapon that turns slow queries taking 30 seconds into lightning-fast lookups completing in milliseconds. You'll also learn to read execution plans with EXPLAIN ANALYZE, so you can see exactly what PostgreSQL does behind the scenes. It's like getting X-ray vision for your database!