Imagine you're building the world's smartest filing cabinet. That's data modelling! We'll teach you the Kimball way to organize data so anyone can find answers in seconds.
Based on Kimball Dimensional Modeling Techniques by Ralph Kimball & Margy Ross | Explained simply by Fakhruddin Khambaty
A Star Schema - The heart of dimensional modelling. Fact table in the center, dimensions around it like a star!
Click any section to jump straight there!
The big picture of data modelling
Where the numbers live
The descriptive labels
Integration across the enterprise
Handling change over time (Types 0-7)
Levels and drill-downs
Pro techniques for fact tables
Expert dimension patterns
Real-time, errors, and more
Before we build anything, we need to understand WHY and HOW we organize data. Think of this as the blueprint before building a house.
Before you write a single line of SQL, you need to do two critical things:
Step 1: Talk to the Business People - Sit down with managers, analysts, and decision-makers. Ask them: What decisions do you need to make? What numbers (KPIs) do you look at every day? What questions keep you up at night?
Step 2: Talk to the Data People - Meet with the engineers who manage the source systems. Look at what data actually exists. Profile the data - how clean is it? What's missing? What are the volumes?
At a Retail Store: The Marketing VP says "I want to know which products sell best on weekends." The data team checks: "We have POS (point-of-sale) data with timestamps, product codes, and store IDs. But the product descriptions are messy - some say 'Coca Cola' and others say 'Coke 330ml'." Now you know what the business wants AND what the data reality looks like!
Dimensional models should NEVER be designed by a lone genius in a dark room. They must be designed together with the business people who actually use the data.
The data modeller leads the workshops, but subject matter experts from the business AND data governance representatives MUST participate. Models designed in isolation by techies who don't understand the business will fail!
Every dimensional model follows these exact 4 steps. Always. No exceptions. This is the recipe!
The 4-Step Dimensional Design Process - Follow this recipe every single time!
A business process is an operational activity that your organization does repeatedly. It's the "thing that happens" that generates data.
Each business process gets its own fact table. Don't try to cram "sales" and "inventory" into the same table - they happen at different times, with different measurements and different detail levels. Each process = one row in the Enterprise Bus Matrix.
The grain defines exactly what ONE row in your fact table represents. This is the single most important decision you'll make. Get it wrong, and everything falls apart.
NEVER mix different grains in the same fact table! If one row represents "a single item sold" and another represents "daily store total," your math will be completely wrong. Different grains = different tables. Period.
Dimensions answer the journalist's questions: Who? What? Where? When? Why? How? They contain all the descriptive text labels that make your numbers meaningful.
Dimension tables are sometimes called the "soul" of the data warehouse because they're what users actually see and interact with. When someone filters by "Category = Electronics" or groups by "Region = Middle East" - they're using dimensions!
Each dimension should be single-valued for a given fact row. If a patient has multiple diagnoses, that's a special case (multivalued dimension) that needs a bridge table - we'll cover that in advanced topics!
Facts are the numeric measurements that result from a business event. They're almost always numbers - things you can add up, average, count, etc.
Only facts consistent with the declared grain are allowed! If your grain is "one product sold in one transaction," you can include quantity_sold and item_price. But the store manager's salary? Nope - that doesn't match this grain!
A Star Schema is the physical layout of your dimensional model in a database. It's called a "star" because when you draw it, the fact table sits in the center and dimension tables radiate outward like the points of a star.
OLAP Cubes are similar in content but live in multidimensional databases. They allow more powerful analytical queries than regular SQL. Think of OLAP cubes as the "3D version" of a star schema - same data, fancier analysis tools.
Star schemas are simple for business users to understand, fast for databases to query (fewer JOINs!), and flexible enough to handle virtually any business question. That's why they've been the gold standard since 1996!
One of the best things about dimensional models is they're resilient to change. When business needs evolve, you can extend without breaking anything!
Here are the 4 changes you can make without breaking any existing queries:
What does "grain" mean in dimensional modelling?
Fact tables are where the numbers live. Let's learn how to build them right, with all their variations and gotchas!
A fact table contains numeric measures from real-world events. Each row is one measurement event. Think of it as a receipt - each line is something that actually happened.
Every fact table has:
A typical fact table: foreign keys at the top, numeric measures below. FK = Foreign Key, DD = Degenerate Dimension
The design of a fact table is based entirely on what physically happens in the real world - NOT on what reports people want. If you design for reports, the model becomes brittle and breaks when someone asks a new question!
Not all numbers can be added up the same way. This is one of the most important distinctions in fact tables!
| Type | Can Sum Across... | Examples | Tip |
|---|---|---|---|
| Additive | ALL dimensions | Revenue, Quantity, Cost | Best kind! Most flexible |
| Semi-Additive | SOME dimensions (not time) | Account Balance, Inventory Count | Use AVG or latest for time |
| Non-Additive | NONE (can't meaningfully sum) | Ratios, Percentages, Unit Prices | Store components, calculate later |
For non-additive facts like profit margin, store the additive components instead (revenue and cost) and let the BI tool calculate the ratio at the end. This way you can always roll up the components correctly!
Null measures (fact columns) are FINE. SUM, COUNT, MIN, MAX, AVG all handle nulls gracefully - they just skip them.
Null foreign keys are NOT fine. They break referential integrity. Instead, create a special "Unknown" or "Not Applicable" row in the dimension table with its own surrogate key.
When the same measurement appears in different fact tables, it MUST be defined identically if you want to compare them. Same formula, same units, same business rules.
There are exactly 3 fundamental types of fact tables. Every fact table in the universe is one of these:
One row = one event at a point in time. Like a receipt line item.
POS (Point of Sale) transactions, website clicks, ATM withdrawals, insurance claim filings
One row = a summary over a standard time period. Like a daily/weekly/monthly report card.
Daily inventory levels, monthly account balances, weekly sales summaries. These are always dense - every row exists even with zero activity.
One row = an entire process from start to finish, updated as milestones are reached.
Order fulfillment pipeline, insurance claim processing, loan application workflow. Multiple date foreign keys track each milestone. Rows get UPDATED (unique among fact tables!).
Wait - a fact table with NO facts? Yes! Sometimes events don't have numbers, but the event itself is worth recording.
Coverage tables record what COULD happen (all students enrolled in all classes). Activity tables record what DID happen. Subtract one from the other to find what DIDN'T happen!
Aggregate fact tables are pre-calculated summaries built to speed up queries. Think of them as cheat sheets - they hold pre-computed totals so the database doesn't have to crunch billions of rows every time.
Consolidated fact tables combine facts from multiple processes into one table WHEN they share the same grain. For example, putting actual sales and forecasted sales side by side makes comparison easy and fast.
Dimensions are the "soul" of the warehouse. They're what users see, filter, and group by. Let's master them!
Dimension tables are wide, flat, and denormalized. They have one primary key and LOTS of descriptive text columns. They're the opposite of what you'd design in a normalized OLTP system!
A typical dimension table: wide, flat, with many descriptive text columns. Brand, Category, Department all in ONE table (denormalized)!
Use verbose, meaningful descriptions - not cryptic codes! Instead of "status = A", store "status = Active." Users should be able to read dimension values without needing a decoder ring.
Every dimension table gets an anonymous integer primary key - we call it a surrogate key. Simple numbers: 1, 2, 3, 4... That's it!
Why not use natural keys? Because natural keys (like employee ID "EMP-001") can change, come from multiple systems, be poorly managed, and when you track history with SCD Type 2, one employee will have MULTIPLE rows - so the natural key isn't unique anymore!
The Date dimension can use a meaningful key like 20250216 (YYYYMMDD format) instead of a surrogate key, because dates are stable and predictable. But still add a special row for "Unknown Date"!
Three types of keys you need to understand:
The durable key is important when a natural key can change. Example: An employee quits and is rehired - they get a NEW employee number, but the warehouse wants to track them as the SAME person. The durable key stays constant forever.
Drilling down simply means adding more detail to a report. In SQL terms, you add another column to your GROUP BY clause.
You don't need pre-defined drill paths! Users can drill down using ANY dimension attribute from ANY dimension attached to the fact table. Want to drill from Year to Product Brand? Go for it!
A degenerate dimension is a dimension key that lives directly in the fact table with no corresponding dimension table. The most common example: invoice numbers, order numbers, ticket numbers.
In dimensional modelling, we deliberately denormalize. Instead of having separate tables for Product → Brand → Category → Department, we smash them all into ONE flat dimension table.
Resist the urge to normalize! If you've spent years in OLTP database design, denormalization will feel "wrong." But in dimensional modelling, it's absolutely the RIGHT approach. Simplicity + Speed = Happy Users!
Multiple Hierarchies: A single dimension can have many hierarchies living side by side. A Date dimension has Day→Week→Fiscal Period AND Day→Month→Year. A Geography dimension has City→State→Country AND City→Sales Region→Division. They all coexist peacefully in one table!
Flags and Indicators: Replace cryptic codes with readable text! Instead of "status = T/F", use "is_active = Yes/No". Instead of "type = A", use "customer_type = Active Premium Member". When codes have embedded meaning (like "NYC-RET-001"), break them into separate columns: city = New York, type = Retail, sequence = 001.
Null Attributes: Avoid NULL in dimension attributes! Use descriptive strings like "Unknown" or "Not Applicable" instead. Different databases handle NULLs in GROUP BY and WHERE clauses inconsistently, which leads to bugs.
The date dimension is the most common dimension - attached to virtually every fact table. It's a pre-built calendar that makes date-based analysis a breeze.
A date dimension - pre-computed calendar attributes. The primary key can be YYYYMMDD for easy partitioning.
If you need time-of-day precision, add a separate timestamp column to the fact table. If users filter by day parts (morning shift vs. evening shift), create a separate time-of-day dimension.
One physical dimension table used multiple times in the same fact table. Example: An order has an "order date," a "ship date," and a "delivery date" - all pointing to the SAME date dimension through different foreign keys. Each reference is a separate "role" (view with unique column names).
Miscellaneous low-cardinality flags (payment method, order priority, gift wrap yes/no) combined into a single "transaction profile" dimension instead of creating dozens of tiny separate dimensions.
When dimension hierarchies are normalized into separate tables (Product → Brand table → Category table). Called "snowflake" because it looks like a snowflake instead of a star.
AVOID snowflaking! It makes queries harder to write, confuses business users, and can hurt performance. A flat denormalized dimension contains exactly the same information and is much easier to use.
A dimension that references another dimension. Example: A bank account dimension has an "account_open_date" that points to the date dimension. Use sparingly! In most cases, put both dimensions as separate foreign keys in the fact table instead.
This is how you connect different parts of the business together. The secret sauce of enterprise data warehousing!
Conformed dimensions are the glue that holds the entire enterprise data warehouse together. When the same dimension (e.g., Customer, Product, Date) is shared across multiple fact tables with identical column names and values, you can combine reports across different business processes.
When you use the same conformed dimension as a row header in a drill-across report, results from separate fact tables align perfectly on the same rows. This is what makes enterprise analytics possible. Build once, reuse everywhere!
Shrunken Rollup Dimensions are subsets of a full dimension - fewer rows and/or columns. Used for aggregate fact tables or when a business process captures data at a higher level (e.g., forecasts by month+brand instead of day+product).
Drilling Across means querying two or more fact tables separately, then merging the results on common conformed dimension attributes. The key: each query runs independently, and the answers are aligned by sort-merging on shared row headers.
The Value Chain is the natural flow of business processes in your organization. For retail: Purchasing → Warehousing → Retail Sales. For finance: Budgeting → Commitments → Payments. Each step produces its own fact table.
The Enterprise Data Warehouse Bus Architecture provides an incremental, manageable approach to building the warehouse. It says: "Focus on one business process at a time, but make sure your dimensions are conformed across all of them." This way, each piece works alone AND integrates with everything else.
The Bus Matrix: rows = business processes, columns = dimensions. Shared dimensions enable cross-process analytics!
A variation of the Bus Matrix where dimension columns are replaced with business functions (Marketing, Sales, Finance, Operations). The shaded cells show which business groups care about which processes.
Data changes over time. People move, products get renamed, prices change. How do we handle this? With 8 different techniques!
The value NEVER changes. Whatever was first recorded stays forever.
"Original credit score", date of birth, first hire date. Things labeled "original" that should never be overwritten.
Simply overwrite the old value with the new one. History is destroyed.
The MOST important SCD! Creates a NEW row with a new surrogate key. Old row is marked as expired. Full history preserved!
Row 1: Sarah, Dubai, effective 2020-01-01, expired 2025-01-15, current=N
Row 2: Sarah, Abu Dhabi, effective 2025-01-16, expired 9999-12-31, current=Y
Three extra columns: effective_date, expiration_date, current_row_indicator
Add a NEW column to store the old value. One row, but with both old and new values side by side. Called "alternate reality."
One row: Sarah | current_city = Abu Dhabi | previous_city = Dubai. Users can filter by either! Rarely used because it only tracks ONE change.
For rapidly changing attributes in huge dimensions. Split the fast-changing columns into a separate "mini-dimension" table.
Builds on Type 4: the mini-dimension is also embedded as a Type 1 outrigger in the base dimension, so you can see CURRENT mini-dimension values without going through the fact table.
Adds Type 1 (current value) columns to a Type 2 dimension. Every row has both the "as was" value and the "as is" value. Filter by either one!
Row 1: Sarah, historical_city=Dubai, current_city=Abu Dhabi, eff=2020
Row 2: Sarah, historical_city=Abu Dhabi, current_city=Abu Dhabi, eff=2025
The "current_city" is overwritten (Type 1) on ALL rows when it changes.
The fact table has TWO keys to the same dimension: a surrogate key (for Type 2 "as-was" view) AND a durable key (for Type 1 "as-is" view). Two different views of the same dimension for maximum flexibility.
A customer changes their address. You need to see what their address was at the time of each order AND their current address. Which SCD type?
From simple Product→Brand→Category to complex organizational structures with variable depth!
A fixed depth hierarchy has a predictable, consistent number of levels with agreed-upon names. Example: Product → Brand → Category → Department. Always exactly 4 levels, always the same names.
Fixed depth hierarchies are BY FAR the easiest to query and navigate. If your hierarchy fits this pattern, always use positional attributes in a flat dimension table. Don't overcomplicate it!
Sometimes the hierarchy depth varies a little bit - like geography: some places have City → State → Country (3 levels), while others have City → Province → Region → Country (4 levels).
The really tricky ones! Organization charts, bill of materials, account charts - where the depth is completely unpredictable. Might be 2 levels, might be 20.
Solution 1: Bridge Table - A special helper table that stores every possible path through the hierarchy. One row for each "from-to" path. Enables all hierarchy queries with standard SQL.
Solution 2: Pathstring Attribute - A single column storing the full path as a string: "CEO|VP Sales|Director East|Manager NY". Can handle many hierarchy queries with LIKE operators, but less flexible for substituting alternative hierarchies.
Ragged hierarchies are genuinely difficult in relational databases. SQL recursive queries (WITH RECURSIVE) have limitations. The bridge table approach is the most robust solution, though it requires extra ETL work.
Pro-level patterns for handling complex measurement scenarios. These are the techniques that separate beginners from experts!
While not required, adding a single surrogate key column to fact tables is useful. It serves as a simple primary key, helps identify rows during ETL, allows interrupted loads to resume, and converts risky UPDATE operations into safer INSERT+DELETE pairs.
A centipede fact table has dozens of foreign keys because someone created separate dimensions for every level of a hierarchy (date dim, month dim, quarter dim, year dim - all as separate foreign keys). DON'T DO THIS! Collapse them into one dimension at the lowest grain (just the date dimension).
Some numbers are ambiguous - is "list price" a fact or a dimension attribute? Here's the rule:
In accumulating snapshots with multiple milestones, users want to analyze time lags between steps. Smart approach: store ONE lag per step (measured from the start), then any lag between two steps is a simple subtraction.
Header/Line: Transactions often have a header (the order) and lines (individual items). In the fact table, include ALL header-level dimension keys on every line-level row. Don't force users to join back to a header table!
Allocated Facts: When a fact exists at the header level (like shipping cost for the whole order), allocate it down to the line level proportionally. This way, every fact can be sliced by every dimension.
Profit & Loss Tables: Revenue - Costs = Profit at atomic grain. Incredibly powerful but politically charged since cost allocation requires executive buy-in. Not a first-phase project!
Multiple Currencies: Store PAIRS of columns: one in the transaction's original currency, one in a standard currency (USD). Add a currency dimension to identify the original currency.
Multiple Units of Measure: Store facts once in a standard unit, plus conversion factors for all other units. Deploy views for each user group with their preferred unit. The conversion factors MUST live in the fact row for accuracy.
Year-to-Date: Don't store YTD in fact tables! Calculate it in the BI layer instead. YTD is a moving target - "YTD as of fiscal period end" is different from "YTD as of today" - and storing it creates maintenance nightmares.
Multipass SQL (NEVER join fact-to-fact!): If you need to compare shipments and returns, DON'T join the two fact tables directly. It creates uncontrollable row explosions with wrong results. Instead, query each fact table separately and merge the results on common dimension attributes. This is called "drill-across."
Joining two fact tables across foreign keys is one of the most common and dangerous mistakes in dimensional modelling. The cardinality explodes and results are WRONG. Always use drill-across (separate queries, then merge)!
Timespan Tracking: In rare cases, add effective/expiration dates to fact rows (like SCD Type 2 for facts). Useful for slowly changing balances where periodic snapshots would create identical duplicate rows.
Late Arriving Facts: When a fact row arrives late (delayed data feed), you need to find the dimension keys that were effective at the TIME the event actually occurred, not when it arrived. Search the dimension's history to find the right surrogate keys.
Expert-level patterns for complex real-world scenarios. Multivalued relationships, behavior tracking, and more!
When a dimension references another (via outrigger), Type 2 changes in the outrigger can force explosive Type 2 growth in the base dimension. Solution: demote the relationship to the fact table, where both dimensions are separate foreign keys.
Normally, each dimension has ONE value per fact row. But sometimes there are legitimately MULTIPLE values - like a patient with 3 simultaneous diagnoses.
Bridge tables may also need effective/expiration dates (like Type 2 SCD) when the multivalued relationships change over time, such as bank accounts linked to multiple customers.
Behavior Tags: Data mining produces textual behavior labels (like "High Spender", "Bargain Hunter", "Churning") assigned periodically. Store these as positional attributes in the customer dimension: behavior_jan, behavior_feb, etc. They're used for pattern matching, not math.
Study Groups: Complex behavior analysis (like "customers who bought Product A, then returned Product B within 30 days, and haven't bought in 90 days") is too complex for normal queries. Run the analysis once, save the results in a simple table of customer durable keys, and use that table as a filter in any schema.
Aggregated Facts as Dimension Attributes: "Show me only customers who spent over $10,000 last year." To enable this, pre-calculate aggregated metrics and store them in the dimension table, often as banded ranges: "Lifetime Value = $10K-$25K".
Dynamic Value Banding: Reports grouped by value ranges ("Balance $0-$10", "$10-$25", etc.) where the ranges are defined at query time using a small banding dimension table joined via greater-than/less-than conditions, or SQL CASE statements.
Text Comments: Free-form text comments don't belong in fact tables! Store them in a separate comments dimension with a foreign key from the fact table. This keeps the fact table lean and fast.
Multiple Time Zones: Use dual foreign keys in the fact table - one for UTC standard time and one for local time, each pointing to role-playing date dimensions.
Measure Type Dimensions: When a fact table has hundreds of sparsely populated fact columns, you MIGHT collapse them into a single "measure_value" column identified by a "measure_type" dimension. Generally NOT recommended because it multiplies fact table rows and makes cross-column math hard. Only use when facts number in the hundreds.
Step Dimensions: For sequential processes (web page flows), a step dimension tracks where each step fits: step_number=3, total_steps=7. Helps analyze drop-off points!
Hot Swappable Dimensions: Same fact table, different dimension versions for different users. A stock ticker fact table exposed to multiple investors, each with their own proprietary attributes on the same stocks.
Abstract Generic Dimensions (AVOID!): Don't create a single "Person" dimension for employees, customers, and vendors. Don't create a generic "Location" dimension for stores, warehouses, and customer addresses. These have different attributes and create large, confusing tables. Keep them separate!
Data abstraction is fine in OLTP source systems and ETL processing, but it HURTS query performance and legibility in dimensional models. Keep your dimensions specific and meaningful!
Audit Dimensions: Track ETL metadata! When a fact row is created, stamp it with: data quality score, ETL code version, processing timestamp, source system identifier. Invaluable for compliance and debugging.
Late Arriving Dimensions: When facts arrive before their dimension context (real-time systems), create placeholder dimension rows with the natural key and "Unknown" for descriptive attributes. When the real data arrives later, overwrite with Type 1. For retroactive Type 2 changes, insert a new historical dimension row and restate affected fact rows.
Unique patterns for heterogeneous products, real-time data, and data quality management!
When a business offers wildly different products (checking accounts, mortgages, business loans - all "accounts" but completely different), a single fact table with every possible column is impossible.
Solution: Build a supertype (core) fact table with common facts shared by ALL account types (balance, transaction count), plus separate subtype (custom) fact tables for each account type with their unique facts. Same pattern for dimensions: one core dimension with common attributes, plus subtype-specific dimension tables.
Supertype/Subtype pattern: One core table for shared metrics, separate tables for product-specific details
When the nightly batch isn't fast enough and you need data updated in real-time (or near real-time). Techniques include:
Data quality requires a system of screens/filters that test data as it flows from source to warehouse. When errors are detected, they're recorded in a special dimensional schema in the ETL back room.
Two related tables:
You've just covered ALL the Kimball Dimensional Modeling Techniques - from basic star schemas to advanced bridge tables and error event schemas. That's the complete toolkit used by data warehouse professionals worldwide since 1996!