Data Modelling - Kimball Dimensional Modeling

1

Fundamental Concepts

Before we build anything, we need to understand WHY and HOW we organize data. Think of this as the blueprint before building a house.

Gather Business Requirements & Data Realities

Before you write a single line of SQL, you need to do two critical things:

🏠

Imagine you want to build a house. You don't just grab bricks and start stacking! First, you ask the family: "How many bedrooms? Do we need a garage? How big is the kitchen?" Then you check the land: "Is it flat? Is there a river nearby? Can we build 3 stories?" That's exactly what we do with data modelling!

Step 1: Talk to the Business People - Sit down with managers, analysts, and decision-makers. Ask them: What decisions do you need to make? What numbers (KPIs) do you look at every day? What questions keep you up at night?

Step 2: Talk to the Data People - Meet with the engineers who manage the source systems. Look at what data actually exists. Profile the data - how clean is it? What's missing? What are the volumes?

Real-World Example

At a Retail Store: The Marketing VP says "I want to know which products sell best on weekends." The data team checks: "We have POS (point-of-sale) data with timestamps, product codes, and store IDs. But the product descriptions are messy - some say 'Coca Cola' and others say 'Coke 330ml'." Now you know what the business wants AND what the data reality looks like!

Collaborative Dimensional Modeling Workshops

Dimensional models should NEVER be designed by a lone genius in a dark room. They must be designed together with the business people who actually use the data.

🍕

Think of ordering pizza for a big party. If one person decides everything alone, half the guests won't like it! Instead, you get everyone together: "Who's vegetarian? Who wants spicy? Any allergies?" The data modeller is like the pizza organizer - they lead the session, but everyone's voice matters.

Key Rule

The data modeller leads the workshops, but subject matter experts from the business AND data governance representatives MUST participate. Models designed in isolation by techies who don't understand the business will fail!

The Four-Step Dimensional Design Process

Every dimensional model follows these exact 4 steps. Always. No exceptions. This is the recipe!

Select the Business Process - What activity are we modelling? (e.g., "retail sales", "insurance claims", "student registration")
Declare the Grain - What does ONE row in our fact table represent? (e.g., "one product sold in one transaction" or "one daily balance per account")
Identify the Dimensions - What are the descriptive "who, what, where, when, why, how" labels? (e.g., Date, Product, Store, Customer)
Identify the Facts - What are the numeric measurements? (e.g., quantity sold, dollar amount, discount percentage)

📝

It's like writing a news article! Every journalist follows: What happened? (Process) How detailed? (Grain) The 5 W's? (Dimensions) The numbers? (Facts). "A customer (WHO) bought 3 bottles of milk (WHAT/HOW MANY) at Store #42 (WHERE) on Monday (WHEN) using a 10% coupon (WHY) for $8.97 (MEASUREMENT)."

The 4-Step Dimensional Design Process - Follow this recipe every single time!

Business Processes

A business process is an operational activity that your organization does repeatedly. It's the "thing that happens" that generates data.

🏪

Think of a grocery store. What "things happen" there every day? Customers check out (that's a process!), deliveries arrive (another process!), employees clock in/out (another one!). Each of these is a business process, and each one generates its own unique data that deserves its own fact table.

Examples of Business Processes

Retail: Taking an order, processing a return, receiving inventory
Insurance: Filing a claim, approving a policy, paying out
University: Registering students, grading exams, awarding degrees
Banking: Opening an account, depositing money, snapshotting balances monthly

Important Rule

Each business process gets its own fact table. Don't try to cram "sales" and "inventory" into the same table - they happen at different times, with different measurements and different detail levels. Each process = one row in the Enterprise Bus Matrix.

Grain - The Most Critical Decision!

The grain defines exactly what ONE row in your fact table represents. This is the single most important decision you'll make. Get it wrong, and everything falls apart.

📸

Think of a camera. The "grain" is your zoom level. Are you taking a photo of the entire city? (summary grain) Or zooming into one person's face? (atomic grain). In data modelling, we almost always want the most zoomed-in level possible - called atomic grain - because you can always zoom out from a detailed photo, but you can never zoom INTO a blurry one!

Grain Examples

Atomic grain: "One product line item in one transaction" (most detailed!)
Summary grain: "Total sales per store per day" (rolled up - less flexible)
Snapshot grain: "Account balance at end of each month"

⚠️

NEVER mix different grains in the same fact table! If one row represents "a single item sold" and another represents "daily store total," your math will be completely wrong. Different grains = different tables. Period.

Dimensions for Descriptive Context

Dimensions answer the journalist's questions: Who? What? Where? When? Why? How? They contain all the descriptive text labels that make your numbers meaningful.

🏷️

Imagine someone tells you: "We sold 47 units for $329.53." That's useless without context! But add dimensions: "We sold 47 units of Organic Almond Milk (Product) at Store #42 in Dubai (Store) on Tuesday January 14th (Date) to customers using a BOGO coupon (Promotion) for $329.53." NOW that's useful! Dimensions are the labels that make numbers tell a story.

Dimension tables are sometimes called the "soul" of the data warehouse because they're what users actually see and interact with. When someone filters by "Category = Electronics" or groups by "Region = Middle East" - they're using dimensions!

Design Tip

Each dimension should be single-valued for a given fact row. If a patient has multiple diagnoses, that's a special case (multivalued dimension) that needs a bridge table - we'll cover that in advanced topics!

Facts for Measurements

Facts are the numeric measurements that result from a business event. They're almost always numbers - things you can add up, average, count, etc.

💰

If dimensions are the "labels on a box," facts are "what's INSIDE the box." When a customer buys something, the facts are: How many did they buy? (quantity = 3). How much did they pay? (amount = $15.99). What was the discount? (discount = $2.00). These are the numbers the CEO cares about!

🚫

Only facts consistent with the declared grain are allowed! If your grain is "one product sold in one transaction," you can include quantity_sold and item_price. But the store manager's salary? Nope - that doesn't match this grain!

Star Schemas and OLAP Cubes

A Star Schema is the physical layout of your dimensional model in a database. It's called a "star" because when you draw it, the fact table sits in the center and dimension tables radiate outward like the points of a star.

⭐

Draw a circle in the middle of a paper (that's your fact table with all the numbers). Now draw 5-6 rectangles around it, connected with lines (those are your dimension tables with all the descriptions). What does it look like? A star! Hence the name. It's the most common and most understandable way to organize analytical data.

OLAP Cubes are similar in content but live in multidimensional databases. They allow more powerful analytical queries than regular SQL. Think of OLAP cubes as the "3D version" of a star schema - same data, fancier analysis tools.

Why Stars Are Great

Star schemas are simple for business users to understand, fast for databases to query (fewer JOINs!), and flexible enough to handle virtually any business question. That's why they've been the gold standard since 1996!

Graceful Extensions to Dimensional Modelling

One of the best things about dimensional models is they're resilient to change. When business needs evolve, you can extend without breaking anything!

🧩

Dimensional models are like LEGO. Need a new room? Snap on more bricks. No need to demolish the house! All existing reports and dashboards keep working perfectly.

Here are the 4 changes you can make without breaking any existing queries:

Add new facts: Got a new measurement? Just add a column to the fact table.
Add new dimensions: Need to track something new? Add a foreign key column to the fact table (as long as the grain doesn't change!).
Add new attributes: Want more details on a dimension? Add columns to the dimension table.
Make grain more atomic: Need finer detail? You can break down to a more detailed grain while preserving existing column names.

Quick Check - Do You Get It?

What does "grain" mean in dimensional modelling?

2

Basic Fact Table Techniques

Fact tables are where the numbers live. Let's learn how to build them right, with all their variations and gotchas!

Fact Table Structure

A fact table contains numeric measures from real-world events. Each row is one measurement event. Think of it as a receipt - each line is something that actually happened.

Every fact table has:

Foreign keys - links to each dimension table (date_key, product_key, store_key, etc.)
Numeric measures - the actual numbers (quantity, amount, cost)
Optional degenerate dimension keys - like invoice numbers (no separate dimension table)
Date/time stamps - for precise timing

sales_fact

date_key (FK)

product_key (FK)

store_key (FK)

customer_key (FK)

invoice_number (DD)

quantity_sold

unit_price

discount_amount

net_amount

A typical fact table: foreign keys at the top, numeric measures below. FK = Foreign Key, DD = Degenerate Dimension

Golden Rule

The design of a fact table is based entirely on what physically happens in the real world - NOT on what reports people want. If you design for reports, the model becomes brittle and breaks when someone asks a new question!

Additive, Semi-Additive, and Non-Additive Facts

Not all numbers can be added up the same way. This is one of the most important distinctions in fact tables!

🧮

Imagine you have 3 piggy banks in 3 rooms. Additive: "How much money in total?" Just add all three - $10 + $20 + $15 = $45. Works! Semi-additive: "What's the bank balance?" You can add across stores ($10+$20+$15) but NOT across months (January balance + February balance = nonsense!). Non-additive: "What's the profit margin percentage?" You can't add 20% + 30% + 25% - that's meaningless!

Type	Can Sum Across...	Examples	Tip
Additive	ALL dimensions	Revenue, Quantity, Cost	Best kind! Most flexible
Semi-Additive	SOME dimensions (not time)	Account Balance, Inventory Count	Use AVG or latest for time
Non-Additive	NONE (can't meaningfully sum)	Ratios, Percentages, Unit Prices	Store components, calculate later

Pro Tip

For non-additive facts like profit margin, store the additive components instead (revenue and cost) and let the BI tool calculate the ratio at the end. This way you can always roll up the components correctly!

Nulls in Fact Tables

Null measures (fact columns) are FINE. SUM, COUNT, MIN, MAX, AVG all handle nulls gracefully - they just skip them.

Null foreign keys are NOT fine. They break referential integrity. Instead, create a special "Unknown" or "Not Applicable" row in the dimension table with its own surrogate key.

📋

It's like a form. Leaving the "discount amount" blank is okay (the item had no discount). But leaving the "customer name" blank means we literally don't know who bought it. Instead, write "Unknown Customer" - at least now we know it's unknown rather than just missing!

Conformed Facts

When the same measurement appears in different fact tables, it MUST be defined identically if you want to compare them. Same formula, same units, same business rules.

📏

If the Sales team measures "revenue" as the amount before tax, but the Finance team measures "revenue" as the amount AFTER tax, comparing them side by side is like comparing meters to feet - you'll get wrong answers! If they mean the same thing, name them the same. If they're different, give them DIFFERENT names so nobody gets confused.

The Three Types of Fact Tables

There are exactly 3 fundamental types of fact tables. Every fact table in the universe is one of these:

Transaction Fact Tables

One row = one event at a point in time. Like a receipt line item.

🧾

Every beep at the cash register creates one row. Customer bought milk at 3:42 PM? That's one row. Bought bread at 3:42 PM? That's another row. Each row is a single, atomic event. These tables can be huge but very sparse - rows only appear when something happens.

Example

POS (Point of Sale) transactions, website clicks, ATM withdrawals, insurance claim filings

Periodic Snapshot Fact Tables

One row = a summary over a standard time period. Like a daily/weekly/monthly report card.

📅

Think of your bank statement. At the end of each month, the bank writes down: "Your balance was $5,234." They don't list every transaction - just the snapshot at that moment. Even if you had NO transactions that month, there's still a row showing your balance!

Example

Daily inventory levels, monthly account balances, weekly sales summaries. These are always dense - every row exists even with zero activity.

Accumulating Snapshot Fact Tables

One row = an entire process from start to finish, updated as milestones are reached.

📦

Imagine tracking a package. Row created: "Order placed Jan 5." Updated: "Shipped Jan 7." Updated: "In transit Jan 8." Updated: "Delivered Jan 10." One row, multiple date columns, continuously updated. Each date column is a milestone in the process!

Example

Order fulfillment pipeline, insurance claim processing, loan application workflow. Multiple date foreign keys track each milestone. Rows get UPDATED (unique among fact tables!).

Factless Fact Tables

Wait - a fact table with NO facts? Yes! Sometimes events don't have numbers, but the event itself is worth recording.

🎓

A student attends a class on Monday. There's no number to measure - no grade, no score. But the event "Student X attended Class Y with Teacher Z in Room 101 on Monday" is valuable! That's a factless fact table - just foreign keys, no measurements. You can also use them to track what DIDN'T happen: "Which students were enrolled but DIDN'T attend?"

Two Types of Factless Tables

Coverage tables record what COULD happen (all students enrolled in all classes). Activity tables record what DID happen. Subtract one from the other to find what DIDN'T happen!

Aggregate & Consolidated Fact Tables

Aggregate fact tables are pre-calculated summaries built to speed up queries. Think of them as cheat sheets - they hold pre-computed totals so the database doesn't have to crunch billions of rows every time.

🏎️

Imagine a library. The atomic fact table is every single book. The aggregate is a summary card: "Science section: 4,231 books, 12 shelves." When someone asks "How many science books?" you check the card (fast!) instead of counting every book (slow!). The BI tool should automatically pick the right level - this is called aggregate navigation.

Consolidated fact tables combine facts from multiple processes into one table WHEN they share the same grain. For example, putting actual sales and forecasted sales side by side makes comparison easy and fast.

3

Basic Dimension Table Techniques

Dimensions are the "soul" of the warehouse. They're what users see, filter, and group by. Let's master them!

Dimension Table Structure

Dimension tables are wide, flat, and denormalized. They have one primary key and LOTS of descriptive text columns. They're the opposite of what you'd design in a normalized OLTP system!

dim_product

product_key (PK, surrogate)

product_natural_key

product_name

product_description

brand_name

category_name

department_name

package_type

package_size

fat_content

diet_type

weight_uom

A typical dimension table: wide, flat, with many descriptive text columns. Brand, Category, Department all in ONE table (denormalized)!

Design Philosophy

Use verbose, meaningful descriptions - not cryptic codes! Instead of "status = A", store "status = Active." Users should be able to read dimension values without needing a decoder ring.

Dimension Surrogate Keys

Every dimension table gets an anonymous integer primary key - we call it a surrogate key. Simple numbers: 1, 2, 3, 4... That's it!

🎫

Think of a coat check at a restaurant. You hand in your coat (the real data), and they give you a ticket with a number on it: "#47." That ticket number has no meaning - it's just a simple way to identify YOUR coat. That's a surrogate key! We don't use the coat's brand name or color because those could change or be duplicated.

Why not use natural keys? Because natural keys (like employee ID "EMP-001") can change, come from multiple systems, be poorly managed, and when you track history with SCD Type 2, one employee will have MULTIPLE rows - so the natural key isn't unique anymore!

Exception

The Date dimension can use a meaningful key like 20250216 (YYYYMMDD format) instead of a surrogate key, because dates are stable and predictable. But still add a special row for "Unknown Date"!

Natural, Durable, and Supernatural Keys

Three types of keys you need to understand:

🪪

Natural Key = Your school student ID (EMP-001). Can change if you transfer schools!
Durable/Supernatural Key = Your government-issued national ID. Never changes even if you change jobs, move cities, or get married. It's permanent.
Surrogate Key = The coat check ticket number. Just for internal warehouse use, sequential, meaningless outside the system.

The durable key is important when a natural key can change. Example: An employee quits and is rehired - they get a NEW employee number, but the warehouse wants to track them as the SAME person. The durable key stays constant forever.

Drilling Down

Drilling down simply means adding more detail to a report. In SQL terms, you add another column to your GROUP BY clause.

🔍

Report shows: "Total Sales = $1M." Boss asks: "Break that down by region." Now you see: "East = $400K, West = $350K, Central = $250K." Boss says: "Drill into East by store." Now: "Store A = $200K, Store B = $150K, Store C = $50K." Each click adds a dimension attribute as a new GROUP BY column. No special hierarchy needed!

Freedom!

You don't need pre-defined drill paths! Users can drill down using ANY dimension attribute from ANY dimension attached to the fact table. Want to drill from Year to Product Brand? Go for it!

Degenerate Dimensions

A degenerate dimension is a dimension key that lives directly in the fact table with no corresponding dimension table. The most common example: invoice numbers, order numbers, ticket numbers.

🏷️

When you buy 5 items, all 5 line items belong to the same receipt number (#R-4521). The receipt number itself has no extra attributes - it's just a grouping label. So we don't create a whole table just for receipt numbers. We just drop the receipt number directly into the fact table as a "degenerate dimension."

Denormalized Flattened Dimensions

In dimensional modelling, we deliberately denormalize. Instead of having separate tables for Product → Brand → Category → Department, we smash them all into ONE flat dimension table.

📋

Instead of: "Product #123 → look up brand in brand table → look up category in category table → look up department in department table" (3 JOINs!), we just store: "Product: Organic Milk, Brand: Happy Cow, Category: Dairy, Department: Food" all in one row. One table. Zero JOINs. Fast and simple!

🧠

Resist the urge to normalize! If you've spent years in OLTP database design, denormalization will feel "wrong." But in dimensional modelling, it's absolutely the RIGHT approach. Simplicity + Speed = Happy Users!

Multiple Hierarchies, Flags & Null Attributes

Multiple Hierarchies: A single dimension can have many hierarchies living side by side. A Date dimension has Day→Week→Fiscal Period AND Day→Month→Year. A Geography dimension has City→State→Country AND City→Sales Region→Division. They all coexist peacefully in one table!

Flags and Indicators: Replace cryptic codes with readable text! Instead of "status = T/F", use "is_active = Yes/No". Instead of "type = A", use "customer_type = Active Premium Member". When codes have embedded meaning (like "NYC-RET-001"), break them into separate columns: city = New York, type = Retail, sequence = 001.

Null Attributes: Avoid NULL in dimension attributes! Use descriptive strings like "Unknown" or "Not Applicable" instead. Different databases handle NULLs in GROUP BY and WHERE clauses inconsistently, which leads to bugs.

Calendar Date Dimensions

The date dimension is the most common dimension - attached to virtually every fact table. It's a pre-built calendar that makes date-based analysis a breeze.

📆

Ever tried to figure out when Easter falls using math? Good luck! That's why we have a date dimension table - it pre-computes everything: "Is this a holiday? What fiscal quarter? What week number? Is it a weekend?" You'd never want to calculate this stuff in SQL every time. Just look it up!

dim_date

date_key (20250216)

full_date (2025-02-16)

day_of_week (Sunday)

day_number (16)

week_number (7)

month_name (February)

quarter (Q1)

year (2025)

fiscal_period (FP11)

is_holiday (No)

is_weekend (Yes)

holiday_name (N/A)

A date dimension - pre-computed calendar attributes. The primary key can be YYYYMMDD for easy partitioning.

If you need time-of-day precision, add a separate timestamp column to the fact table. If users filter by day parts (morning shift vs. evening shift), create a separate time-of-day dimension.

Role-Playing, Junk, Snowflaked & Outrigger Dimensions

Role-Playing Dimensions

One physical dimension table used multiple times in the same fact table. Example: An order has an "order date," a "ship date," and a "delivery date" - all pointing to the SAME date dimension through different foreign keys. Each reference is a separate "role" (view with unique column names).

Junk Dimensions

Miscellaneous low-cardinality flags (payment method, order priority, gift wrap yes/no) combined into a single "transaction profile" dimension instead of creating dozens of tiny separate dimensions.

🗑️

Think of a junk drawer in your kitchen - all the small random stuff that doesn't deserve its own drawer (tape, batteries, pens). Same idea: all the small flags that don't deserve their own dimension table get combined into one "junk dimension." Despite the name, it's a perfectly valid and useful technique!

Snowflaked Dimensions

When dimension hierarchies are normalized into separate tables (Product → Brand table → Category table). Called "snowflake" because it looks like a snowflake instead of a star.

❌

AVOID snowflaking! It makes queries harder to write, confuses business users, and can hurt performance. A flat denormalized dimension contains exactly the same information and is much easier to use.

Outrigger Dimensions

A dimension that references another dimension. Example: A bank account dimension has an "account_open_date" that points to the date dimension. Use sparingly! In most cases, put both dimensions as separate foreign keys in the fact table instead.

4

Integration via Conformed Dimensions

This is how you connect different parts of the business together. The secret sauce of enterprise data warehousing!

Conformed Dimensions

Conformed dimensions are the glue that holds the entire enterprise data warehouse together. When the same dimension (e.g., Customer, Product, Date) is shared across multiple fact tables with identical column names and values, you can combine reports across different business processes.

🔌

Think of electrical outlets. Every device in your house uses the SAME type of plug. Because they're "conformed" to one standard, you can plug your TV, phone charger, and blender into any outlet. If every room had a different plug type, chaos! Conformed dimensions are the universal plug standard of data warehousing.

The Essence of Integration

When you use the same conformed dimension as a row header in a drill-across report, results from separate fact tables align perfectly on the same rows. This is what makes enterprise analytics possible. Build once, reuse everywhere!

Shrunken Rollup Dimensions & Drilling Across

Shrunken Rollup Dimensions are subsets of a full dimension - fewer rows and/or columns. Used for aggregate fact tables or when a business process captures data at a higher level (e.g., forecasts by month+brand instead of day+product).

Drilling Across means querying two or more fact tables separately, then merging the results on common conformed dimension attributes. The key: each query runs independently, and the answers are aligned by sort-merging on shared row headers.

🧩

Think of two different spreadsheets: Sales by Product and Returns by Product. You can't just JOIN them (that creates messy duplicates!). Instead, run each query separately, get the answers, then merge them side by side: "Product A: Sales=$100, Returns=$5. Product B: Sales=$200, Returns=$12." That's drilling across!

Value Chain & Enterprise Bus Architecture

The Value Chain is the natural flow of business processes in your organization. For retail: Purchasing → Warehousing → Retail Sales. For finance: Budgeting → Commitments → Payments. Each step produces its own fact table.

The Enterprise Data Warehouse Bus Architecture provides an incremental, manageable approach to building the warehouse. It says: "Focus on one business process at a time, but make sure your dimensions are conformed across all of them." This way, each piece works alone AND integrates with everything else.

The Bus Matrix: rows = business processes, columns = dimensions. Shared dimensions enable cross-process analytics!

Opportunity/Stakeholder Matrix

A variation of the Bus Matrix where dimension columns are replaced with business functions (Marketing, Sales, Finance, Operations). The shaded cells show which business groups care about which processes.

📋

It's an RSVP list for your data modelling workshops! "For the Sales fact table design session, invite: Marketing (they want campaign analysis), Sales (they want commission tracking), and Finance (they want revenue reports)." This matrix makes sure no stakeholder gets left out.

5

Slowly Changing Dimensions (SCD)

Data changes over time. People move, products get renamed, prices change. How do we handle this? With 8 different techniques!

🏠

Imagine a customer named Sarah. She lives in Dubai, has a Silver membership, and her email is sarah@email.com. One day she moves to Abu Dhabi and upgrades to Gold. How do we record this change? Do we update the old record? Keep both? That's what SCD types are all about!

Type 0

Retain Original

The value NEVER changes. Whatever was first recorded stays forever.

Use When

"Original credit score", date of birth, first hire date. Things labeled "original" that should never be overwritten.

Type 1

Overwrite

Simply overwrite the old value with the new one. History is destroyed.

🗑️

Like correcting a typo - just erase and rewrite. "Sarah Dubai" becomes "Sarah Abu Dhabi". The old address vanishes. Simple but lossy!

Type 2

Add New Row

The MOST important SCD! Creates a NEW row with a new surrogate key. Old row is marked as expired. Full history preserved!

How It Works

Row 1: Sarah, Dubai, effective 2020-01-01, expired 2025-01-15, current=N
Row 2: Sarah, Abu Dhabi, effective 2025-01-16, expired 9999-12-31, current=Y
Three extra columns: effective_date, expiration_date, current_row_indicator

Type 3

Add New Attribute

Add a NEW column to store the old value. One row, but with both old and new values side by side. Called "alternate reality."

Example

One row: Sarah | current_city = Abu Dhabi | previous_city = Dubai. Users can filter by either! Rarely used because it only tracks ONE change.

Type 4

Add Mini-Dimension

For rapidly changing attributes in huge dimensions. Split the fast-changing columns into a separate "mini-dimension" table.

📦

Customer dimension has 10M rows. Age band and income band change frequently. Instead of creating 10M new Type 2 rows every time, pull those columns into a small "demographics mini-dimension" with maybe 1,000 rows.

Type 5

Mini-Dimension + Type 1 Outrigger

Builds on Type 4: the mini-dimension is also embedded as a Type 1 outrigger in the base dimension, so you can see CURRENT mini-dimension values without going through the fact table.

Type 6

Type 1 + Type 2 Hybrid

Adds Type 1 (current value) columns to a Type 2 dimension. Every row has both the "as was" value and the "as is" value. Filter by either one!

Example

Row 1: Sarah, historical_city=Dubai, current_city=Abu Dhabi, eff=2020
Row 2: Sarah, historical_city=Abu Dhabi, current_city=Abu Dhabi, eff=2025
The "current_city" is overwritten (Type 1) on ALL rows when it changes.

Type 7

Dual Type 1 + Type 2

The fact table has TWO keys to the same dimension: a surrogate key (for Type 2 "as-was" view) AND a durable key (for Type 1 "as-is" view). Two different views of the same dimension for maximum flexibility.

Quick Check!

A customer changes their address. You need to see what their address was at the time of each order AND their current address. Which SCD type?

6

Dealing with Dimension Hierarchies

From simple Product→Brand→Category to complex organizational structures with variable depth!

Fixed Depth Positional Hierarchies

A fixed depth hierarchy has a predictable, consistent number of levels with agreed-upon names. Example: Product → Brand → Category → Department. Always exactly 4 levels, always the same names.

🏢

Think of a company org chart with exactly 4 levels: CEO → VP → Director → Manager. Every person fits into exactly one of these levels. Easy! Each level gets its own column in the dimension table: ceo_name, vp_name, director_name, manager_name. Simple, fast, and predictable.

Best Practice

Fixed depth hierarchies are BY FAR the easiest to query and navigate. If your hierarchy fits this pattern, always use positional attributes in a flat dimension table. Don't overcomplicate it!

Slightly Ragged / Variable Depth Hierarchies

Sometimes the hierarchy depth varies a little bit - like geography: some places have City → State → Country (3 levels), while others have City → Province → Region → Country (4 levels).

🗺️

Some countries have states, some have provinces, some have regions. The depth varies from 3 to maybe 6 levels. Rather than using complex recursive queries, just create columns for the MAXIMUM depth and fill in the blanks with business rules. Level 4 doesn't exist? Repeat Level 3's value. Simple!

Ragged / Variable Depth Hierarchies

The really tricky ones! Organization charts, bill of materials, account charts - where the depth is completely unpredictable. Might be 2 levels, might be 20.

Solution 1: Bridge Table - A special helper table that stores every possible path through the hierarchy. One row for each "from-to" path. Enables all hierarchy queries with standard SQL.

Solution 2: Pathstring Attribute - A single column storing the full path as a string: "CEO|VP Sales|Director East|Manager NY". Can handle many hierarchy queries with LIKE operators, but less flexible for substituting alternative hierarchies.

⚠️

Ragged hierarchies are genuinely difficult in relational databases. SQL recursive queries (WITH RECURSIVE) have limitations. The bridge table approach is the most robust solution, though it requires extra ETL work.

7

Advanced Fact Table Techniques

Pro-level patterns for handling complex measurement scenarios. These are the techniques that separate beginners from experts!

Fact Table Surrogate Keys

While not required, adding a single surrogate key column to fact tables is useful. It serves as a simple primary key, helps identify rows during ETL, allows interrupted loads to resume, and converts risky UPDATE operations into safer INSERT+DELETE pairs.

Centipede Fact Tables (Anti-Pattern!)

🐛

A centipede fact table has dozens of foreign keys because someone created separate dimensions for every level of a hierarchy (date dim, month dim, quarter dim, year dim - all as separate foreign keys). DON'T DO THIS! Collapse them into one dimension at the lowest grain (just the date dimension).

🦶

If your fact table has so many foreign keys it looks like a centipede (dozens of little "legs"), something went wrong. Either collapse hierarchical dimensions back to their lowest grain, or combine miscellaneous flags into a junk dimension.

Numeric Values as Attributes or Facts

Some numbers are ambiguous - is "list price" a fact or a dimension attribute? Here's the rule:

Used for calculations? → Put it in the fact table
Used for filtering/grouping? → Put it in the dimension table
Used for both? → Put it in both! Plus add value bands in the dimension ("$0-$50", "$50-$100")

Lag/Duration Facts

In accumulating snapshots with multiple milestones, users want to analyze time lags between steps. Smart approach: store ONE lag per step (measured from the start), then any lag between two steps is a simple subtraction.

⏱️

A marathon has mile markers. Instead of recording the time between every possible pair of markers (that's hundreds of combinations!), just record the time from the start to each marker: Mile 1 = 8min, Mile 2 = 16min, Mile 3 = 25min. Time between Mile 1 and Mile 3? Just subtract: 25 - 8 = 17min. Simple!

Header/Line & Allocated Facts

Header/Line: Transactions often have a header (the order) and lines (individual items). In the fact table, include ALL header-level dimension keys on every line-level row. Don't force users to join back to a header table!

Allocated Facts: When a fact exists at the header level (like shipping cost for the whole order), allocate it down to the line level proportionally. This way, every fact can be sliced by every dimension.

📦

You order 3 items online with $12 shipping. How do you split the shipping across items? By price proportion: Item A ($60, 60%) gets $7.20 shipping, Item B ($30, 30%) gets $3.60, Item C ($10, 10%) gets $1.20. Now you can analyze shipping cost by product, category, anything!

P&L, Multiple Currencies, Units of Measure & YTD

Profit & Loss Tables: Revenue - Costs = Profit at atomic grain. Incredibly powerful but politically charged since cost allocation requires executive buy-in. Not a first-phase project!

Multiple Currencies: Store PAIRS of columns: one in the transaction's original currency, one in a standard currency (USD). Add a currency dimension to identify the original currency.

Multiple Units of Measure: Store facts once in a standard unit, plus conversion factors for all other units. Deploy views for each user group with their preferred unit. The conversion factors MUST live in the fact row for accuracy.

Year-to-Date: Don't store YTD in fact tables! Calculate it in the BI layer instead. YTD is a moving target - "YTD as of fiscal period end" is different from "YTD as of today" - and storing it creates maintenance nightmares.

Multipass SQL, Timespan Tracking & Late Arriving Facts

Multipass SQL (NEVER join fact-to-fact!): If you need to compare shipments and returns, DON'T join the two fact tables directly. It creates uncontrollable row explosions with wrong results. Instead, query each fact table separately and merge the results on common dimension attributes. This is called "drill-across."

💥

Joining two fact tables across foreign keys is one of the most common and dangerous mistakes in dimensional modelling. The cardinality explodes and results are WRONG. Always use drill-across (separate queries, then merge)!

Timespan Tracking: In rare cases, add effective/expiration dates to fact rows (like SCD Type 2 for facts). Useful for slowly changing balances where periodic snapshots would create identical duplicate rows.

Late Arriving Facts: When a fact row arrives late (delayed data feed), you need to find the dimension keys that were effective at the TIME the event actually occurred, not when it arrived. Search the dimension's history to find the right surrogate keys.

8

Advanced Dimension Table Techniques

Expert-level patterns for complex real-world scenarios. Multivalued relationships, behavior tracking, and more!

Dimension-to-Dimension Table Joins

When a dimension references another (via outrigger), Type 2 changes in the outrigger can force explosive Type 2 growth in the base dimension. Solution: demote the relationship to the fact table, where both dimensions are separate foreign keys.

🔗

Imagine the Account dimension links to the Branch dimension via an outrigger. Every time a branch changes (name, manager), ALL accounts at that branch need new Type 2 rows. That's millions of unnecessary rows! Instead, put both account_key AND branch_key directly in the fact table. The correlation is discovered through the fact table, which is perfectly fine for periodic snapshots.

Multivalued Dimensions & Bridge Tables

Normally, each dimension has ONE value per fact row. But sometimes there are legitimately MULTIPLE values - like a patient with 3 simultaneous diagnoses.

🏥

Patient John visits the hospital. He has: diabetes, high blood pressure, AND asthma. One treatment event, but THREE diagnoses. We can't put all three in one dimension column! Solution: a bridge table that maps a "diagnosis group key" to multiple individual diagnosis rows. The fact table points to the group, the bridge table expands the group into its members.

Bridge tables may also need effective/expiration dates (like Type 2 SCD) when the multivalued relationships change over time, such as bank accounts linked to multiple customers.

Behavior Tag Time Series & Study Groups

Behavior Tags: Data mining produces textual behavior labels (like "High Spender", "Bargain Hunter", "Churning") assigned periodically. Store these as positional attributes in the customer dimension: behavior_jan, behavior_feb, etc. They're used for pattern matching, not math.

Study Groups: Complex behavior analysis (like "customers who bought Product A, then returned Product B within 30 days, and haven't bought in 90 days") is too complex for normal queries. Run the analysis once, save the results in a simple table of customer durable keys, and use that table as a filter in any schema.

🔬

Study groups are like "saved playlists" for customer segments. A data scientist spends days finding "customers likely to churn." Instead of re-running that analysis in every report, they save the list. Now any report can filter: "Show me sales ONLY for the 'likely to churn' group." You can even combine groups: "churning AND high-value" using set operations!

Aggregated Facts as Attributes, Value Banding & Text Comments

Aggregated Facts as Dimension Attributes: "Show me only customers who spent over $10,000 last year." To enable this, pre-calculate aggregated metrics and store them in the dimension table, often as banded ranges: "Lifetime Value = $10K-$25K".

Dynamic Value Banding: Reports grouped by value ranges ("Balance $0-$10", "$10-$25", etc.) where the ranges are defined at query time using a small banding dimension table joined via greater-than/less-than conditions, or SQL CASE statements.

Text Comments: Free-form text comments don't belong in fact tables! Store them in a separate comments dimension with a foreign key from the fact table. This keeps the fact table lean and fast.

More Advanced Techniques: Time Zones, Measure Types, Steps & More

Multiple Time Zones: Use dual foreign keys in the fact table - one for UTC standard time and one for local time, each pointing to role-playing date dimensions.

Measure Type Dimensions: When a fact table has hundreds of sparsely populated fact columns, you MIGHT collapse them into a single "measure_value" column identified by a "measure_type" dimension. Generally NOT recommended because it multiplies fact table rows and makes cross-column math hard. Only use when facts number in the hundreds.

Step Dimensions: For sequential processes (web page flows), a step dimension tracks where each step fits: step_number=3, total_steps=7. Helps analyze drop-off points!

Hot Swappable Dimensions: Same fact table, different dimension versions for different users. A stock ticker fact table exposed to multiple investors, each with their own proprietary attributes on the same stocks.

Abstract Generic Dimensions (AVOID!): Don't create a single "Person" dimension for employees, customers, and vendors. Don't create a generic "Location" dimension for stores, warehouses, and customer addresses. These have different attributes and create large, confusing tables. Keep them separate!

🚫

Data abstraction is fine in OLTP source systems and ETL processing, but it HURTS query performance and legibility in dimensional models. Keep your dimensions specific and meaningful!

Audit Dimensions: Track ETL metadata! When a fact row is created, stamp it with: data quality score, ETL code version, processing timestamp, source system identifier. Invaluable for compliance and debugging.

Late Arriving Dimensions: When facts arrive before their dimension context (real-time systems), create placeholder dimension rows with the natural key and "Unknown" for descriptive attributes. When the real data arrives later, overwrite with Type 1. For retroactive Type 2 changes, insert a new historical dimension row and restate affected fact rows.

9

Special Purpose Schemas

Unique patterns for heterogeneous products, real-time data, and data quality management!

Supertype & Subtype Schemas

When a business offers wildly different products (checking accounts, mortgages, business loans - all "accounts" but completely different), a single fact table with every possible column is impossible.

🏦

A bank has checking accounts, savings, credit cards, mortgages, and business loans. They're all "accounts" but have completely different measurements. You can't put "interest rate", "monthly payment", "credit limit", "overdraft fee", and "loan balance" all in one table - most columns would be empty!

Solution: Build a supertype (core) fact table with common facts shared by ALL account types (balance, transaction count), plus separate subtype (custom) fact tables for each account type with their unique facts. Same pattern for dimensions: one core dimension with common attributes, plus subtype-specific dimension tables.

Supertype/Subtype pattern: One core table for shared metrics, separate tables for product-specific details

Real-Time Fact Tables

When the nightly batch isn't fast enough and you need data updated in real-time (or near real-time). Techniques include:

Hot partitions: A special partition pinned in memory, without indexes or aggregations, for incoming real-time data
Deferred updating: Let running queries complete, then apply updates
OLAP cube refresh: Some OLAP engines support incremental updates

🏎️

The nightly batch is like a newspaper - printed once a day. Real-time is like a live Twitter feed - always updating. The "hot partition" is like a breaking news ticker at the bottom of the TV screen - fast and temporary, while the main program (batch) runs on schedule.

Error Event Schemas

Data quality requires a system of screens/filters that test data as it flows from source to warehouse. When errors are detected, they're recorded in a special dimensional schema in the ETL back room.

Two related tables:

Error Event Fact Table: One row per error detected (grain = individual error event)
Error Event Detail Fact Table: One row per column in each table that participated in an error (grain = column level)

🔍

Think of airport security. Every bag goes through an X-ray machine (data quality screen). When something suspicious is found, it's logged: "Bag #4521, Screen: Metal Detector, Error: Prohibited Item, Time: 3:42 PM." This error log lets you track data quality trends, find problematic sources, and demonstrate compliance to auditors.

🎉

You Made It!

You've just covered ALL the Kimball Dimensional Modeling Techniques - from basic star schemas to advanced bridge tables and error event schemas. That's the complete toolkit used by data warehouse professionals worldwide since 1996!

9

Major Sections

60+

Techniques Covered

8

SCD Types

3

Fact Table Types

Data Modelling for Dummies

What's Inside This Guide?

1. Fundamental Concepts

2. Basic Fact Tables

3. Basic Dimension Tables

4. Conformed Dimensions

5. Slowly Changing Dimensions

6. Dimension Hierarchies

7. Advanced Fact Tables

8. Advanced Dimension Tables

9. Special Purpose Schemas

Fundamental Concepts

Gather Business Requirements & Data Realities

Collaborative Dimensional Modeling Workshops

The Four-Step Dimensional Design Process

Business Processes

Grain - The Most Critical Decision!

Dimensions for Descriptive Context

Facts for Measurements

Star Schemas and OLAP Cubes

Graceful Extensions to Dimensional Modelling

Quick Check - Do You Get It?

Basic Fact Table Techniques

Fact Table Structure

Additive, Semi-Additive, and Non-Additive Facts

Nulls in Fact Tables

Conformed Facts

The Three Types of Fact Tables

Transaction Fact Tables

Periodic Snapshot Fact Tables

Accumulating Snapshot Fact Tables

Factless Fact Tables

Aggregate & Consolidated Fact Tables

Basic Dimension Table Techniques

Dimension Table Structure

Dimension Surrogate Keys

Natural, Durable, and Supernatural Keys

Drilling Down

Degenerate Dimensions

Denormalized Flattened Dimensions

Multiple Hierarchies, Flags & Null Attributes

Calendar Date Dimensions

Role-Playing, Junk, Snowflaked & Outrigger Dimensions

Role-Playing Dimensions

Junk Dimensions

Snowflaked Dimensions

Outrigger Dimensions

Integration via Conformed Dimensions

Conformed Dimensions

Shrunken Rollup Dimensions & Drilling Across

Value Chain & Enterprise Bus Architecture

Opportunity/Stakeholder Matrix

Slowly Changing Dimensions (SCD)

Retain Original

Overwrite

Add New Row

Add New Attribute

Add Mini-Dimension

Mini-Dimension + Type 1 Outrigger

Type 1 + Type 2 Hybrid

Dual Type 1 + Type 2

Quick Check!

Dealing with Dimension Hierarchies

Fixed Depth Positional Hierarchies

Slightly Ragged / Variable Depth Hierarchies

Ragged / Variable Depth Hierarchies

Advanced Fact Table Techniques

Fact Table Surrogate Keys

Centipede Fact Tables (Anti-Pattern!)

Numeric Values as Attributes or Facts

Lag/Duration Facts

Header/Line & Allocated Facts

P&L, Multiple Currencies, Units of Measure & YTD

Multipass SQL, Timespan Tracking & Late Arriving Facts

Advanced Dimension Table Techniques

Dimension-to-Dimension Table Joins

Multivalued Dimensions & Bridge Tables

Behavior Tag Time Series & Study Groups

Aggregated Facts as Attributes, Value Banding & Text Comments

More Advanced Techniques: Time Zones, Measure Types, Steps & More