Tracking how data changes over time can be a headache — especially when you’re trying to preserve history without creating a mess. At Emerald Codeworks, we use a practical, SQL-based approach to managing Type 2 Slowly Changing Dimensions (SCDs) that keeps things clean, accurate, and easy to work with. In this post, we’ll break down how we structure our tables, detect changes, and handle the trade-offs so your data warehouse stays reliable — and your team stays sane.
At Emerald Codeworks, we spend a lot of time helping businesses make sense of their data — especially when it comes to tracking how that data changes over time. One common challenge in data warehousing is managing Slowly Changing Dimensions (SCDs) — particularly Type 2 SCDs, where we want to keep a history of changes rather than overwriting data.
This can get complex quickly, but over the years we’ve developed a practical approach that keeps things manageable (and keeps us sane). In this post, we’ll walk through how we handle Type 2 SCDs using SQL, the trade-offs we consider, and a few tips for keeping your data pipelines maintainable.
What Is a Type 2 Slowly Changing Dimension?
Let’s say you’re tracking customer information in your warehouse, and someone updates their email or moves to a new city. A Type 1 update would just overwrite the old values. But in Type 2, you preserve the history by inserting a new row with the updated values and marking the previous record as no longer current.
This lets you analyze the state of the world at any point in time — super useful for auditing, analytics, or compliance.
A Typical Schema
We add a few extra fields to our dimension tables to support Type 2 changes:
CREATE TABLE dim_customer (
customer_id INT,
customer_sk INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name VARCHAR,
email VARCHAR,
city VARCHAR,
valid_from DATE,
valid_to DATE,
is_current BOOLEAN
);
Detecting Changes (and What Counts as a Change)
We usually run a daily or hourly sync from the source system and compare it to what’s in our warehouse. The logic is:
Here’s a simplified version of how we do it in SQL (or dbt for our projects):
WITH source_data AS (
SELECT * FROM staging_customer
),
latest_existing AS (
SELECT *
FROM dim_customer
WHERE is_current = TRUE
),
changed_records AS (
SELECT src.*
FROM source_data src
LEFT JOIN latest_existing dim
ON src.customer_id = dim.customer_id
WHERE dim.customer_id IS NULL
OR src.email <> dim.email
OR src.city <> dim.city
)
-- expire existing records
UPDATE dim_customer
SET valid_to = CURRENT_DATE - 1,
is_current = FALSE
WHERE customer_id IN (SELECT customer_id FROM changed_records)
AND is_current = TRUE;
-- insert new records
INSERT INTO dim_customer (
customer_id, name, email, city, valid_from, valid_to, is_current
)
SELECT
customer_id, name, email, city,
CURRENT_DATE, '9999-12-31', TRUE
FROM changed_records;
You could also use hashes for change detection if your rows have lots of fields.
Trade-Offs to Consider
Decision | Trade-Off |
---|---|
Date ranges vs version numbers | Dates are more readable and useful for time-based queries. Version numbers are simpler for debugging but harder to use in analysis. |
Hashing rows to detect changes? | Improves performance, especially for wide tables, but introduces the risk of false positives due to precision or null handling. |
Use is_current flag? | Makes querying current records faster, but adds complexity to your update logic. |
dbt vs SQL scripts? | We lean toward dbt for transparency, reusability, and testing, but for quick jobs we’ll sometimes just write SQL. |
Our Tips for Staying Sane
CREATE VIEW dim_customer_current AS
SELECT * FROM dim_customer WHERE is_current = TRUE;
Wrapping Up
Handling Type 2 Slowly Changing Dimensions is one of those things that sounds easy until you’re doing it at scale, with real-world messiness. But with a few guardrails and solid SQL patterns, it becomes just another part of your data stack — not something to dread.
At Emerald Codeworks, we help teams untangle these kinds of data problems every day. If you’ve got messy dimensions, multiple systems, or just need to see how your data has evolved over time — we can help.
Don’t let bad data creep into your dashboards. This post breaks down how we use tools like Soda and dbt to catch data issues before they cause downstr…
According to recent reports, the total amount of data created, captured, copied, and consumed globally is projected to reach 149 zettabytes in 2024, a…
Partner with Emerald CodeWorks to build the modern data stack your business needs to grow with confidence.
Lets Talk