tim laqua dot com Thoughts and Code from Tim Laqua

27Feb/124

Slowly Changing Dimensions with MD5 Hashes in SSIS

We recently moved away from the 3rd party Checksum component (and all 3rd party components) in SSIS and I wanted to share the pattern we settled on for maintaining our Type 1 Slowly Changing Dimensions (SCDs). There are two things we wanted to address with our new pattern. First, our previous implementation wasn't performing as well as we needed it to or generating reliable checksums. The second was that we wanted to get away from dependencies on custom assemblies in general. To illustrate the pattern, we're going to build a SCD package off the Adventure Works DW DimCustomer table and skip over the actual source of the business keys and attributes by selecting directly from the completed dimension for now.

First, we assume that our dimension already exists (and we were using some other checksum or MERGE to maintain it). We have to add a column to store the MD5 hash:

ALTER TABLE dbo.DimCustomer ADD
	MD5 VARCHAR(34) NOT NULL DEFAULT ''

Second, we need a staging table to store updated/changed rows. Script out the current dimension as a CREATE, remove all unneeded constraints and indexes, and create a staging table as a heap:

CREATE TABLE [dbo].[Staging_DimCustomer_UpdatedRows](
	[CustomerKey] [int] NOT NULL,
	[GeographyKey] [int] NULL,
	[CustomerAlternateKey] [nvarchar](15) NOT NULL,
	[Title] [nvarchar](8) NULL,
	[FirstName] [nvarchar](50) NULL,
	[MiddleName] [nvarchar](50) NULL,
	[LastName] [nvarchar](50) NULL,
	[NameStyle] [bit] NULL,
	[BirthDate] [datetime] NULL,
	[MaritalStatus] [nchar](1) NULL,
	[Suffix] [nvarchar](10) NULL,
	[Gender] [nvarchar](1) NULL,
	[EmailAddress] [nvarchar](50) NULL,
	[YearlyIncome] [money] NULL,
	[TotalChildren] [tinyint] NULL,
	[NumberChildrenAtHome] [tinyint] NULL,
	[EnglishEducation] [nvarchar](40) NULL,
	[SpanishEducation] [nvarchar](40) NULL,
	[FrenchEducation] [nvarchar](40) NULL,
	[EnglishOccupation] [nvarchar](100) NULL,
	[SpanishOccupation] [nvarchar](100) NULL,
	[FrenchOccupation] [nvarchar](100) NULL,
	[HouseOwnerFlag] [nchar](1) NULL,
	[NumberCarsOwned] [tinyint] NULL,
	[AddressLine1] [nvarchar](120) NULL,
	[AddressLine2] [nvarchar](120) NULL,
	[Phone] [nvarchar](20) NULL,
	[DateFirstPurchase] [datetime] NULL,
	[CommuteDistance] [nvarchar](15) NULL,
	[MD5] [varchar](34) NOT NULL)

Now in to SSIS - We will be building:

  1. Execute SQL Task to Truncate our Staging table(s)
  2. Data Flow Task to Insert new rows and Stage updated rows
  1. OLE DB Source to retrieve our source data
  2. Script Component to Generate Row Numbers
  3. Conditional Split to Evenly Distribute Rows
  4. Script Component to Generate MD5 Hashes
  5. Union All to Squish it all back together
  6. Lookup to get the existing MD5 Hash (if it exists)
  7. Conditional Split to separate Unchanged and Changed rows
  8. RowCount Transformation
  9. OLE DB Destination for Changed rows
  10. OLE DB Destination for New rows
  • Execute SQL Task to Update changed rows
  • Completed Control Flow

    Completed Data Flow