tim laqua dot com Thoughts and Code from Tim Laqua

27Feb/124

Slowly Changing Dimensions with MD5 Hashes in SSIS

We recently moved away from the 3rd party Checksum component (and all 3rd party components) in SSIS and I wanted to share the pattern we settled on for maintaining our Type 1 Slowly Changing Dimensions (SCDs). There are two things we wanted to address with our new pattern. First, our previous implementation wasn't performing as well as we needed it to or generating reliable checksums. The second was that we wanted to get away from dependencies on custom assemblies in general. To illustrate the pattern, we're going to build a SCD package off the Adventure Works DW DimCustomer table and skip over the actual source of the business keys and attributes by selecting directly from the completed dimension for now.

First, we assume that our dimension already exists (and we were using some other checksum or MERGE to maintain it). We have to add a column to store the MD5 hash:

ALTER TABLE dbo.DimCustomer ADD
	MD5 VARCHAR(34) NOT NULL DEFAULT ''

Second, we need a staging table to store updated/changed rows. Script out the current dimension as a CREATE, remove all unneeded constraints and indexes, and create a staging table as a heap:

CREATE TABLE [dbo].[Staging_DimCustomer_UpdatedRows](
	[CustomerKey] [int] NOT NULL,
	[GeographyKey] [int] NULL,
	[CustomerAlternateKey] [nvarchar](15) NOT NULL,
	[Title] [nvarchar](8) NULL,
	[FirstName] [nvarchar](50) NULL,
	[MiddleName] [nvarchar](50) NULL,
	[LastName] [nvarchar](50) NULL,
	[NameStyle] [bit] NULL,
	[BirthDate] [datetime] NULL,
	[MaritalStatus] [nchar](1) NULL,
	[Suffix] [nvarchar](10) NULL,
	[Gender] [nvarchar](1) NULL,
	[EmailAddress] [nvarchar](50) NULL,
	[YearlyIncome] [money] NULL,
	[TotalChildren] [tinyint] NULL,
	[NumberChildrenAtHome] [tinyint] NULL,
	[EnglishEducation] [nvarchar](40) NULL,
	[SpanishEducation] [nvarchar](40) NULL,
	[FrenchEducation] [nvarchar](40) NULL,
	[EnglishOccupation] [nvarchar](100) NULL,
	[SpanishOccupation] [nvarchar](100) NULL,
	[FrenchOccupation] [nvarchar](100) NULL,
	[HouseOwnerFlag] [nchar](1) NULL,
	[NumberCarsOwned] [tinyint] NULL,
	[AddressLine1] [nvarchar](120) NULL,
	[AddressLine2] [nvarchar](120) NULL,
	[Phone] [nvarchar](20) NULL,
	[DateFirstPurchase] [datetime] NULL,
	[CommuteDistance] [nvarchar](15) NULL,
	[MD5] [varchar](34) NOT NULL)

Now in to SSIS - We will be building:

  1. Execute SQL Task to Truncate our Staging table(s)
  2. Data Flow Task to Insert new rows and Stage updated rows
    1. OLE DB Source to retrieve our source data
    2. Script Component to Generate Row Numbers
    3. Conditional Split to Evenly Distribute Rows
    4. Script Component to Generate MD5 Hashes
    5. Union All to Squish it all back together
    6. Lookup to get the existing MD5 Hash (if it exists)
    7. Conditional Split to separate Unchanged and Changed rows
    8. RowCount Transformation
    9. OLE DB Destination for Changed rows
    10. OLE DB Destination for New rows
  3. Execute SQL Task to Update changed rows

Completed Control Flow

Completed Data Flow

1: Execute SQL Task to Truncate our Staging table(s)
TRUNCATE TABLE Staging_DimCustomer_UpdatedRows;
2.1: OLE DB Source to retrieve our source data

Configure the Connection Manager - for the source we'll just select from the dimension minus the surrogate key and MD5 column:

You most likely won't need to exclude any columns - here we unselect the surrogate key and the MD5 column:

2.2: Script Component to Generate Row Numbers

Add one output column to Output 0 in your Row Number Script Transformation

The following code will assign the current row number and increment it for the next row through the buffer:

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
 
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private int _rowCount = 1;
 
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        Row.RowNumber = _rowCount++;
    }
}
2.3: Conditional Split to Evenly Distribute Rows

Now in the conditional split, we distribute rows across multiple outputs by using the modulo operator. Modify to suit the number of threads you need to remove any bottlenecks in the MD5 calculation (you may only need one thread):

2.4: Script Component to Generate MD5 Hashes

Add one output column to Output 0 in your Generate MD5 Script Transformation

Select all input columns or just select ones that are going to be in the hash. Since we explicitly specify columns in the script, the hash will only be generated with exactly what columns you tell it to use.

The following script will generate a string representation of a MD5 hash. We do this so that we can do direct comparisons with HASHBYTES converted output (and virtually everything likes dealing with strings better than binary). We do end up doubling the storage cost of the MD5 hash by storing it as a string, but we were fine with that:

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
 
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private System.Security.Cryptography.MD5 _md5 =
                new System.Security.Cryptography.MD5CryptoServiceProvider();
 
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        try
        {
            string hashSource =
                (Row.GeographyKey_IsNull ? "" : Row.GeographyKey.ToString())
                + (Row.Title_IsNull ? "" : Row.Title)
                + (Row.FirstName_IsNull ? "" : Row.FirstName)
                + (Row.MiddleName_IsNull ? "" : Row.MiddleName)
                + (Row.LastName_IsNull ? "" : Row.LastName)
                + (Row.NameStyle_IsNull ? "" : Row.NameStyle.ToString())
                + (Row.BirthDate_IsNull ? "" : Row.BirthDate.ToString())
                + (Row.MaritalStatus_IsNull ? "" : Row.MaritalStatus)
                + (Row.Suffix_IsNull ? "" : Row.Suffix)
                + (Row.Gender_IsNull ? "" : Row.Gender)
                + (Row.EmailAddress_IsNull ? "" : Row.EmailAddress)
                + (Row.YearlyIncome_IsNull ? "" : Row.YearlyIncome.ToString())
                + (Row.TotalChildren_IsNull ? "" : Row.TotalChildren.ToString())
                + (Row.NumberChildrenAtHome_IsNull ? "" : Row.NumberChildrenAtHome.ToString())
                + (Row.EnglishEducation_IsNull ? "" : Row.EnglishEducation)
                + (Row.SpanishEducation_IsNull ? "" : Row.SpanishEducation)
                + (Row.FrenchEducation_IsNull ? "" : Row.FrenchEducation)
                + (Row.EnglishOccupation_IsNull ? "" : Row.EnglishOccupation)
                + (Row.SpanishOccupation_IsNull ? "" : Row.SpanishOccupation)
                + (Row.FrenchOccupation_IsNull ? "" : Row.FrenchOccupation)
                + (Row.HouseOwnerFlag_IsNull ? "" : Row.HouseOwnerFlag)
                + (Row.NumberCarsOwned_IsNull ? "" : Row.NumberCarsOwned.ToString())
                + (Row.AddressLine1_IsNull ? "" : Row.AddressLine1)
                + (Row.AddressLine2_IsNull ? "" : Row.AddressLine2)
                + (Row.Phone_IsNull ? "" : Row.Phone)
                + (Row.DateFirstPurchase_IsNull ? "" : Row.DateFirstPurchase.ToString())
                + (Row.CommuteDistance_IsNull ? "" : Row.CommuteDistance)
                ;
 
            byte[] hashBytes = _md5.ComputeHash(
                System.Text.UnicodeEncoding.Unicode.GetBytes(hashSource));
 
            System.Text.StringBuilder sb = new System.Text.StringBuilder();
 
            for (int i = 0; i < hashBytes.Length; i++)
            {
                sb.Append(hashBytes[i].ToString("X2"));
            }
 
            Row.MD5 = "0x" + sb.ToString();
        }
        catch (Exception e)
        {
            Row.MD5 = "Error";
        }
    }
}

To quickly generate the string concatenation statement:

  1. Script DimCustomer As SELECT to Clipboard and paste in to Notepad++
  2. Search > Replace... (Ctrl+H)
  3. Select Regular expression for Search Mode
  4. Find: ^.*?\[(.*?)\].*$
  5. Replace: + (Row.\1_IsNull ? "" : Row.\1)
  6. Replace All, remove the leading +, and remove the garbage at the bottom, remove surrogate and business keys
  7. Paste results in to the script
  8. Add .ToString() to non-string columns and fix any column names (issues will be underlined in red)

Copy your completed script component once for each thread you'll be running. When you copy a script component, you need to actually open the script after copying it and click Edit Script before it will validate.

2.5: Union All to Squish it all back together
2.6: Lookup to get the existing MD5 Hash (if it exists)

Configure the Lookup component to redirect rows with no matches to the No Match output:

For your connection, select the surrogate key, business key(s), and MD5 hash from your existing dimension table:

Lookup based on the business key(s) and bring back the existing MD5 and surrogate key

2.7: Conditional Split to separate Unchanged and Changed rows

Name the output for rows where the ExistingMD5 is the same as the new MD5 "Ignore" and name the Default output "Update"

2.8: RowCount Transformation for Ignored Rows

For your Ignored Rows Row Count transformation, you'll have to create a variable to store the rowcount in (we don't use it, but that's just how the transformation works):

2.9: OLE DB Destination for Changed rows

We'll insert rows that need to be updated in to the previously created Staging_DimCustomer_UpdatedRows table:

2.10: OLE DB Destination for New rows

Finally, insert new rows directly in to the dimension:

3: Execute SQL Task to Update changed rows

The final step is to take those rows we staged for updating and actually perform the update:

UPDATE a SET
 [GeographyKey] = b.[GeographyKey]
,[CustomerAlternateKey] = b.[CustomerAlternateKey]
,[Title] = b.[Title]
,[FirstName] = b.[FirstName]
,[MiddleName] = b.[MiddleName]
,[LastName] = b.[LastName]
,[NameStyle] = b.[NameStyle]
,[BirthDate] = b.[BirthDate]
,[MaritalStatus] = b.[MaritalStatus]
,[Suffix] = b.[Suffix]
,[Gender] = b.[Gender]
,[EmailAddress] = b.[EmailAddress]
,[YearlyIncome] = b.[YearlyIncome]
,[TotalChildren] = b.[TotalChildren]
,[NumberChildrenAtHome] = b.[NumberChildrenAtHome]
,[EnglishEducation] = b.[EnglishEducation]
,[SpanishEducation] = b.[SpanishEducation]
,[FrenchEducation] = b.[FrenchEducation]
,[EnglishOccupation] = b.[EnglishOccupation]
,[SpanishOccupation] = b.[SpanishOccupation]
,[FrenchOccupation] = b.[FrenchOccupation]
,[HouseOwnerFlag] = b.[HouseOwnerFlag]
,[NumberCarsOwned] = b.[NumberCarsOwned]
,[AddressLine1] = b.[AddressLine1]
,[AddressLine2] = b.[AddressLine2]
,[Phone] = b.[Phone]
,[DateFirstPurchase] = b.[DateFirstPurchase]
,[CommuteDistance] = b.[CommuteDistance]
,[MD5] = b.[MD5] 
FROM DimCustomer a
	 INNER JOIN Staging_DimCustomer_UpdatedRows b
		ON a.CustomerKey = b.CustomerKey

To quickly generate the update statement:

  1. Script DimCustomer As SELECT to Clipboard and paste in to Notepad++
  2. Search > Replace... (Ctrl+H)
  3. Select Regular expression for Search Mode
  4. Find: ^.*?\[(.*?)\].*$
  5. Replace: ,[\1] = b.[\1]
  6. Replace All, remove surrogate key and garbage at the bottom, remove leading comma
  7. Add UPDATE a SET at the top
  8. Add appropriate FROM clause at the bottom

And bob's your uncle.

In closing...

Why not write a reusable code block for the MD5 hash that you could cut and paste? Why hardcode columns?
Well, there's really only two ways to do it dynamically. One is to loop over each column in the InputBuffer and concatenate them. Man, that's a bunch of extra logic when we know very well what columns are, null handling is a little sketch, and proper datatype handling requires some special logic. Looping over the InputBuffer also just plain feels dirty. The second option is Reflection. Now we could approach reflection two ways - first we could iterate over the Row definition and concatenate the columns for each InputRow, but this is INCREDIBLY expensive (reflection is crazy slow). The second Reflection approach would be to construct a DynamicMethod at runtime and call that dynamic method for each row. While this could certainly be as fast as hardcoding, it's very obscure and difficult for most people to understand. Do you want code in your packages that's hard to understand?

We chose hardcode the columns because this is the absolute minimum amount of work that needs to be done to generate a reliable MD5 hash and is therefore the fastest possible solution.

Why not do that Reflection and DynamicMethod thing in a custom transformation?
Because one of the main goals here was to rid ourselves of 3rd party dependencies (all custom assembly dependencies really) and we wanted to do the minimum amount of work required to generated a reliable MD5 hash.

Why bulk update at the end rather than update each row via OLE DB Command?
Because the OLE DB Command isn't set based and it can block/be blocked by inserts. Also, parameter mapping to all those Param_XX parameters and counting question marks is just terrible. So, so terrible.

How do I know how many threads I need to distribute the MD5 calculation over?
First we'll usually watch the source query in the database engine to see what it's waiting on. PREEMPTIVE_OS_WAITFORSINGLEOBJECT tends to point at SQL Server waiting to send data to SSIS because it's not accepting rows fast enough. For a starting point, take your source query and see how long it takes to just insert the entire thing in to a heap - somewhere around there is what we're shooting for. Faster if you ignore many rows, slower if you update/insert many rows.

How does it perform?
It's ridiculously fast. One of our dimensions (55 columns, varying datatypes, ~5 million deep, ~4.5 million evaluated for changes 3 times a day) went from 50+ minutes per run to 5-9 minutes per run. More importantly, we were previously using the CRC32 algorithm in the Checksum transform and that wasn't producing reliable checksums so we would either update unchanged rows or even miss changes to rows - now the hashes are dead on and reproducible via HASHBYTES.

As a side note, we were able to get Checksum to keep up with sql server by spreading it across 6 threads but the lack of reliable output forced us to abandon it.

Is there a quick way to mimic the SSIS MD5 using HASHBYTES?

  1. Script DimCustomer As SELECT to Clipboard and paste in to Notepad++
  2. Search > Replace... (Ctrl+H)
  3. Select Regular expression for Search Mode
  4. Find: ^.*?\[(.*?)\].*$
  5. Replace: +ISNULL(CAST([\1] AS NVARCHAR(255)),N'')
  6. Replace All, remove surrogate key, business key(s), leading +, and garbage at the bottom
  7. Add CONVERT(VARCHAR(34), HASHBYTES('md5', at the top
  8. Add ),1) AS [Hashbytes] and an appropriate FROM clause at the bottom

Now if all your columns are chars, nchars, varchars, nvarchars, ints, and decimals - you're good. If some of your columns are the more unique datatypes (datetime, money, bit, etc) - then you've got more work to do. The issue lies in the difference between how sql server converts these datatypes to NVARCHAR and how C# converts them to strings. Here are a few examples:

Datatype SQL Conversion C# Conversion
DATETIME 1966-04-08 00:00:00.000 4/8/1966 12:00:00 AM
MONEY 70000.00 70000
BIT 1 True
BIT 0 False

The dates are the biggest debacle as they're sensitive to localization settings. If you actually had to generate the hashes using HASHBYTES on a regular basis, we would recommend modifying the date formatting when concatenating columns for the hashSource variable in the script so they match sql server. For us, we only generate MD5s using HASHBYTES for troubleshooting and updating the hash using purely dim columns rather than relying on the source as some business keys no longer exist in the source.

Here's the finished tsql for the above hash:

SELECT 
 CustomerKey
,CustomerAlternateKey
,MD5
,CONVERT(VARCHAR(34), HASHBYTES('md5',
ISNULL(CAST([GeographyKey] AS NVARCHAR(255)),N'')
+ISNULL(CAST([Title] AS NVARCHAR(255)),N'')
+ISNULL(CAST([FirstName] AS NVARCHAR(255)),N'')
+ISNULL(CAST([MiddleName] AS NVARCHAR(255)),N'')
+ISNULL(CAST([LastName] AS NVARCHAR(255)),N'')
+ISNULL(CASE WHEN [NameStyle] = 1 THEN N'True' ELSE N'False' END,N'')
+ISNULL(
CAST(DATEPART(MONTH,[BirthDate]) AS NVARCHAR(2))
+'/'+CAST(DATEPART(DAY,[BirthDate]) AS NVARCHAR(2))
+'/'+CAST(DATEPART(YEAR,[BirthDate]) AS NVARCHAR(4))
+' '+LEFT(RIGHT(CONVERT(NVARCHAR, [BirthDate], 109), 14),8)
+' '+RIGHT(CONVERT(NVARCHAR, [BirthDate], 109), 2)
,N'')
+ISNULL(CAST([MaritalStatus] AS NVARCHAR(255)),N'')
+ISNULL(CAST([Suffix] AS NVARCHAR(255)),N'')
+ISNULL(CAST([Gender] AS NVARCHAR(255)),N'')
+ISNULL(CAST([EmailAddress] AS NVARCHAR(255)),N'')
+ISNULL(CAST(CAST(YearlyIncome AS FLOAT) AS NVARCHAR(255)),N'')
+ISNULL(CAST([TotalChildren] AS NVARCHAR(255)),N'')
+ISNULL(CAST([NumberChildrenAtHome] AS NVARCHAR(255)),N'')
+ISNULL(CAST([EnglishEducation] AS NVARCHAR(255)),N'')
+ISNULL(CAST([SpanishEducation] AS NVARCHAR(255)),N'')
+ISNULL(CAST([FrenchEducation] AS NVARCHAR(255)),N'')
+ISNULL(CAST([EnglishOccupation] AS NVARCHAR(255)),N'')
+ISNULL(CAST([SpanishOccupation] AS NVARCHAR(255)),N'')
+ISNULL(CAST([FrenchOccupation] AS NVARCHAR(255)),N'')
+ISNULL(CAST([HouseOwnerFlag] AS NVARCHAR(255)),N'')
+ISNULL(CAST([NumberCarsOwned] AS NVARCHAR(255)),N'')
+ISNULL(CAST([AddressLine1] AS NVARCHAR(255)),N'')
+ISNULL(CAST([AddressLine2] AS NVARCHAR(255)),N'')
+ISNULL(CAST([Phone] AS NVARCHAR(255)),N'')
+ISNULL(
CAST(DATEPART(MONTH,DateFirstPurchase) AS NVARCHAR(2))
+'/'+CAST(DATEPART(DAY,DateFirstPurchase) AS NVARCHAR(2))
+'/'+CAST(DATEPART(YEAR,DateFirstPurchase) AS NVARCHAR(4))
+' '+LEFT(RIGHT(CONVERT(NVARCHAR, DateFirstPurchase, 109), 14),8)
+' '+RIGHT(CONVERT(NVARCHAR, DateFirstPurchase, 109), 2)
,N'')
+ISNULL(CAST([CommuteDistance] AS NVARCHAR(255)),N'')
),1) AS [Hashbytes]
FROM DimCustomer

And the results:

Comments (4) Trackbacks (2)
  1. Nice work, I had to rebuild it under 2005 version (visual basic) :-), but it works. Thx for inspiration. Peter

  2. Hi, thanks for this article. Interesting reading, the lack of hashing component is a little annoying to me. The hashing script component is very nice.

    I have coded a UDF in SQL which takes the table and schema names and generates a concatenation string for the HASHBYTES function. It works OK but is not integrated with the SSIS data flow.

  3. Sorry, the phrase in my last comment should have been: “the lack of hashing component [in SSIS] is a little annoying to me.”

  4. I used this today. 4 years later in the sql 2016 stack. Didn’t really need any modification other than some quick changes to the regular expressions for generating the hashSource string (maybe just changes in n++ regex syntax). I skipped the multithreading stuff, didn’t need it for this smaller dim, but I did want to try out that multicast transform – another day maybe.


Leave a comment