tim laqua dot com Thoughts and Code from Tim Laqua

23May/093

Maintaining a Type 1 Slowly Changing Dimension (SCD) using T-SQL

UPDATE 2012-03-16: Please also take a look at Slowly Changing Dimensions with MD5 Hashes in SSIS which we have determined to be fastest, most efficient approach to maintaining Type 1 dimensions.

A few days ago, one of our SSIS packages that maintained a Type 1 Slowly Changing Dimension (SCD) of about 1 million rows crept up to 15 minutes of runtime. Now this doesn't sound too bad, but this is part of our hourly batches, so 15 minutes is 25% of our entire processing window. The package was using the Slowly Changing Dimension Wizard transformation - we were doing the standard OLEDB Source (which basically represented how the SCD "should" look) and then sending it to the SCD transform and letting it figure out what needed to be inserted and updated. One option was to switch to lookups instead of the SCD wizard to speed things up, maybe even some fancy checksum voodoo for the updates (see http://blog.stevienova.com/2008/11/22/ssis-slowly-changing-dimensions-with-checksum/ for an example). Then after thinking about it a little more - why are we sending a million rows down the pipeline every hour? We know only a small percentage of these are new - and another small percentage needs to be updated. Well, we can just write a quick SQL query to get us just those sets and the package would be much more efficient!

Wait a tick - why would we give the rows to SSIS if all it is going to do insert one set and update the other? Let's just do it all in T-SQL:

The following tables and dim are fictional - I just need to make up a star schema DIM to illustrate the approach

CREATE TABLE #Temp_SCDInserts ([PartId] INT)
 
-- INSERTS for new Parts
INSERT INTO #Temp_SCDInserts
SELECT   
	[PartId]
FROM 
	Part a WITH(NOLOCK)
	INNER JOIN PartType b WITH(NOLOCK) 
		ON a.PartTypeId = b.PartTypeId
	INNER JOIN PartSupplier c WITH(NOLOCK) 
		ON a.PartSupplierId = c.PartSupplierId
	INNER JOIN PartSupplierCategory d WITH(NOLOCK) 
		ON c.PartSupplierCategoryId = d.PartSupplierCategoryId
EXCEPT
SELECT
	[PartId]	
FROM
	DW.dbo.Dim_Part WITH(NOLOCK)
 
-- UPDATES
SELECT 
	 dim.[PartKeyId]
	,a.[PartName]
	,b.[PartTypeId]
	,b.[PartTypeName]
	,c.[PartSupplierId]
	,c.[PartSupplierName]
	,d.[PartSupplierCategoryId]
	,d.[PartSupplierCategoryName]
INTO #Temp_SCDUpdates
FROM DW.dbo.Dim_Part dim WITH(NOLOCK)
	INNER JOIN Part a WITH(NOLOCK)
		ON dim.PartId = a.PartId -- Business Key
	INNER JOIN PartType b WITH(NOLOCK) 
		ON a.PartTypeId = b.PartTypeId
	INNER JOIN PartSupplier c WITH(NOLOCK) 
		ON a.PartSupplierId = c.PartSupplierId
	INNER JOIN PartSupplierCategory d WITH(NOLOCK) 
		ON c.PartSupplierCategoryId = d.PartSupplierCategoryId
WHERE
	dim.PartName <> a.PartName
	OR dim.PartTypeId <> b.PartTypeId
	OR dim.PartTypeName <> b.PartTypeName
	OR dim.PartSupplierId <> c.PartSupplierId
	OR dim.PartSupplierName <> c.PartSupplierName
	OR dim.PartSupplierCategoryId <> d.PartSupplierCategoryId
	OR dim.PartSupplierCategoryName	<> d.PartSupplierCategoryName
 
-- INSERT new records
INSERT INTO [DW].[dbo].[Dim_Part]
	([PartId]
	,[PartName]
	,[PartTypeId]
	,[PartTypeName]
	,[PartSupplierId]
	,[PartSupplierName]
	,[PartSupplierCategoryId]
	,[PartSupplierCategoryName])
SELECT 
	 a.[PartId]
	,a.[PartName]
	,b.[PartTypeId]
	,b.[PartTypeName]
	,c.[PartSupplierId]
	,c.[PartSupplierName]
	,d.[PartSupplierCategoryId]
	,d.[PartSupplierCategoryName]
FROM #Temp_SCDInserts i
	INNER JOIN Part a WITH(NOLOCK)
		ON i.PartId = a.PartId -- Business Key
	INNER JOIN PartType b WITH(NOLOCK) 
		ON a.PartTypeId = b.PartTypeId
	INNER JOIN PartSupplier c WITH(NOLOCK) 
		ON a.PartSupplierId = c.PartSupplierId
	INNER JOIN PartSupplierCategory d WITH(NOLOCK) 
		ON c.PartSupplierCategoryId = d.PartSupplierCategoryId
 
-- UPDATE existing records
UPDATE [DW].[dbo].[Dim_Part]
SET
	 [PartName] = b.[PartName]
	,[PartTypeId] = b.[PartTypeId]
	,[PartTypeName] = b.[PartTypeName]
	,[PartSupplierId] = b.[PartSupplierId]
	,[PartSupplierName] = b.[PartSupplierName]
	,[PartSupplierCategoryId] = b.[PartSupplierCategoryId]
	,[PartSupplierCategoryName] = b.[PartSupplierCategoryName]
FROM 
	[DW].[dbo].[Dim_Part] AS a
	INNER JOIN #Temp_SCDUpdates AS b 
		ON a.PartKeyId = b.PartKeyId  -- Join on Business Keys or Surrogate Key
 
 
DROP TABLE #Temp_SCDInserts
DROP TABLE #Temp_SCDUpdates

The results? Our 15 minute SSIS package has been replaced with a few lines of T-SQL and it now runs in less than 90 seconds. Nice.

Comments (3) Trackbacks (0)
  1. I have used EXCEPT a couple of times in Dim ETL’s. It breaks down when your source is on a different machine, or not even a table (say, an excel file, or text file, or another db system) – yeah, you can bring it forward one step with SQL to a staging, but that adds time depending on how much data, etc.

  2. What about dont test changed columns (dim.PartTypeId b.PartTypeId OR dim.PartTypeName b.PartTypeName …) and just update all columns?

    I thing it could be more quickly and less code

    • I see your point – similar to when you run a MERGE and just update on WHEN MATCHED without checking for changed columns, it fires updates against every single row. The original reason i was coded like that was from a pipeline perspective – not moving/touching data you dont’ need to move/touch.

      Knowing if you actually had to update rows or if nothing changed came in to play in a later implementation where we run a ProcessAdd against the dimension if all we had were inserts, no updates, only ProcessUpdate if we had Updates, and do absolutely nothing to the dimension if we had no Inserts or Updates.

      This can also be accomplished using MERGE if you constrain the WHEN MATCHED block to only rows that did have changes and output inserted.* and $action to a temp table (I’ll blog about that later on).

      Though, I could easily be convinced to just update blindly if someone could prove it was faster 😉


Leave a comment

No trackbacks yet.