February « 2012 « tim laqua dot com

tim laqua dot com Thoughts and Code from Tim Laqua


Slowly Changing Dimensions with MD5 Hashes in SSIS

We recently moved away from the 3rd party Checksum component (and all 3rd party components) in SSIS and I wanted to share the pattern we settled on for maintaining our Type 1 Slowly Changing Dimensions (SCDs). There are two things we wanted to address with our new pattern. First, our previous implementation wasn't performing as well as we needed it to or generating reliable checksums. The second was that we wanted to get away from dependencies on custom assemblies in general. To illustrate the pattern, we're going to build a SCD package off the Adventure Works DW DimCustomer table and skip over the actual source of the business keys and attributes by selecting directly from the completed dimension for now.

First, we assume that our dimension already exists (and we were using some other checksum or MERGE to maintain it). We have to add a column to store the MD5 hash:

ALTER TABLE dbo.DimCustomer ADD

Second, we need a staging table to store updated/changed rows. Script out the current dimension as a CREATE, remove all unneeded constraints and indexes, and create a staging table as a heap:

CREATE TABLE [dbo].[Staging_DimCustomer_UpdatedRows](
	[CustomerKey] [int] NOT NULL,
	[GeographyKey] [int] NULL,
	[CustomerAlternateKey] [nvarchar](15) NOT NULL,
	[Title] [nvarchar](8) NULL,
	[FirstName] [nvarchar](50) NULL,
	[MiddleName] [nvarchar](50) NULL,
	[LastName] [nvarchar](50) NULL,
	[NameStyle] [bit] NULL,
	[BirthDate] [datetime] NULL,
	[MaritalStatus] [nchar](1) NULL,
	[Suffix] [nvarchar](10) NULL,
	[Gender] [nvarchar](1) NULL,
	[EmailAddress] [nvarchar](50) NULL,
	[YearlyIncome] [money] NULL,
	[TotalChildren] [tinyint] NULL,
	[NumberChildrenAtHome] [tinyint] NULL,
	[EnglishEducation] [nvarchar](40) NULL,
	[SpanishEducation] [nvarchar](40) NULL,
	[FrenchEducation] [nvarchar](40) NULL,
	[EnglishOccupation] [nvarchar](100) NULL,
	[SpanishOccupation] [nvarchar](100) NULL,
	[FrenchOccupation] [nvarchar](100) NULL,
	[HouseOwnerFlag] [nchar](1) NULL,
	[NumberCarsOwned] [tinyint] NULL,
	[AddressLine1] [nvarchar](120) NULL,
	[AddressLine2] [nvarchar](120) NULL,
	[Phone] [nvarchar](20) NULL,
	[DateFirstPurchase] [datetime] NULL,
	[CommuteDistance] [nvarchar](15) NULL,
	[MD5] [varchar](34) NOT NULL)

Now in to SSIS - We will be building:

  1. Execute SQL Task to Truncate our Staging table(s)
  2. Data Flow Task to Insert new rows and Stage updated rows
  1. OLE DB Source to retrieve our source data
  2. Script Component to Generate Row Numbers
  3. Conditional Split to Evenly Distribute Rows
  4. Script Component to Generate MD5 Hashes
  5. Union All to Squish it all back together
  6. Lookup to get the existing MD5 Hash (if it exists)
  7. Conditional Split to separate Unchanged and Changed rows
  8. RowCount Transformation
  9. OLE DB Destination for Changed rows
  10. OLE DB Destination for New rows
  • Execute SQL Task to Update changed rows
  • Completed Control Flow

    Completed Data Flow


    SSAS Cache Isn’t Making Cents

    I stole the pun from my Business Analyst, Mr. John Seiler 😉 Now on to my issue - when SSAS caches the value for my [Actual] measure, it seems to do so based on the results of the first query that requests that coordinate. In this particular cube, there's bunches of tiny fractions and depending on how you slice it, it aggregates a little different. This is a fun problem in itself, but the part that drives me (and the Finance department) crazy is that if you go and slice on something OTHER than that first query that created the cache, the values they see don't always add up to the "Grand Total" in Excel - aka. "All"

    These are the queries used for this test:

    Query A

    	{Actual} ON 0
    FROM [GL] 

    Query B

    	{Actual} ON 0,
    	NON EMPTY Hierarchize({DrilldownLevel({[Account].[Accounts].[All]},,,INCLUDE_CALC_MEMBERS)}) ON 1
    FROM [GL] 

    Query C

    	{Actual} ON 0,
    	NON EMPTY Hierarchize({DrilldownLevel({[Date].[Fiscal].[All]},,,INCLUDE_CALC_MEMBERS)}) ON 1
    FROM [GL] 

    Results (Cache was cleared with ClearCache on the entire Database before each Series)

    Series 1 Series 2 Series 3

    So basically the Grand Total of this GL cube is a random number based on whoever sneaks the first query in after the cache is cleared (processing, sync, etc).

    And for all of you that think the MDX script breaks everything everywhere - I did comment out my entire MDX script before running these tests.


    Uncivil UNIONs

    If you asked any SQL Server Developer or DBA what UNION does, they would tell you it combines two result sets. Then if you were to ask them if they knew it removed duplicates without the ALL argument, they would say yes. That's academic, but I dare you to go look at their and your code and see how often UNION without the ALL argument is used. UNION is evil. Why do I say that? Because DISTINCT is evil and blocking operations are evil so it follows that UNION is evil.

    Would you be OK with the following query as the source for a fact table?

      SELECT DISTINCT [c1],[c2] FROM [t1]
      SELECT DISTINCT [c1],[c2] FROM [t2]
    ) a

    No? Then why are you OK with this?

    SELECT [c1],[c2] FROM [t1]
    SELECT [c1],[c2] FROM [t2]

    Personally, I tend to use UNION to combine data from two different sources. For example, if I wanted to combine the sales transactions from two JDE instances - the query might look like:

    SELECT [c1],[c2] FROM [JDE1].[dbo].[F42119]
    SELECT [c1],[c2] FROM [JDE2].[dbo].[F42119]

    Each line here in a sales detail represents a line on an order, so assuming you selected all the PK columns in your query, a UNION won't remove any rows within either set, but when combined - it's possible the same PK could exist in both systems. Now you could split hairs and say it's highly unlikely the rest of the row would be identical, but that's not the point - it's wrong to ever DISTINCT a set of transactions intended for creating a fact table. Sure, in this particular case you could add a "SourceSystem" column that would stop that, but then you still have to accept that the UNION is blocking and expensive. Just because a DISTINCT doesn't modify results doesn't mean it didn't do any work - it actually did a great deal of work. Now consider that we're usually not selecting 2 columns, we're selecting 10-20. Do you want to wait for it to figure out all the rows are distinct when you already knew that? Further, if they're not distinct you probably screwed up a JOIN somewhere.

    As far as coding standards regarding UNION go, I would suggest never using UNION without the ALL clause and doing your DISTINCT elsewhere if that's actually what you intended. Does that seem silly? Yes - but I would wager you won't ever use the 2nd one outside of building Dimensions.

    SELECT [c1],[c2] FROM [JDE1].[dbo].[F42119]
    SELECT [c1],[c2] FROM [JDE2].[dbo].[F42119]
      SELECT [c1],[c2] FROM [JDE1].[dbo].[F42119]
      SELECT [c1],[c2] FROM [JDE2].[dbo].[F42119]

    And don't get me started on unioning subqueries with GROUP BY's - Why don't you just throw half your processers in the garbage, light tempdb on fire and put all your data in heaps.

    Tagged as: , 3 Comments