Slowly Changing Dimensions with MD5 Hashes in SSIS

We recently moved away from the 3rd party Checksum component (and all 3rd party components) in SSIS and I wanted to share the pattern we settled on for maintaining our Type 1 Slowly Changing Dimensions (SCDs). There are two things we wanted to address with our new pattern. First, our previous implementation wasn't performing as well as we needed it to or generating reliable checksums. The second was that we wanted to get away from dependencies on custom assemblies in general. To illustrate the pattern, we're going to build a SCD package off the Adventure Works DW DimCustomer table and skip over the actual source of the business keys and attributes by selecting directly from the completed dimension for now.
First, we assume that our dimension already exists (and we were using some other checksum or MERGE to maintain it). We have to add a column to store the MD5 hash:
ALTER TABLE dbo.DimCustomer ADD MD5 VARCHAR(34) NOT NULL DEFAULT ''
Second, we need a staging table to store updated/changed rows. Script out the current dimension as a CREATE, remove all unneeded constraints and indexes, and create a staging table as a heap:
CREATE TABLE [dbo].[Staging_DimCustomer_UpdatedRows]( [CustomerKey] [INT] NOT NULL, [GeographyKey] [INT] NULL, [CustomerAlternateKey] [NVARCHAR](15) NOT NULL, [Title] [NVARCHAR](8) NULL, [FirstName] [NVARCHAR](50) NULL, [MiddleName] [NVARCHAR](50) NULL, [LastName] [NVARCHAR](50) NULL, [NameStyle] [BIT] NULL, [BirthDate] [DATETIME] NULL, [MaritalStatus] [NCHAR](1) NULL, [Suffix] [NVARCHAR](10) NULL, [Gender] [NVARCHAR](1) NULL, [EmailAddress] [NVARCHAR](50) NULL, [YearlyIncome] [MONEY] NULL, [TotalChildren] [TINYINT] NULL, [NumberChildrenAtHome] [TINYINT] NULL, [EnglishEducation] [NVARCHAR](40) NULL, [SpanishEducation] [NVARCHAR](40) NULL, [FrenchEducation] [NVARCHAR](40) NULL, [EnglishOccupation] [NVARCHAR](100) NULL, [SpanishOccupation] [NVARCHAR](100) NULL, [FrenchOccupation] [NVARCHAR](100) NULL, [HouseOwnerFlag] [NCHAR](1) NULL, [NumberCarsOwned] [TINYINT] NULL, [AddressLine1] [NVARCHAR](120) NULL, [AddressLine2] [NVARCHAR](120) NULL, [Phone] [NVARCHAR](20) NULL, [DateFirstPurchase] [DATETIME] NULL, [CommuteDistance] [NVARCHAR](15) NULL, [MD5] [VARCHAR](34) NOT NULL)
Now in to SSIS - We will be building:
- Execute SQL Task to Truncate our Staging table(s)
- Data Flow Task to Insert new rows and Stage updated rows
- OLE DB Source to retrieve our source data
- Script Component to Generate Row Numbers
- Conditional Split to Evenly Distribute Rows
- Script Component to Generate MD5 Hashes
- Union All to Squish it all back together
- Lookup to get the existing MD5 Hash (if it exists)
- Conditional Split to separate Unchanged and Changed rows
- RowCount Transformation
- OLE DB Destination for Changed rows
- OLE DB Destination for New rows
Uncivil UNIONs
If you asked any SQL Server Developer or DBA what UNION does, they would tell you it combines two result sets. Then if you were to ask them if they knew it removed duplicates without the ALL argument, they would say yes. That's academic, but I dare you to go look at their and your code and see how often UNION without the ALL argument is used. UNION is evil. Why do I say that? Because DISTINCT is evil and blocking operations are evil so it follows that UNION is evil.
Would you be OK with the following query as the source for a fact table?
SELECT DISTINCT * FROM ( SELECT DISTINCT [c1],[c2] FROM [t1] UNION ALL SELECT DISTINCT [c1],[c2] FROM [t2] ) a
No? Then why are you OK with this?
SELECT [c1],[c2] FROM [t1] UNION SELECT [c1],[c2] FROM [t2]
Personally, I tend to use UNION to combine data from two different sources. For example, if I wanted to combine the sales transactions from two JDE instances - the query might look like:
SELECT [c1],[c2] FROM [JDE1].[dbo].[F42119] UNION SELECT [c1],[c2] FROM [JDE2].[dbo].[F42119]
Each line here in a sales detail represents a line on an order, so assuming you selected all the PK columns in your query, a UNION won't remove any rows within either set, but when combined - it's possible the same PK could exist in both systems. Now you could split hairs and say it's highly unlikely the rest of the row would be identical, but that's not the point - it's wrong to ever DISTINCT a set of transactions intended for creating a fact table. Sure, in this particular case you could add a "SourceSystem" column that would stop that, but then you still have to accept that the UNION is blocking and expensive. Just because a DISTINCT doesn't modify results doesn't mean it didn't do any work - it actually did a great deal of work. Now consider that we're usually not selecting 2 columns, we're selecting 10-20. Do you want to wait for it to figure out all the rows are distinct when you already knew that? Further, if they're not distinct you probably screwed up a JOIN somewhere.
As far as coding standards regarding UNION go, I would suggest never using UNION without the ALL clause and doing your DISTINCT elsewhere if that's actually what you intended. Does that seem silly? Yes - but I would wager you won't ever use the 2nd one outside of building Dimensions.
SELECT [c1],[c2] FROM [JDE1].[dbo].[F42119] UNION ALL SELECT [c1],[c2] FROM [JDE2].[dbo].[F42119] --OR SELECT DISTINCT * FROM ( SELECT [c1],[c2] FROM [JDE1].[dbo].[F42119] UNION ALL SELECT [c1],[c2] FROM [JDE2].[dbo].[F42119] )
And don't get me started on unioning subqueries with GROUP BY's - Why don't you just throw half your processers in the garbage, light tempdb on fire and put all your data in heaps.
Pattern for Conditionally Sending SSRS Subscriptions

This is a pretty common request - send the report if there's an issue, don't send it if there's not an issue. This is a simple pattern for doing that using Data Driven Subscriptions. Here, we're not using the data driven subscription to dynamically populate recipient and parameter information, we're just using it to suppress the sending of the report if the report has no value.
To do this, we only need one thing - a query that returns one row if the report should be sent and zero rows if the report should not be sent. Here is an example - Let's say I want to send a report out if a JDE Address Book record with an AT1 of 'DC' has been updated after a certain julian date.
First, write a query to return rows meeting that criteria:
Troubleshooting: 1998 Jeep Grand Cherokee (ZJ) Doesn’t Start – Cranks, No Fuel, No Spark

This has been one of the most frustrating repairs we have come across. One day the Jeep just plain refused to start anymore. We have had prolonged starting times for a while now where we will let fuel pump prime at least twice before it would start, but it would always start. Nothing really precluded the no start condition, we just turned it off one day and it didn't want to start again. The symptoms don't really point to any particular component (other than the PCM) and without some fancy tools, you'll do exactly what we did - replace the PCM... and it won't fix anything. So we took out our fancy tool, looked up some diagnostic diagrams on alldatadiy.com, and got to work.
A few things to check for right away to see if you're having the same issue we were:
- Check to see if the fuel pump runs during cranking - ours didn't
- Check for spark during cranking - we didn't have any
- Check to see if your check engine light is on (ours WASN'T)
- Check your voltage and fuel gauges with the ignition on - ours didn't register anything
- Check to see if your low fuel light is on - our was
Analysis Services Writeback – Working with Weight Expressions
While writeback has been around for a while, it didn't really get easy to do until Excel 2010 was released because wrapping a UI around it was cumbersome. Now that we have a simple writeback UI via the What-If analysis dialogs in Excel, it's pry worth poking around at how to get things to allocate the way you want.
Here's our starting point in the Adventure Works cube (note, I commented out all the scope statements in the MDX script that distribute quarterly quotas/targets to months - in AW, quotas only exist in the data warehouse at the calendar quarter level):

VersionOne Nag: Reminding Us To Burn Our Points
You'll forget - we all do. After a hard day of work, the last thing that pops in to your head is "hey, I should go update this story in VersionOne before I go home." Oh no, the only thing you're thinking about is how bad traffic will be, what you're having for dinner, your trip to Seattle, which bar has the best drink specials tonight, etc. So accepting that we forget to burn points now and then, why don't we just make a script to remind us? While you're at it, might as well add a reminder for when you accidentally close a story without setting the ToDo points to zero.
Luckily, since you obviously use VersionOne (because it's the gold standard), there's an API for that:
Wouldn’t it be fun if Cubes could talk?
I didn't say "wouldn't it be useful" because after putting a test together, asking a cube questions with no context tends to return answers that it probably shouldn't have returned. In BI, it is incredibly important to understand what exactly it is you're asking for - if we just say we want "sales" and return an answer, nobody really knows what we meant by "sales." Sure, in various circles, "sales" means the same thing - but once you start talking to different areas, departments, etc - the meaning of the word starts to shift.
But I digress - asking cubes questions is still pretty fun and some of the random things it returns when you point it at your own cubes can be flat out hilarious.
Here's a few questions thrown at the Adventure Works cube in the Adventure Works DW 2008 Analysis Services database
Customer Service at the Golf Course (the customer is NOT always right)
In most businesses, the name of the game is giving the customer what they want when they want it. Putting the right opportunities in front of them at the right time. At the golf course (the elitist ones), the game changes quite a bit. Take the Pro Shop for example - everything is overpriced, at least 20% above what you could get it for at TGW or GolfGalaxy. Why? Because if you're in need of something before a round and you remember just as you're standing in the pro shop, you're certainly not going to hop in your car and head over to GolfGalaxy - you just buy it there, you're trapped more or less. Don't even bother trying to work with the pro shop guys on the price either, they could care less. Which brings me to my second point - the staff.
The Ellipsis … stop abusing it!
I have no idea how many times in the last week I've fell prey to the errant ellipsis. You see it in an instant message and then wait for the rest of the thought like a dog awaiting a promised treat. Alas, nothing follows. Do you ask for the rest of the thought? Do you wait and try to followup next time you see them in person? Or... Even worse... do they not know what an ellipsis means? For some reason more often than not lately I have been a victim of the "I just end sentences with multiple periods" syndrome. Stop it. You can not add extra periods to imply that you are super duper done with your statement. Sure, extra exclimation points are good fun - but extra periods like... mean something - they don't indicate extra done-ness.
Now, I'm certainly not a grammar specialist, but when I see an ellipsis, I get all excited waiting for the held back nugget of drama. Then when it doesn't come I am sad. So, so sad. So please. One period means "statement over," while more than one means "more awesome stuff to come."
Starting a Business = Divorce?
So I was trying to find some sort of "start a business" packet with all the required forms and such - so I naturally searched Google for "minnesota start a business packet"
And what is the 3rd ranked hit? Starting a divorce. Nice.








