tim laqua dot com Thoughts and Code from Tim Laqua

2Feb/123

Uncivil UNIONs

If you asked any SQL Server Developer or DBA what UNION does, they would tell you it combines two result sets. Then if you were to ask them if they knew it removed duplicates without the ALL argument, they would say yes. That's academic, but I dare you to go look at their and your code and see how often UNION without the ALL argument is used. UNION is evil. Why do I say that? Because DISTINCT is evil and blocking operations are evil so it follows that UNION is evil.

Would you be OK with the following query as the source for a fact table?

SELECT DISTINCT *
FROM
(
  SELECT DISTINCT [c1],[c2] FROM [t1]
  UNION ALL
  SELECT DISTINCT [c1],[c2] FROM [t2]
) a

No? Then why are you OK with this?

SELECT [c1],[c2] FROM [t1]
UNION
SELECT [c1],[c2] FROM [t2]

Personally, I tend to use UNION to combine data from two different sources. For example, if I wanted to combine the sales transactions from two JDE instances - the query might look like:

SELECT [c1],[c2] FROM [JDE1].[dbo].[F42119]
UNION
SELECT [c1],[c2] FROM [JDE2].[dbo].[F42119]

Each line here in a sales detail represents a line on an order, so assuming you selected all the PK columns in your query, a UNION won't remove any rows within either set, but when combined - it's possible the same PK could exist in both systems. Now you could split hairs and say it's highly unlikely the rest of the row would be identical, but that's not the point - it's wrong to ever DISTINCT a set of transactions intended for creating a fact table. Sure, in this particular case you could add a "SourceSystem" column that would stop that, but then you still have to accept that the UNION is blocking and expensive. Just because a DISTINCT doesn't modify results doesn't mean it didn't do any work - it actually did a great deal of work. Now consider that we're usually not selecting 2 columns, we're selecting 10-20. Do you want to wait for it to figure out all the rows are distinct when you already knew that? Further, if they're not distinct you probably screwed up a JOIN somewhere.

As far as coding standards regarding UNION go, I would suggest never using UNION without the ALL clause and doing your DISTINCT elsewhere if that's actually what you intended. Does that seem silly? Yes - but I would wager you won't ever use the 2nd one outside of building Dimensions.

SELECT [c1],[c2] FROM [JDE1].[dbo].[F42119]
UNION ALL
SELECT [c1],[c2] FROM [JDE2].[dbo].[F42119]
 
--OR
 
SELECT DISTINCT * FROM
(
  SELECT [c1],[c2] FROM [JDE1].[dbo].[F42119]
  UNION ALL
  SELECT [c1],[c2] FROM [JDE2].[dbo].[F42119]
)

And don't get me started on unioning subqueries with GROUP BY's - Why don't you just throw half your processers in the garbage, light tempdb on fire and put all your data in heaps.

Tagged as: , Leave a comment
Comments (3) Trackbacks (0)
  1. quote: “it’s wrong to ever DISTINCT a set of transactions intended for creating a fact table.”

    Oh, if only that were true 🙁

  2. Nice! Best part is: “Why don’t you just throw half your processers in the garbage, light tempdb on fire and put all your data in heaps.”


Leave a comment

No trackbacks yet.