tim laqua dot com Thoughts and Code from Tim Laqua

8Jun/110

Revisiting Embedded Text Qualifiers and the SSIS Flat File Connection Manager

200px-Gnome-face-angry

To quickly summarize the problem at hand, the Flat File Connection Manager doesn't support embedded Text Qualifiers. The reason this is a big issue is because the most commonly used text qualifier is a double quote (") and that's also a commonly used punctuation mark. That means if you have a CSV to import and lines that looks like this:

"one","1","two","2","th""r""ee","3","asdf"
"""left","right""","mid""dle","w"",tf","w,""tf",",""omg","omg"","

You would expect to import

one 1 two 2 th"r"ee 3 asdf
"left right" mid"dle w",tf w,"tf ,"omg omg",

That's not what you get, you get an error. I've seen a few different approaches to working around this known issue:

What I'd like to explore in this post is using regular expressions to transform the entire file in to something that can be natively consumed by SSIS. There are two options for dealing with this:

  • Tokenizing the quotation marks found within the cells
  • Change the text qualifier

I chose to change the text qualifier because SSIS natively supports multiple character text qualifiers. This will allow me to pick something ridiculous that is incredibly unlikely to appear in the data naturally. Changing the text qualifier also means we won't have to add any extra transformations as we would if we tokenized the embeded quotation marks (to turn them back in to quotation marks).

Let's start out with the file we're importing so we have some text to test against (C:\SSISTextQualifier\Input.csv)

"one","1","two","2","th""r""ee","3","asdf"
"""left","right""","mid""dle","w"",tf","w,""tf",",""omg","omg"","

Now let's iron out the regular expression we're going to need.

We'll start out with grabbing a bunch of stuff between two double-quotes




Ok, we don't just want a bunch of stuff - we want the stuff between each set of quotation marks - we make the * quantifier lazy with a question mark




Better... but that th""r""ee is clearly wrong... we have to stop the final quotation mark from matching properly escaped quotation marks. We can do that with a negative lookahead:




That's different - notice that now properly escaped quotation marks aren't abnormally terminating the column match anymore, now they're just acting like normal characters. We have to make a special exception for escaped quotation marks. Those have to essentially count as a single character:




Perfect - Now we can replace the text qualifiers with something awesome like |~| since we have matched the column content safely

Now that we have our regex, we just have to decide how to execute.

Plan A: SSIS Script Task

Create a new package with a single script task in it

Now edit the script (C#) and add the following up by the existing using statements:

using System.IO;
using System.RegularExpressions;

Modify the Main method to look something like this (depending on your paths):

        public void Main()
        {
            StreamReader sr = new System.IO.StreamReader(@"C:\SSISTextQualifier\Input.csv");
            string allText = sr.ReadToEnd();
            sr.Close();
            StreamWriter sw = new StreamWriter(@"C:\SSISTextQualifier\InputCleaned.csv");
            sw.Write(
                    Regex.Replace(
                        allText, 
                        @"""((?:""""|.)*?)""(?!"")", 
                        @"|~|$1|~|"
                    ).Replace("\"\"", "\"")
            );
            sw.Close();
            Dts.TaskResult = (int)ScriptResults.Success;
        }

Then run that task and you will have a new file called C:\SSISTextQualifier\InputCleaned.csv that's text qualified by |~| and the escaped double quotes have been replaced with a single double quote.

Plan B: Powershell

Since we often call SSIS packages from the SQL Server Agent, we can easily add a Powershell step right before the SSIS package to copy and cleanse our file. If that's the route you choose, your powershell step will look something like this:

gc C:\SSISTextQualifier\Input.csv | 
% {$_ -replace "`"((?:`"`"|.)*?)`"(?!`")", "|~|`$1|~|" -replace "`"`"", "`""} | 
sc C:\SSISTextQualifier\InputCleaned2.csv

Final Steps

However you got your data in to a new file (C:\SSISTextQualifier\InputCleaned.csv in my case) with a new text qualifier, you still have to then import the file. This is done using our old friend the Flat File Connection Manager with the source being the newly minted clean file and with our new text qualifier set as |~|

And that's pretty much it:

Conclusion

Personally, I think the Powershell plan is the cleanest and easiest to implement. The code isn't hidden away in a script task, and it looks cool. Doing it this way allows us to kind of abstract this "file fixing" from the SSIS data flows and let them do what they do best.

Comments (0) Trackbacks (0)

No comments yet.


Leave a comment

(required)

No trackbacks yet.