tim laqua dot com Thoughts and Code from Tim Laqua

8Jun/1128

Revisiting Embedded Text Qualifiers and the SSIS Flat File Connection Manager

200px-Gnome-face-angry

To quickly summarize the problem at hand, the Flat File Connection Manager doesn't support embedded Text Qualifiers. The reason this is a big issue is because the most commonly used text qualifier is a double quote (") and that's also a commonly used punctuation mark. That means if you have a CSV to import and lines that looks like this:

"one","1","two","2","th""r""ee","3","asdf"
"""left","right""","mid""dle","w"",tf","w,""tf",",""omg","omg"","

You would expect to import

one 1 two 2 th"r"ee 3 asdf
"left right" mid"dle w",tf w,"tf ,"omg omg",

That's not what you get, you get an error. I've seen a few different approaches to working around this known issue:

What I'd like to explore in this post is using regular expressions to transform the entire file in to something that can be natively consumed by SSIS. There are two options for dealing with this:

  • Tokenizing the quotation marks found within the cells
  • Change the text qualifier

I chose to change the text qualifier because SSIS natively supports multiple character text qualifiers. This will allow me to pick something ridiculous that is incredibly unlikely to appear in the data naturally. Changing the text qualifier also means we won't have to add any extra transformations as we would if we tokenized the embeded quotation marks (to turn them back in to quotation marks).

Let's start out with the file we're importing so we have some text to test against (C:\SSISTextQualifier\Input.csv)

"one","1","two","2","th""r""ee","3","asdf"
"""left","right""","mid""dle","w"",tf","w,""tf",",""omg","omg"","

Now let's iron out the regular expression we're going to need.

We'll start out with grabbing a bunch of stuff between two double-quotes




Ok, we don't just want a bunch of stuff - we want the stuff between each set of quotation marks - we make the * quantifier lazy with a question mark




Better... but that th""r""ee is clearly wrong... we have to stop the final quotation mark from matching properly escaped quotation marks. We can do that with a negative lookahead:




That's different - notice that now properly escaped quotation marks aren't abnormally terminating the column match anymore, now they're just acting like normal characters. We have to make a special exception for escaped quotation marks. Those have to essentially count as a single character:




Perfect - Now we can replace the text qualifiers with something awesome like |~| since we have matched the column content safely

Now that we have our regex, we just have to decide how to execute.

Plan A: SSIS Script Task

Create a new package with a single script task in it

Now edit the script (C#) and add the following up by the existing using statements:

using System.IO;
using System.Text.RegularExpressions;

Modify the Main method to look something like this (depending on your paths):

        public void Main()
        {
            StreamReader sr = new System.IO.StreamReader(@"C:\SSISTextQualifier\Input.csv");
            string allText = sr.ReadToEnd();
            sr.Close();
            StreamWriter sw = new StreamWriter(@"C:\SSISTextQualifier\InputCleaned.csv");
            sw.Write(
                    Regex.Replace(
                        allText, 
                        @"""((?:""""|.)*?)""(?!"")", 
                        @"|~|$1|~|"
                    ).Replace("\"\"", "\"")
            );
            sw.Close();
            Dts.TaskResult = (int)ScriptResults.Success;
        }

Then run that task and you will have a new file called C:\SSISTextQualifier\InputCleaned.csv that's text qualified by |~| and the escaped double quotes have been replaced with a single double quote.

Plan B: Powershell

Since we often call SSIS packages from the SQL Server Agent, we can easily add a Powershell step right before the SSIS package to copy and cleanse our file. If that's the route you choose, your powershell step will look something like this:

gc C:\SSISTextQualifier\Input.csv | 
% {$_ -replace "`"((?:`"`"|.)*?)`"(?!`")", "|~|`$1|~|" -replace "`"`"", "`""} | 
sc C:\SSISTextQualifier\InputCleaned2.csv

Final Steps

However you got your data in to a new file (C:\SSISTextQualifier\InputCleaned.csv in my case) with a new text qualifier, you still have to then import the file. This is done using our old friend the Flat File Connection Manager with the source being the newly minted clean file and with our new text qualifier set as |~|

And that's pretty much it:

Conclusion

Personally, I think the Powershell plan is the cleanest and easiest to implement. The code isn't hidden away in a script task, and it looks cool. Doing it this way allows us to kind of abstract this "file fixing" from the SSIS data flows and let them do what they do best.

Comments (28) Trackbacks (0)
  1. Hi,

    I have a csv which has double quotes as text qualifiers and there is a value in it like ,”WESLEY “NIM” /EMILY”, Technically the output should be |~|wesley nim /emily|~|. When I used the above logic I ended up with |~|wesley |~|nim|~| /emily|~|. Any pointers on how the regular expression should be modified to handle this case?

    I am a newbie in regular expressions. Any help is greatly appreciated.

    Thanks!

    • Yeah, you need to fix your export side. ,”WESLEY “NIM” /EMILY” isn’t valid anywhere, those inner quotes have to be escaped in some way. It should probably be “WESLEY “”NIM”" /EMILY”

      If you can’t fix your export side, let me know and I can get you close w/ the bad export – the fundamental problem is you end up having to make assumptions and all the suddon, commas inside of text fields can be and issue.

      • I can’t fix the export side. There are 3 or 4 rows like this within each file. I was hoping we could tweak the expression to get the expected output.

        Thanks for the quick repsonse.

        • Try this: (?<=,|^)\s*”((?:”"|.)*?)”(?!”)\s*(?=,|$)

          make sure to replace the quotes with normal quotes, i think my theme turns them in to pretty quotes.

  2. Please tell me if I am even doing this right. I took your expression and put it in the below expression in the script task. I am obviously doing something wrong here and I am getting an error. Please guide me.

    sw.Write(
    Regex.Replace(
    allText,
    @”((?<=,|^)\s*"((?:""|.)*?)"(?!")\s*(?=,|$)",
    @"|~|$1|~|"
    ).Replace("\"\"", "\"")
    );

  3. Is there any one know how to handle unicode? for
    Regex.Replace(
    allText,
    @”"”((?:”"”"|.)*?)”"(?!”")”,
    @”|~|$1|~|”
    ).Replace(“\”\”", “\”")

  4. By changing below can handle unicode.

    StreamReader sr = new System.IO.StreamReader(@”C:\SSISTextQualifier\Input.csv”, false, System.Text.Encoding.Unicode);

    This article is very useful and helpful.

    Thanks….

  5. Appreciate the alternative solution, but the regex fails to account for quotations used as escape characters for literal commas. Can the regex be modified to also account for data such as “”,” … where the text is qualified by quotes, and the inner-quote is used as an escape character? The parsed data should just be a single comma.

  6. Hi Tim!

    I ran into the same problem and some googleing brought me here to your very nice article. Good work, especially the Regular Expression part is impressive, but I wonder if the problem could be solved a little bit easier. I just implemented the following solution: In the Flat-File Source I omitted the Text-Qualifier so that the texts include the qualifiers (e.g. “Hel”"lo!”). Next I used a derived column to cut of the first and the last Character (that is Text-Qualifiers) by using this expression: SUBSTRING ([Column1]),2,LEN([Column1])-2). Works like a charm. Nevertheless you have written a great article :-).

    Have a nice day

    Frank

    • So in that scenario, are you saying you created the file? And if you’re omitting text qualifiers – how do you handle commas in the middle of your free form text fields?

  7. I’m trying to clean a file using your RegEx but it doesn’t grab the embeded double quotes.

    I have a file that may have.

    “abc”, ” I work on all “Window platforms” since 2000, “def”

    I need to have it
    come out
    |~|abc|~|, |~|I work on all “Window platforms” since 2000|~|, |~|def|~|

    thanks

    • Looks like my text field has return (user hit the enter key on text box) and the RegEx example doesn’t handle that.

      • So first of all, i think you meant to have a double quote after 2000 in your first example – otherwise it makes zero sense.

        I thought about it a little bit – i guess my first reaction is that’s simply not a valid file, embedded text qualifiers must be *somehow* escaped. Now i also get that sometimes you didn’t make the file, someone else did. In that case, you could try escaping those embedded qualifiers. If that were the plan, you would want to escape any quotation marks that don’t have a comma to the left or right of them and another quote (sort of – there are cases where this would fail, but you can pry get close enough).

        Replacing (?<!(?:(?:"\s*,)|^)\s*)"(?!\s*(?:(?:,\s*")|$))
        With ""

        Before you replace the text qualifiers will fix the particular example you have there – but it will fail in scenarios where something is embedded that fits the pattern like:

        "asdf", "this "," is confusing", "def"

        What on earth could you do with that? Basically nothing, that’s exactly why the example in general is invalid – because impossible scenarios like this could, technically, occur. Anyhoo, hope you can get away with the example there!

  8. I have comma bettwen quotes and that separate in another field, Here is contents of my test file

    field1,field2,field3,field4,field5,field6
    n1,n2,n3,stringWithout Quotes,”String in Quotes with “”words”" and coma , in the string”,some more
    n1,n2,n3,”stringWith Quotes”,”String in Quotes with coma , in thestring”,some more

    I tried your regex but its splitting “String in Quotes with “”words”" and coma , in the string” this value in two different column String in Quotes with “words” and coma AND in the string. Can you please help?

  9. Thank you very much for the post. It is really helpful and working on my stuff very well. But I have one question is some of our data have Spanish letters, let’s say the original one is “Batistão”, after converting text qualifier to |~|, the data turns to “Batist�o”. Anyone has any idea what I should put into the script?

    Thank you!

    • That means you’re converting it to VARCHAR/ASCII somewhere along the way. Make sure that you always treat that text as Unicode/NVARCHAR.

      • Hello Tim,

        Thank you so much for your quick response.

        I modified the script a little bit to solve the Spanish letters issue.

        StreamReader sr = new System.IO.StreamReader(@FilePath,System.Text.Encoding.GetEncoding(1252));

        StreamWriter sw = new StreamWriter(@CleanedFilePath, false, System.Text.Encoding.GetEncoding(1252));

        Anyway, thank you for your help!

        Jing

  10. Hi Jing,

    I am getting below error while running the script with more than 100 KB file.
    Error: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. —> System.OutOfMemoryException: Exception of type ‘System.OutOfMemoryException’ was thrown.
    at System.String.GetStringForStringBuilder(String value, Int32 startIndex, Int32 length, Int32 capacity)
    at System.Text.StringBuilder.GetNewString(String currentString, Int32 requiredLength)
    at System.Text.StringBuilder.Append(String value)
    at System.Text.RegularExpressions.RegexReplacement.ReplacementImpl(StringBuilder sb, Match match)
    at System.Text.RegularExpressions.RegexReplacement.Replace(Regex regex, String input, Int32 count, Int32 startat)
    at System.Text.RegularExpressions.Regex.Replace(String input, String replacement, Int32 count, Int32 startat)
    at System.Text.RegularExpressions.Regex.Replace(String input, String replacement)
    at System.Text.RegularExpressions.Regex.Replace(String input, String pattern, String replacement)
    at ST_2dca26d3ee97406cbefd582988607412.csproj.ScriptMain.Main()
    — End of inner exception stack trace —
    at System.RuntimeMethodHandle._InvokeMethodFast(Object target, Object[] arguments, SignatureStruct& sig, MethodAttributes methodAttributes, RuntimeTypeHandle typeOwner)
    at System.RuntimeMethodHandle.InvokeMethodFast(Object target, Object[] arguments, Signature sig, MethodAttributes methodAttributes, RuntimeTypeHandle typeOwner)
    at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture, Boolean skipVisibilityChecks)
    at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
    at System.RuntimeType.InvokeMember(String name, BindingFlags bindingFlags, Binder binder, Object target, Object[] providedArgs, ParameterModifier[] modifiers, CultureInfo culture, String[] namedParams)
    at System.Type.InvokeMember(String name, BindingFlags invokeAttr, Binder binder, Object target, Object[] args, CultureInfo culture)
    at Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTATaskScriptingEngine.ExecuteScript()

    Is there any way to handle this? Please help me

  11. Thank you! This solution worked very well for me. Actually my data set was even more complicated and required handling of SHIFT-JIS encoding on top of this. Furthermore, the data also had newlines in the text data.

    This regex did the trick for multi-line data

    “((?:”"|\n|.)*?)”(?!”)

    e.g.,

    “one”,”1″,”two”,”2″,”th”"r”"e
    e”,”3″,”asdf”
    “”"left”,”right”"”,”mid”"dle”,”w”",tf”,”w,”"tf”,”,”"omg”,”omg”",”

  12. Could not get the Regex to work for the following:

    “002568″,”BUGS BUNNY”,”2552 “D” STREET”,”",”ATLANTA”,”GA”

  13. THIS ROCKS! Thankyou Thankyou Thankyou.

    My only edit, since we all say thank you and then critique the effort, is the using statement for RegEx. You used:
    using System.RegularExpressions;

    I had to change to
    using System.Text.RegularExpressions;

    Difference in .net versions? I’m using BIDS 2008R2 SP1

  14. This was VERY helpful. The regex script was just the ticket!


Leave a comment

No trackbacks yet.