Revisiting Embedded Text Qualifiers and the SSIS Flat File Connection Manager

To quickly summarize the problem at hand, the Flat File Connection Manager doesn't support embedded Text Qualifiers. The reason this is a big issue is because the most commonly used text qualifier is a double quote (") and that's also a commonly used punctuation mark. That means if you have a CSV to import and lines that looks like this:
"one","1","two","2","th""r""ee","3","asdf"
"""left","right""","mid""dle","w"",tf","w,""tf",",""omg","omg"","
You would expect to import
| one | 1 | two | 2 | th"r"ee | 3 | asdf |
| "left | right" | mid"dle | w",tf | w,"tf | ,"omg | omg", |
That's not what you get, you get an error. I've seen a few different approaches to working around this known issue:
- Replacing embedded text qualifiers with some other token - " in this case
- Import each file as a single column and parse the row in a script task, handling error rows
What I'd like to explore in this post is using regular expressions to transform the entire file in to something that can be natively consumed by SSIS. There are two options for dealing with this:
- Tokenizing the quotation marks found within the cells
- Change the text qualifier
I chose to change the text qualifier because SSIS natively supports multiple character text qualifiers. This will allow me to pick something ridiculous that is incredibly unlikely to appear in the data naturally. Changing the text qualifier also means we won't have to add any extra transformations as we would if we tokenized the embeded quotation marks (to turn them back in to quotation marks).
Let's start out with the file we're importing so we have some text to test against (C:\SSISTextQualifier\Input.csv)
"one","1","two","2","th""r""ee","3","asdf" """left","right""","mid""dle","w"",tf","w,""tf",",""omg","omg"","
Now let's iron out the regular expression we're going to need.
We'll start out with grabbing a bunch of stuff between two double-quotes

Ok, we don't just want a bunch of stuff - we want the stuff between each set of quotation marks - we make the * quantifier lazy with a question mark

Better... but that th""r""ee is clearly wrong... we have to stop the final quotation mark from matching properly escaped quotation marks. We can do that with a negative lookahead:

That's different - notice that now properly escaped quotation marks aren't abnormally terminating the column match anymore, now they're just acting like normal characters. We have to make a special exception for escaped quotation marks. Those have to essentially count as a single character:

Perfect - Now we can replace the text qualifiers with something awesome like |~| since we have matched the column content safely

Now that we have our regex, we just have to decide how to execute.
Plan A: SSIS Script Task
Create a new package with a single script task in it

Now edit the script (C#) and add the following up by the existing using statements:
using System.IO; using System.RegularExpressions;
Modify the Main method to look something like this (depending on your paths):
public void Main() { StreamReader sr = new System.IO.StreamReader(@"C:\SSISTextQualifier\Input.csv"); string allText = sr.ReadToEnd(); sr.Close(); StreamWriter sw = new StreamWriter(@"C:\SSISTextQualifier\InputCleaned.csv"); sw.Write( Regex.Replace( allText, @"""((?:""""|.)*?)""(?!"")", @"|~|$1|~|" ).Replace("\"\"", "\"") ); sw.Close(); Dts.TaskResult = (int)ScriptResults.Success; }
Then run that task and you will have a new file called C:\SSISTextQualifier\InputCleaned.csv that's text qualified by |~| and the escaped double quotes have been replaced with a single double quote.
Plan B: Powershell
Since we often call SSIS packages from the SQL Server Agent, we can easily add a Powershell step right before the SSIS package to copy and cleanse our file. If that's the route you choose, your powershell step will look something like this:
gc C:\SSISTextQualifier\Input.csv | % {$_ -replace "`"((?:`"`"|.)*?)`"(?!`")", "|~|`$1|~|" -replace "`"`"", "`""} | sc C:\SSISTextQualifier\InputCleaned2.csv
Final Steps
However you got your data in to a new file (C:\SSISTextQualifier\InputCleaned.csv in my case) with a new text qualifier, you still have to then import the file. This is done using our old friend the Flat File Connection Manager with the source being the newly minted clean file and with our new text qualifier set as |~|
Conclusion
Personally, I think the Powershell plan is the cleanest and easiest to implement. The code isn't hidden away in a script task, and it looks cool. Doing it this way allows us to kind of abstract this "file fixing" from the SSIS data flows and let them do what they do best.






September 20th, 2012 - 13:54
Hi,
I have a csv which has double quotes as text qualifiers and there is a value in it like ,”WESLEY “NIM” /EMILY”, Technically the output should be |~|wesley nim /emily|~|. When I used the above logic I ended up with |~|wesley |~|nim|~| /emily|~|. Any pointers on how the regular expression should be modified to handle this case?
I am a newbie in regular expressions. Any help is greatly appreciated.
Thanks!
September 20th, 2012 - 14:37
Yeah, you need to fix your export side. ,”WESLEY “NIM” /EMILY” isn’t valid anywhere, those inner quotes have to be escaped in some way. It should probably be “WESLEY “”NIM”" /EMILY”
If you can’t fix your export side, let me know and I can get you close w/ the bad export – the fundamental problem is you end up having to make assumptions and all the suddon, commas inside of text fields can be and issue.
September 20th, 2012 - 14:42
I can’t fix the export side. There are 3 or 4 rows like this within each file. I was hoping we could tweak the expression to get the expected output.
Thanks for the quick repsonse.
September 20th, 2012 - 15:16
Try this: (?<=,|^)\s*”((?:”"|.)*?)”(?!”)\s*(?=,|$)
make sure to replace the quotes with normal quotes, i think my theme turns them in to pretty quotes.
September 22nd, 2012 - 11:33
Please tell me if I am even doing this right. I took your expression and put it in the below expression in the script task. I am obviously doing something wrong here and I am getting an error. Please guide me.
sw.Write(
Regex.Replace(
allText,
@”((?<=,|^)\s*"((?:""|.)*?)"(?!")\s*(?=,|$)",
@"|~|$1|~|"
).Replace("\"\"", "\"")
);
October 16th, 2012 - 05:14
What error are you getting?
October 16th, 2012 - 08:13
Appreciate your reply. But it’s been nearly a month ago and I don’t remember the exact error. But I got some red underlining for the Regex. I used the logic from this website and it worked for me.
http://www.ideaexcursion.com/2008/11/12/handling-embedded-text-qualifiers/
October 16th, 2012 - 09:54
glad you got it sorted out. “red underlines” over the regex means a you probably needed to add an @ symbol in front of the string to define it as a literal OR escape the backslashes.
October 16th, 2012 - 10:32
Thanks! Will try it with @ when I find some time.
November 14th, 2012 - 10:55
Is there any one know how to handle unicode? for
Regex.Replace(
allText,
@”"”((?:”"”"|.)*?)”"(?!”")”,
@”|~|$1|~|”
).Replace(“\”\”", “\”")
November 16th, 2012 - 18:12
By changing below can handle unicode.
StreamReader sr = new System.IO.StreamReader(@”C:\SSISTextQualifier\Input.csv”, false, System.Text.Encoding.Unicode);
This article is very useful and helpful.
Thanks….
November 20th, 2012 - 16:11
Appreciate the alternative solution, but the regex fails to account for quotations used as escape characters for literal commas. Can the regex be modified to also account for data such as “”,” … where the text is qualified by quotes, and the inner-quote is used as an escape character? The parsed data should just be a single comma.
November 20th, 2012 - 16:23
That doesn’t make any sense, you don’t escape commas enclosed in text qualifiers. That’s why you use text qualifiers in the first place.
December 29th, 2012 - 08:46
Hi Tim!
I ran into the same problem and some googleing brought me here to your very nice article. Good work, especially the Regular Expression part is impressive, but I wonder if the problem could be solved a little bit easier. I just implemented the following solution: In the Flat-File Source I omitted the Text-Qualifier so that the texts include the qualifiers (e.g. “Hel”"lo!”). Next I used a derived column to cut of the first and the last Character (that is Text-Qualifiers) by using this expression: SUBSTRING ([Column1]),2,LEN([Column1])-2). Works like a charm. Nevertheless you have written a great article
.
Have a nice day
Frank
January 26th, 2013 - 07:59
So in that scenario, are you saying you created the file? And if you’re omitting text qualifiers – how do you handle commas in the middle of your free form text fields?
January 23rd, 2013 - 14:50
I’m trying to clean a file using your RegEx but it doesn’t grab the embeded double quotes.
I have a file that may have.
“abc”, ” I work on all “Window platforms” since 2000, “def”
I need to have it
come out
|~|abc|~|, |~|I work on all “Window platforms” since 2000|~|, |~|def|~|
thanks
January 23rd, 2013 - 21:40
Looks like my text field has return (user hit the enter key on text box) and the RegEx example doesn’t handle that.
January 26th, 2013 - 07:43
So first of all, i think you meant to have a double quote after 2000 in your first example – otherwise it makes zero sense.
I thought about it a little bit – i guess my first reaction is that’s simply not a valid file, embedded text qualifiers must be *somehow* escaped. Now i also get that sometimes you didn’t make the file, someone else did. In that case, you could try escaping those embedded qualifiers. If that were the plan, you would want to escape any quotation marks that don’t have a comma to the left or right of them and another quote (sort of – there are cases where this would fail, but you can pry get close enough).
Before you replace the text qualifiers will fix the particular example you have there – but it will fail in scenarios where something is embedded that fits the pattern like:
What on earth could you do with that? Basically nothing, that’s exactly why the example in general is invalid – because impossible scenarios like this could, technically, occur. Anyhoo, hope you can get away with the example there!
March 31st, 2013 - 12:11
I have comma bettwen quotes and that separate in another field, Here is contents of my test file
field1,field2,field3,field4,field5,field6
n1,n2,n3,stringWithout Quotes,”String in Quotes with “”words”" and coma , in the string”,some more
n1,n2,n3,”stringWith Quotes”,”String in Quotes with coma , in thestring”,some more
I tried your regex but its splitting “String in Quotes with “”words”" and coma , in the string” this value in two different column String in Quotes with “words” and coma AND in the string. Can you please help?
April 30th, 2013 - 11:18
Thank you very much for the post. It is really helpful and working on my stuff very well. But I have one question is some of our data have Spanish letters, let’s say the original one is “Batistão”, after converting text qualifier to |~|, the data turns to “Batist�o”. Anyone has any idea what I should put into the script?
Thank you!
April 30th, 2013 - 11:20
That means you’re converting it to VARCHAR/ASCII somewhere along the way. Make sure that you always treat that text as Unicode/NVARCHAR.
May 2nd, 2013 - 09:26
Hello Tim,
Thank you so much for your quick response.
I modified the script a little bit to solve the Spanish letters issue.
StreamReader sr = new System.IO.StreamReader(@FilePath,System.Text.Encoding.GetEncoding(1252));
StreamWriter sw = new StreamWriter(@CleanedFilePath, false, System.Text.Encoding.GetEncoding(1252));
Anyway, thank you for your help!
Jing