RegexClean Transformation

by Darren Green 6 Sep 2014 16:03

Use the power of regular expressions to cleanse your data right there inside the Data Flow. This transformation includes a full user interface for simple configuration, as well as advanced features such as error output configuration.

Two regular expressions are used, a match expression and a replace expression. The transformation is designed around the named capture groups or match groups, and even supports multiple expressions. This allows for rich and complex expressions to be built, all through an easy to reuse transformation where a bespoke Script Component was previously the only alternative.

Some simple properties are available for each column selected –

Behaviour
The two behaviour modes offer similar functionality but with a difference. Replace, replaces tokens with the input, and Emit overwrites the whole string.

Cascade
Cascade allows you to define multiple expressions, each on a new line. The match expression will be processed into one operation per line, which are then processed in order at run-time. Multiple replace expressions can also be specified, again each on a new line. If there is no corresponding replace expression for a match expression line, then the last replace expression will be used instead. It is common to have multiple match expressions, but only a single replace expression.

Match Expression
The expression used to define the named capture groups. This is where you can analyse the data, and tag or name elements within it as found by the match expression.

Replace Expression
The replace determines the final output. It will reference the named groups from the match expression and assembles them into the final output.

If you want to use regular expressions to validate data then try the Regular Expression Transformation.

Quick Start Guide

  • Select a column. A new output column is created for each selected column; there is no option for in-place replacement of column values. One input column can be used to populate multiple output columns, just select the column again in the lower grid, using the Input Columns drop-down selector.
  • Amend the output column name and size as required. They default to the same as the input column selected.
  • Amend the behaviour as required, the default is Replace.
  • Amend the cascade option as required, the default is true.
  • Finally enter your match and replace regular expressions
  • Quick Sample #1

    Parse an email address and extract the user and domain portions. Format as a web address passing the user portion as a URL parameter. This uses two match groups, user and host, which correspond to the text before the @ and after it respectively.

    Behaviour is Emit, and cascade of false, we only have a single match expression.

    Match Expression ^(?<user>[^@]+)@(?<host>.+)$

    Replace Expression - http://www.${host}?user=${user}

    Results

    Sample Input Sample Output
    zheng0@adventure-works.com http://www.adventure-works.com?user=zheng0

    The component is provided as an MSI file, however for 2005/2008 you will have to add the transformation to the Visual Studio toolbox manually. Right-click the toolbox, and select Choose Items.... Select the SSIS Data Flow Items tab, and then check the RegexClean Transformation from the list.

    Downloads

    The RegexClean Transformation is available for SQL Server 2005, SQL Server 2008 (includes R2) and SQL Server 2012. Please choose the version to match your SQL Server version, or you can install multiple versions and use them side by side if you have more than one version of SQL Server installed.

    RegexClean Transformation for SQL Server 2005

    RegexClean Transformation for SQL Server 2008

    RegexClean Transformation for SQL Server 2012

    RegexClean Transformation for SQL Server 2014

    Version History

    SQL Server 2014

    Version 4.0.0.8 - SQL Server 2014 release. Includes upgrade support for both 2005, 2008 and 2012 packages to 2014.
    (6 Sep 2014)

    SQL Server 2012

    Version 3.0.0.7 - SQL Server 2012 release. Includes upgrade support for both 2005 and 2008 packages to 2012.
    (6 Jun 2012)

    SQL Server 2008

    Version 2.0.0.7 - SQL Server 2008 Release
    (20 Oct 2008)

    SQL Server 2005

    Version 1.0.0.105 - Public Release
    (28 Jan 2008)

    Screenshot

    RegexClean Transformation Editor dialog

    Comments (26) -

    1/22/2009 3:12:57 PM #

    Tommy Petersson

    A nice component, unfortunately it doesn't seem to be able to do what I want at the moment...

    I would like it to check a column from a Flat File Source to see if it contains a date. Since there's no ISDATE in SSIS I wanted to see if this was an OK workaround. I'm OK with regexp checking for yyyy/mm/dd, even if it's 1981/44/55 - if the system creating the flat file has a date to export it's correct. Otherwise it's either an empty string or some blanks. If there is no date in the column, It's possible to derive that info from another column - but RegexClean don't let you get data from another column...

    Tommy Petersson Sweden

    1/23/2009 11:27:29 AM #

    Allan Mitchell

    Tommy

    It sounds like you may just need to validate the column value in which case www.sqlis.com/.../...xpression-Transformation.aspx may be more appropriate.  if your validation fails then on the alternate output you can hand a Derived Column transform which will be used to derive the dae from the other columns you mention.

    Allan Mitchell Germany

    2/20/2009 2:32:29 AM #

    Tim

    Anyone tried doing a replace with an empty expression?

    I want to strip whitespace out of the input, and don't see any mechanism to specify empty place in RegEx, and the component won't accept an empty expression for a Replace.

    Tim

    2/20/2009 7:56:32 AM #

    Allan Mitchell

    I would probably use a Derived Column transform to replace my existing column.  There is a string REPLACE() function in there.

    Allan Mitchell United Kingdom

    2/24/2009 1:43:12 AM #

    Tim

    Cheers Allan, I had decided Replace was the way to go anyway.
    Another issue: Doing a Replace to replace following characters with a space that I then do the Dervied Replace on:
    [\+\-\(\)]

    The output column value is null though if none of those characters are in input - this doesn't seem right to me.

    Tim

    2/24/2009 1:44:34 AM #

    Tim

    My temp solution is to append a ' ' at end of column in the query..

    Tim

    3/26/2009 6:26:39 PM #

    Jason

    Hi, I am trying to remove all alphanumeric characters from the string using [^0-9] in Match section, but couldn't find the correct expression for the Replace section. Thanks,

    Jason United Kingdom

    5/1/2009 3:14:11 PM #

    Steven Molen

    Any way to access members within the group itself? i.e. if i have a string of xyz.abc.bab.wqrx and i have a match expression of ^(?<stringsonly>[^.]*), I want to access the different members separately i the replace syntax.

    Steven Molen United States

    6/25/2009 7:18:05 PM #

    Devo

    @Jason

    I did something similar and posted it out on MSDN. (social.msdn.microsoft.com/.../adcdc0f5-9191-4937-97a2-5a63c34fc685)  I didn't find an eloquent way to do it.

    Devo United States

    8/3/2009 9:34:47 PM #

    Derrick

    I have unstructured text in a nvarchar(max) column that I am trying to extract into sepearate fields. I need to search for specific strings e.g. "THC METABOLITE CONFIRMED BY GC/MS 21 NG/ML" and extract the numbers after the GC/MS. I've written a regex \bTHC\sMETABOLITE\sCONFIRMED\sBY\sGC/MS\b.(?<amt>\d{4})to search for my string. How do I get the RegEx Clean to replace the string with the 21. The RegEx transformation correctly identifies every occurence, however I can not get the RegEx Clean to work. Any suggestions would be appreciated

    Derrick

    8/6/2009 10:22:21 PM #

    Derrick

    Built a Script Destination Component to put text into the data stream and used the RegEx clean in emit mode with the cascade option to search 34 specific patterns and replaced them all. Strings are similar to the ones in my previous post (\bTHC\sMETABOLITE\sCONFIRMED\sBY\sGC/MS\b.(?<amt>\d{4}). The transform works well.

    Good job!

    Derrick

    9/25/2009 5:44:28 PM #

    Eric

    Tool has been very useful- only my [mis]understanding of regexp getting in my way.
    Although, I have yet to figure out how to remove a row once it's been created.
    I am reading an http log file and culling out several data elements from the string. One of them is causing an error (I guess my regexp is incorrect) but I can't figure out how to remove that field from the transform in order to move forward.

    Any help is appreciated.

    Eric United States

    1/27/2010 7:05:50 PM #

    Kyle

    This works great for what I need to do involving dates.  However, I recieve an error when the column is null and the next output is to a derived column (I'm combining the date and time to a datetime after formatting the date and time in regex). "[RegexClean] Error: ocessing with an exception. Value cannot be null. Parameter name: input"


    Any words?

    Kyle United States

    1/28/2010 7:55:14 PM #

    Eric

    Devo,

    I have written a SSIS component called RegExtractor that does what you would like to do.  You can specify a regular expression to operate on any text column and the captured matches of that regex get output to new columns.  The SSIS component is free and open source.  You can get it at:

    http://regextractor.codeplex.com

    Hope this helps (I realize I'm late in my response)

    Eric

    Eric United States

    3/20/2010 2:26:23 AM #

    Spencer

    Can someone expand upon what the Cascade option does? I've tried playing with it, but it doesn't produce any different results whether the box is checked or unchecked. I'm hoping someone can provide an example as to when you would or would not use it.
    Thanks.

    Spencer United States

    8/10/2010 2:19:02 AM #

    andrew

    Have you adapted this for VS 2010 yet?

    andrew Canada

    8/10/2010 1:09:58 PM #

    andrew

    Sorry, I got it installed in the right place.  Really nice utility, thanks.

    andrew Canada

    9/17/2010 10:19:37 PM #

    Dave Bunch

    I want to extract characters [a-zA-Z0-9] from email addresses using the RegexClean component. When I first opened the RegexClean editor, for the input column 'email' I named the output column 'email' and for a 'Match Expression' I used:

    (?<Word>[a-zA-Z0-9]+)

    I did not enter anything into or touch the 'Replace Expression' column. When I ran the package, RegexClean did exactly what I wanted it to:

    Test email address before: sam_smith@gmail.com   after: samsmithgmailcom

    When I reopened the RegexClean editor, I accidentally clicked on the 'Match Expression' and from then on, received the error message: 'Replace Expression property value cannot be null.'  I can't seem to enter any replace expression that extracts the same output as before when I (incorrectly) used the component with an empty 'Replace Expression'. Any suggestions would be appreciated.

    Dave Bunch United States

    10/31/2010 9:28:00 PM #

    Doug

    This tool is pure dynamite!  I am having a ball with it.

    Doug United States

    11/1/2010 10:23:27 PM #

    Doug

    RegexClean does not appear to work with SSIS on SQL Server 2008 R2.  Is this product being maintained?

    Doug United States

    11/9/2010 12:39:14 PM #

    Darren Green

    Doug, Can you explain what problems you are having on R2? All our 2008 components are fully compatible with R2. If you have moved to a new R2 server, make sure you have installed the component on the new machine.

    Darren Green United Kingdom

    11/10/2010 10:11:13 PM #

    Dave Bunch

    Figured out how to remove non-alphanumeric characters from a string

    This was such a pain in the arse to figure out that I wanted to post my solution to hopefully help others. I wanted to use RegexClean to strip non-alphanumeric characters from a string - in this case an email address -- and return only the alphanumeric characters that remained. Here are the RegexClean expressions I used to achieve this:

    Match Expression: (?<stringIN>[^a-zA-Z0-9]+)|(?<stringOUT>[a-zA-Z0-9]+)

    Replace Expression: ${stringOUT}

    I really dig this component now. Thanks for sharing it!

    Dave Bunch United States

    1/25/2011 6:19:48 PM #

    Kenneth Franklin

    I am trying to use this component to clean flat file data. It works perfectly, unless there is no match on the row, when I just get NULL results. Using RegexClean with:
    Match exp: \s{2,100}[|]
    Replace exp: |
    On these two rows (one column of data in the data flow):
    00000|0000000|B0|0000000000000|P0001
    00000|0000000|  |0000000000000|00000
    Yields:
    NULL
    00000|0000000||0000000000000|00000
    Expected:
    00000|0000000|B0|0000000000000|P0001
    00000|0000000||0000000000000|00000

    Row 2 is modified properly, but row 1 is now NULL. This Match/Replace combination works in any regexp tester I have tried, but not with this component. Am I doing something wrong?

    Thanks!

    Kenneth Franklin United States

    3/10/2011 8:38:40 AM #

    Ty Parten

    Thanks for the utilities! They help quite a lot. (Love the trash destination :)

    Is there a way to replace the Match and Replace expressions dynamically prior to entering RegExClean? I'm processing property-value pairs and the configuration of the transform is dynamic from pair to pair. Of course I'm processing the transforms in a loop so that I only have to replace the expressions once per batch. (i.e. not replacing row-by-row, lol)

    I'm familiar with how to set properties ahead of time (empty for loop) but I can't seem to locate the correct ones.

    Thanks!

    Ty Parten United States

    12/21/2011 6:12:43 PM #

    Craig Bobchin

    This looks pretty cool. I have a question about a particular file I'm trying to clean however, I have a | delimited text file, One of the fields (in the middle of the record no less) can contain 1 or more embedded returns before you get to the next field.

    I've got a RegEx macro that I use in Ultraedit to clean the file prior to importing, but I'd like to be able to do it in the SSIS package. This macro removes the embedded returns and puts the record back to one line/record.

    Can this transform handle embedded returns in a field? If so how would I go about setting it up?

    thanks

    Craig Bobchin United States

    12/21/2011 6:44:02 PM #

    Craig Bobchin

    I have one more question. It appears this does not work for text stream (memo) type fields. Is there any hope for adding support for them?

    thanks

    Craig

    Craig Bobchin United States

    Pingbacks and trackbacks (2)+

    Add comment

      Country flag

    biuquote
    • Comment
    • Preview
    Loading

    RecentComments

    Comment RSS