Get all from Table A that isn't in Table B

by Guest 1 Sep 2004 14:00

(By Jamie Thomson)

A common requirement when building a data warehouse is to get all rows from a staging table whose business key is not in the dimension table. For example, I may want all rows from my STG_DATE table where the DateID is not in DIM_DATE.DateID.

There are two ways to do this in conventional SQL.

-- Method #1
INSERT    DIM_DATE
SELECT    s.*
FROM      STG_DATE s
LEFT OUTER JOIN DIM_DATE d
ON        s.DateID = d.DateID
WHERE     d.DateID IS NULL

-- Method #2
INSERT    DIM_DATE
SELECT    s.*
FROM      STG_DATE s
WHERE     s.DateID NOT IN (
    SELECT DISTINCT DateID
    FROM   DIM_DATE d
)
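
A third variant, not part of the original pair, uses NOT EXISTS. It behaves identically here because DateID is declared NOT NULL, but it avoids the well-known NOT IN pitfall where a NULL in the subquery causes no rows to be returned at all:

-- Method #3 (NOT EXISTS variant)
INSERT    DIM_DATE
SELECT    s.*
FROM      STG_DATE s
WHERE     NOT EXISTS (
    SELECT 1
    FROM   DIM_DATE d
    WHERE  d.DateID = s.DateID
)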

There are many cases where conventional T-SQL is not an option for achieving this. Perhaps the data is on different servers, or perhaps STG_DATE isn't even a table; it may be a text file. In these cases you may have to use DTS to achieve the required result.

There are two methods of doing this in DTS 2005. The first is analogous to the first SQL statement above, whereas the second builds on some of the new functionality in DTS 2005.

We're going to need a source and a destination table for demonstration purposes. Let's create them now. For simplicity we will place both tables in the same database.

IF EXISTS (SELECT * FROM sys.tables WHERE NAME = 'STG_DATE') DROP TABLE STG_DATE
CREATE TABLE STG_DATE
(
DateID       INT        NOT NULL    PRIMARY KEY
,   TheDate      DATETIME   NOT NULL
,   DayOfMonth   INT        NOT NULL
,   DayOfYear    INT        NOT NULL
,   DayOfWeek    INT        NOT NULL
,   DayName      VARCHAR(9) NOT NULL
,   WeekOfYear   INT        NOT NULL
,   WeekName     VARCHAR(7) NOT NULL
,   MonthOfYear  INT        NOT NULL
,   MonthName    VARCHAR(9) NOT NULL
,   Quarter      INT        NOT NULL
,   QuarterName  CHAR(7)    NOT NULL
,   Year         INT        NOT NULL
,   Hour         INT        NOT NULL
,   Minute       INT        NOT NULL
,   Second       INT        NOT NULL
,   IsWeekday    BIT        NOT NULL
,   DayOfQuarter INT        NOT NULL
)
IF EXISTS (SELECT * FROM sys.tables WHERE NAME = 'DIM_DATE') DROP TABLE DIM_DATE
CREATE TABLE DIM_DATE
(
DateID       INT        NOT NULL    PRIMARY KEY
,   TheDate      DATETIME   NOT NULL
,   DayOfMonth   INT        NOT NULL
,   DayOfYear    INT        NOT NULL
,   DayOfWeek    INT        NOT NULL
,   DayName      VARCHAR(9) NOT NULL
,   WeekOfYear   INT        NOT NULL
,   WeekName     VARCHAR(7) NOT NULL
,   MonthOfYear  INT        NOT NULL
,   MonthName    VARCHAR(9) NOT NULL
,   Quarter      INT        NOT NULL
,   QuarterName  CHAR(7)    NOT NULL
,   Year         INT        NOT NULL
,   Hour         INT        NOT NULL
,   Minute       INT        NOT NULL
,   Second       INT        NOT NULL
,   IsWeekday    BIT        NOT NULL
,   DayOfQuarter INT        NOT NULL
)

And we're going to need some data in STG_DATE. If you don't edit it, the following script will create a data set from 1st January 1900 to 31st December 2050, which results in 55152 rows.

SET DATEFIRST 1
DECLARE    @startdate  DATETIME
DECLARE    @enddate    DATETIME
DECLARE    @date       DATETIME
DECLARE    @id         INT
SET        @startdate  =    '1900-01-01'  --Change these to 
SET        @enddate    =    '2050-12-31'  --whatever you want
SET        @id         =    0
SET        @date       =    DATEADD(dd, @id, @startdate)
WHILE    @date <= @enddate
BEGIN
INSERT INTO    STG_DATE
VALUES (@id                                                            --DateID
,    @date                                                             --TheDate
,    DATEPART(dd, @date)                                               --DayOfMonth
,    DATEPART(dy, @date)                                               --DayOfYear
,    DATEPART(dw, @date)                                               --DayOfWeek
,    DATENAME(dw, @date)                                               --DayName
,    DATEPART(ww, @date)                                               --WeekOfYear
,    'Week ' + RIGHT('0' + DATENAME(ww, @date), 2)                     --WeekName
,    DATEPART(mm, @date)                                               --MonthOfYear
,    DATENAME(mm, @date)                                               --MonthName
,    DATEPART(qq, @date)                                               --Quarter
,    'Q' + DATENAME(qq, @date) + ' ' + DATENAME(yy, @date)             --QuarterName
,    DATEPART(yy, @date)                                               --Year
,    DATEPART(hh, @date)                                               --Hour
,    DATEPART(mi, @date)                                               --Minute
,    DATEPART(ss, @date)                                               --Second
,    CASE 
WHEN DATEPART(dw, @date) IN (6,7) THEN 0
ELSE 1
END                                                               --IsWeekday
,    (DATEDIFF(DAY, DATEADD(qq, DATEDIFF(qq, 0, @date), 0), @date) + 1)   --DayOfQuarter
)
SET  @id    =    @id + 1
SET  @date  =    DATEADD(dd, @id, @startdate)
END
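
To sanity-check the generated data before moving on, a quick query like the following should show 55152 rows spanning the chosen range (assuming the default @startdate and @enddate above):

SELECT  COUNT(*)     AS RowsGenerated
,       MIN(TheDate) AS FirstDate
,       MAX(TheDate) AS LastDate
FROM    STG_DATE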

Our aim is quite simply to get all the data from STG_DATE into DIM_DATE using DTS 2005.

Method 1

This method is an implementation of the LEFT OUTER JOIN with an IS NULL clause SQL statement that you see at the top of this article.

Let's have a look at the data flow.

Data Flow

The STG_DATE source and DIM_DATE source OLE DB Source adapters point to the two tables we have just created. I am assuming that you are familiar with OLE DB Source adapters and know how to set up connections to the two tables. The two Order by DateID Sort transformations do exactly what they say on the tin; they are needed because the Merge Join transformation that follows requires both of its inputs to be sorted on the join column.

Now we are going to join the data sourced from the STG_DATE and DIM_DATE tables.

  • Drag a Merge Join transformation onto the designer and rename it Left Outer Join on DateID
  • Drag the green connector from Order by DateID STG_DATE to Left Outer Join on DateID
  • The Input Output Selection dialog will appear. From the "Input:" combo box select "Merge Join Left Input". We are modeling a SQL Left Outer Join hence it is important that the correct data flow is applied to the correct Left Outer Join on DateID input
  • Double click Left Outer Join on DateID to enter the Merge Join Editor
  • Change the Join Type to "Left Outer Join"
  • Select all the check boxes next to the columns in the left input. This will add them to the output data flow from Left Outer Join on DateID
  • Select the DateID column from the right input and give it an alias of DIM_DATE_DateID. This will add the DateID column to the output data flow from Left Outer Join on DateID
Your Merge Join Editor should now look something like the following.

Merge Join

The Conditional Split transformation is used to split a data flow into multiple data flows based on the state of the data. In this case we are going to use it to identify all the data that has a NULL value in the DIM_DATE_DateID column.

  • Drag a Conditional Split transformation onto the data flow designer
  • Rename the Conditional Split transformation as Get New Dates
  • Drag the output connector from Left Outer Join on DateID to Get New Dates
  • Double click Get New Dates to view the Conditional Split Editor
  • Expand Columns in the top left treeview and drag the DIM_DATE_DateID column into first row in the bottom half of the editor. This will enable us to build a condition on this column that allows us to define what data goes into this output
  • Change the Condition to "ISNULL([DIM_DATE_DateID])". This will ensure that only rows that have the value NULL in the DIM_DATE_DateID column will be included in this output
  • Change the Output Name to "New Rows"
Your Conditional Split Editor should now look like this.

Conditional Split

Finally, we need to put the data somewhere at the end of the data flow.

  • Drag an OLE DB Destination adapter onto the designer
  • Rename it DIM_DATE destination
  • Drag the output connector from Get New Dates Conditional Split to DIM_DATE destination
  • The Input Output Selection dialog will appear. From the "Output:" combo box select "New Rows"
  • Double click DIM_DATE destination
  • On the Connection tab select the DIM_DATE table
  • Click on the Mappings tab. The correct mappings will be created automatically by matching columns with identical names

And that's it! Executing this data flow will result in all 55152 rows being copied from STG_DATE to DIM_DATE.
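
If you want to verify the result, a quick count of both tables should show matching totals (55152 each, given the sample data):

SELECT  (SELECT COUNT(*) FROM STG_DATE) AS StagingRows
,       (SELECT COUNT(*) FROM DIM_DATE) AS DimensionRows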

Method 2

This method uses the new DTS Lookup transformation. Before running this you will have to empty the destination table that you have just populated with method 1.

  • Execute TRUNCATE TABLE DIM_DATE in SQL Server Management Studio

Let's have a look at the data flow.

Data Flow

The STG_DATE OLE DB Source adapter points to the STG_DATE table that we created earlier. Again I will assume you can configure this yourself.

Now drag a Lookup transformation onto your data flow designer.

  • Rename the Lookup as Lookup DIM_DATE
  • Drag the green connector from STG_DATE to Lookup DIM_DATE
  • Double click Lookup DIM_DATE to open the Lookup Editor
  • In the "Connection" combo box select the connection that points to the DIM_DATE table
  • From the "Use a table or a view" combo box select the DIM_DATE table. This is specifying that we are going to use the DIM_DATE table as a lookup
  • Click the Columns tab
  • Mappings between the input columns and the lookup columns will have been created automatically. Delete all mappings except the one between the DateID columns. This ensures that we will use values from the lookup table where the values in the DateID columns are equal

The Lookup Editor Columns tab should look like the following.

Lookup

The error output from Lookup DIM_DATE will contain all the rows for which there was no matching lookup record in the lookup table. In other words, it will contain all the rows that are not already in the lookup table. This is not an error per se but it achieves our aim.

Drag a new OLE DB Destination adapter onto the data flow designer.

  • Rename the adapter DIM_DATE
  • Drag the red Error output from Lookup DIM_DATE to DIM_DATE
  • The "Configure Error Output" dialog will be displayed. In the Error combo box change the selection to "Redirect Row" and click OK. This will ensure that no error is raised when rows get added to the Error output
  • Double click DIM_DATE to display the OLE DB Destination Editor
  • Configure DIM_DATE to point to the appropriate table and click on the mappings tab. The mappings will be configured automatically.

And that is it! Again, executing this data flow will result in all 55152 rows being copied from STG_DATE to DIM_DATE.
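
Whichever method you use, the anti-join from Method #1 doubles as a verification query; after a successful run it should return 0:

-- Rows in STG_DATE still missing from DIM_DATE (expect 0)
SELECT  COUNT(*) AS MissingRows
FROM    STG_DATE s
LEFT OUTER JOIN DIM_DATE d
ON      s.DateID = d.DateID
WHERE   d.DateID IS NULL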

Conclusion

There are two methods that you can use to accomplish this requirement. Method 2 is quicker to build, although some people may still prefer the first method because they are less comfortable using the error output for something it was not designed for. Experiment to see which method works best for your data set.

Comments (18) -

2/20/2009 8:44:29 AM #

RK

This helped me a lot.

RK

2/24/2009 8:45:47 AM #

marvin

Yeah, these two methods only insert new records into the designated table. What if we want to update an existing record in the designated table when there is a match? How would you handle it?

marvin

2/24/2009 2:01:53 PM #

Allan Mitchell

Consume the output that has the matches; in the first example it is the output "Conditional Split Default Output". Then I would test those rows further using a Derived Column transform or similar. Just because you have seen the row before does not mean you have to update the existing version. I might do a test similar to

InValueColumn != ExistValueColumn

This will tell me if a column value has changed between this run and the existing entry. I personally would then stage the rows that require UPDATEing and UPDATE them in an Execute SQL task in the control flow. You could use an OLE DB Command transform to do the same thing, but perf will suffer as you increase the number of rows that require updating.

Allan Mitchell United Kingdom
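
Allan's staged-UPDATE suggestion might be sketched in T-SQL along these lines; the STG_CHANGED table and SomeValue column are hypothetical names used purely for illustration:

-- Hypothetical: STG_CHANGED holds matched rows whose values differ
UPDATE  d
SET     d.SomeValue = s.SomeValue
FROM    DIM_DATE d
INNER JOIN STG_CHANGED s
ON      d.DateID = s.DateID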

3/12/2009 10:45:19 AM #

Peter

May I suggest using the TableDifference component from sqlbi.eu for the scenario marvin asked about?
This makes it very easy to do all the comparisons from one component.
It is my preferred component for these scenarios, as it is a lot faster than the Slowly Changing Dimension transformation when you are working with a serious number of records, and you don't have to add and configure all sorts of expressions and conditional filters to find out what was changed and what was not.

Peter

3/16/2009 8:30:32 PM #

Syed

Checksum (CRC32) works fine in dev and works as expected in prod most of the time. But at times, Prod job fails with this message:

Executed as user: OPENTEXT\wlsqld01usr. Microsoft (R) SQL Server Execute Package Utility  Version 10.0.1787.0 for 64-bit  Copyright (C) Microsoft Corp 1984-2005. All rights reserved.    Started:  12:30:55 PM  Error: 2009-03-13 12:31:13.66     Code: 0xC0047062     Source: X_OT_SOP_RAM Checksum [1037]     Description: System.ArgumentException: Item has already been added. Key in dictionary: '79764919'  Key being added: '79764919'     at System.Collections.Hashtable.Insert(Object key, Object nvalue, Boolean add)     at System.Collections.Hashtable.SyncHashtable.Add(Object key, Object value)     at Konesans.Dts.Component.Helpers.CRC32..ctor(UInt32 aPolynomial, Boolean cacheTable)     at Konesans.Dts.Pipeline.ChecksumTransform.ChecksumTransform.PreExecute()     at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostPreExecute(IDTSManagedComponentWrapper100 wrapper)  End Error  Error: 2009-03-13 12:31:13.67     Code: 0xC0047062     Source: X_OT_SOP_LINE_WORK ChecksumDST [307]     Description: System.ArgumentException: Item has already been added. Key in dictionary: '79764919'  Key being added: '79764919'     at System.Collections.Hashtable.Insert(Object key, Object nvalue, Boolean add)     at System.Collections.Hashtable.SyncHashtable.Add(Object key, Object value)     at Konesans.Dts.Component.Helpers.CRC32..ctor(UInt32 aPolynomial, Boolean cacheTable)     at Konesans.Dts.Pipeline.ChecksumTransform.ChecksumTransform.PreExecute()     at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostPreExecute(IDTSManagedComponentWrapper100 wrapper)  End Error  Error: 2009-03-13 12:31:13.67     Code: 0xC004701A     Source: X_OT_SOP_RAM SSIS.Pipeline     Description: component "Checksum" (1037) failed the pre-execute phase and returned error code 0x80070057.  
End Error  Error: 2009-03-13 12:31:13.67     Code: 0xC004701A     Source: X_OT_SOP_LINE_WORK SSIS.Pipeline     Description: component "ChecksumDST" (307) failed the pre-execute phase and returned error code 0x80070057.  End Error  DTExec: The package execution returned DTSER_FAILURE (1).  Started:  12:30:55 PM  Finished: 1:02:19 PM  Elapsed:  1884.63 seconds.  The package execution failed.  The step failed.


Does anybody have any idea how to fix it?

Syed Canada

3/19/2009 3:50:56 AM #

SDangol

This is very helpful. Method 2 is a lot easier.

SDangol United States

4/16/2009 4:20:39 PM #

Chengshu

Hi Syed:
We have run into the same issue as you described here. We run the checksum package on a 64-bit production machine and we have seen this error from time to time. We are looking for a solution as well. Any ideas?

OnError,DBSW0212,DMZMGMT\sgalvining,Validate Transform Data,{DD72865F-C8B1-4DF4-BF33-3724865CE957},{4E0CB18F-0373-414C-9852-4F29B28BD627},4/13/2009 5:03:49 AM,4/13/2009 5:03:49 AM,-1073450910,0x,System.ArgumentException: Item has already been added. Key in dictionary: '79764919'  Key being added: '79764919'
   at System.Collections.Hashtable.Insert(Object key, Object nvalue, Boolean add)
   at System.Collections.Hashtable.SyncHashtable.Add(Object key, Object value)
   at Konesans.Dts.Component.Helpers.CRC32..ctor(UInt32 aPolynomial, Boolean cacheTable)
   at Konesans.Dts.Pipeline.ChecksumTransform.ChecksumTransform.PreExecute()
   at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostPreExecute(IDTSManagedComponentWrapper90 wrapper)

OnError,DBSW0212,DMZMGMT\sgalvining,Provider,{6F1B4233-8A08-4688-827F-FB98DC7F975F},{4E0CB18F-0373-414C-9852-4F29B28BD627},4/13/2009 5:03:49 AM,4/13/2009 5:03:49 AM,-1073450910,0x,System.ArgumentException: Item has already been added. Key in dictionary: '79764919'  Key being added: '79764919'
   at System.Collections.Hashtable.Insert(Object key, Object nvalue, Boolean add)
   at System.Collections.Hashtable.SyncHashtable.Add(Object key, Object value)
   at Konesans.Dts.Component.Helpers.CRC32..ctor(UInt32 aPolynomial, Boolean cacheTable)
   at Konesans.Dts.Pipeline.ChecksumTransform.ChecksumTransform.PreExecute()
   at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostPreExecute(IDTSManagedComponentWrapper90 wrapper)

Chengshu United States

4/19/2009 3:55:34 PM #

Sarabjeet Singh

This is a perfect example for data cleansing. It helps a lot. Cheers!!!
Thanks for posting this.

Sarabjeet Singh United States

4/21/2009 7:27:29 PM #

chengshu wang

Hey, Sarabjeet

I am not able to understand what your comment means here. We have the issue and are looking for any fix or solution. Does anyone know what the issue is and how to fix it?

chengshu wang United States

5/13/2009 8:41:28 AM #

Kevin Wils

Hi,

Excellent examples.

Example 1 gives me an error though on the final step 'Destination'.
[SQL Server Destination [286]] Error: Unable to prepare the SSIS bulk insert for data insertion.
[DTS.Pipeline] Error: component "SQL Server Destination" (286) failed the pre-execute phase and returned error code 0xC0202071.


I don't get this error when I go to the advanced settings of the destination and check 'Table lock'.
Is there another way to avoid this error, perhaps a SQL Server setting? Or do I have to check this setting on most of the packages I create in the future?

Thanks

Kevin Wils Belgium

6/9/2009 8:42:27 PM #

SKim

I have been trying to make Method 2 work but I keep getting this error:

[Insert Destination [14940]] Error: SSIS Error Code DTS_E_OLEDBERROR.  An OLE DB error has occurred. Error code: 0x80004005. An OLE DB record is available.  Source: "Microsoft SQL Native Client"  Hresult: 0x80004005  Description: "Could not complete cursor operation because the table schema changed after the cursor was declared.".

SKim United States

6/11/2009 9:20:55 PM #

Michael Grey

Excellent man,

I was asked 1 hour ago to build a DTS in 2 hours.
Thanks to you I will get it done on time!

Thanks!!

Michael Grey Argentina

5/11/2010 4:23:38 PM #

Kye

I have a situation where I am updating a MS SQL (2005) Server DB nightly from a MS Access DB on a remote server. Data streams into the MS Access DB at the rate of 5 to 6 thousand records per day, from numerous sites. I have been using the second solution shown here until last week when we reached the point where the SQL server’s RAM memory was not able to hold the 20+ million records in the SQL DB to do the comparison. (The Access DB is archived periodically, so it is not nearly as large.) I tried the first solution shown above, but that, too, was not solving the problem. I came up with something that not only works, but also runs much faster. By the way, I also have the added complication that data are associated with two fields; a datetime and a site_ID field. Some sites (data streams) add records daily and some do not.

FIRST I made a VIEW in the SQL DB:
SELECT DISTINCT TOP (100) PERCENT MAX(Date_Time) AS maxtime, site_filename_ID
FROM         dbo.Valuestbl
GROUP BY site_filename_ID
ORDER BY site_filename_ID

This gives me the most recently downloaded time for each site.
  
The Merge Join is set up linking the sorted table from the MS Access DB to the sorted result from this view. Link is on site_ID (left outer join).
  
The Conditional Split takes the result of this join and outputs a "New Rows" output on the condition of "Date_Time (from MS Access) > maxtime (from SQL view)".

And to cover the sites that have no new records (which result in a NULL value for the comparison) the Error Output in the Conditional Split is set to Ignore Failure Errors.

I have screenshots of all this if anyone is interested! Couldn't figure out how to get them to go into this comment.

Kye United States

8/5/2010 7:44:17 AM #

AmmarR

A very useful article.

It worked for me and helped me a lot.

I spent a lot of time looking for a solution, and with this article I found it.

AmmarR Bahrain

9/8/2010 3:46:14 PM #

Xavier Averbouch

great article!
thanks :-)

Xavier Averbouch France

12/24/2010 1:40:08 PM #

Dmitry

Works great! Thanks!
But in my case Method 1 works only if the condition in the Conditional Split is !ISNULL() // NOT ISNULL

Dmitry Belarus

2/11/2011 11:43:40 PM #

Tom

Excellent article.  very useful and clear instructions.  Thanks

Tom United States

4/19/2011 10:44:04 PM #

Kiran

Excellent

Kiran United States
