We're a community of 1077K IT Pros here for help, advice, solutions, professional growth and fun. Join us!
1,076,489 Members — Technology Publication meets Social Media
Username:
Password:
Lost login information?
Start New Discussion Reply to this Discussion

Self-join to remove duplicates

I have a bunch of records that I get from doing multiple joins.

# Name,   ID,     Date,      Gender,     Birthday #
John Doe  111   01/02/2003     M         01/01/2001
Jane Doe  222   03/04/2005     F         02/02/2002
Jane Doe  222   **03/05/2005**     F         02/02/2002
Jim Doe   333   07/08/2009     M         10/11/2012

What I am trying to get is the below. The records vary only in one column (in this case, it's the date column). All other columns are exactly same. 

Note that I only want one row for Jane Doe. This should be the row where date is newer (or more recent).
# Name,   ID,     Date,      Gender,     Birthday #
John Doe  111   01/02/2003     M         01/01/2001
Jane Doe  222   **03/05/2005**    F         02/02/2002
Jim Doe   333   07/08/2009     M         10/11/2002

Is doing a self-join the right way to go, and if yes, can someone please explain how?

Thanks!

4
Contributors
5
Replies
1 Day
Discussion Span
10 Months Ago
Last Updated
6
Views
cheapterp
Light Poster
35 posts since Jun 2008
Reputation Points: 10
Solved Threads: 2
Skill Endorsements: 0

A self join would make sense in a scenario where let's say you have an employee's table and the table has a column that identifies the employee's manager. Since all employees are in the same table, you perform a self join to figure out which employees a manager has, or a listing of all employees with thier manager's name and other info.

In your example of 2-5, you have unique results. Each row itself is unique from the others.

JorgeM
Industrious Poster
4,024 posts since Dec 2011
Reputation Points: 297
Solved Threads: 549
Skill Endorsements: 115

Modifying this query to satisfy your metadata should remove the duplicates using a self-join.

SELECT name, ID, [date], gender, birthday
FROM <tbl> T
        INNER JOIN (SELECT ID, MAX([date]) MaxDate
                    FROM <tbl>
                    GROUP BY ID) T2
            ON t2.MaxDate = T.[date]
            AND t2.id = T.id

It appears you're querying against data that is normalized like that of a type 2 slowly changing dimension. In that case yes a self join is your best bet for retrieving the most relevant record.

cutepinkbunnies
Junior Poster
157 posts since Apr 2006
Reputation Points: 15
Solved Threads: 9
Skill Endorsements: 0

Using cutepinkbunnies' response, I am now dumping all the data including the "duplicates" into a temp table and then querying against that to get the "unique" entries. The reason I am including a temp table is because the original dataset is a result of 5 inner joins on tables that have thousands of records. No proper indexing on these source tables is slowing down the response significantly.

cheapterp
Light Poster
35 posts since Jun 2008
Reputation Points: 10
Solved Threads: 2
Skill Endorsements: 0

Using cutepinkbunnies' response, I am now dumping all the data including the "duplicates" into a temp table and then querying against that to get the "unique" entries. The reason I am including a temp table is because the original dataset is a result of 5 inner joins on tables that have thousands of records. No proper indexing on these source tables is slowing down the response significantly.

As another solution, you could create a persisted table (or one that will handle an index on the ID column) and use the IGNORE_DUP_KEY option (set to ON that is). Order the data coming from the source query as most relevant first ORDER BY ID, [Date] DESC to get the results you want from the source without all the querying. The Dup Key will reject duplicates of the ID column, if you have them ordered properly, the table will retain only the relevant records.

Just another trick to remove duplicates, especially when you don't want to go messing with indexes for whatever reason. Give it a try if you like, you might find it easier or quicker to implement that logic.

cutepinkbunnies
Junior Poster
157 posts since Apr 2006
Reputation Points: 15
Solved Threads: 9
Skill Endorsements: 0

well hereis a more simpler way
Delete t1 from <tbl_> t1,<tbl_> t2
where t1.ID= t2.ID
and t1.[Date] <t2.[Date]

atinobrian
Newbie Poster
22 posts since Sep 2005
Reputation Points: 17
Solved Threads: 4
Skill Endorsements: 0

This article has been dead for over three months: Start a new discussion instead

Post: Markdown Syntax: Formatting Help
 
You
View similar articles that have also been tagged:
 
© 2013 DaniWeb® LLC
Page rendered in 0.5286 seconds using 2.68MB