I have a bunch of records that I get from doing multiple joins.

# Name,     ID,    Date,            Gender,   Birthday #
John Doe    111    01/02/2003       M         01/01/2001
Jane Doe    222    03/04/2005       F         02/02/2002
Jane Doe    222    **03/05/2005**   F         02/02/2002
Jim Doe     333    07/08/2009       M         10/11/2012

What I am trying to get is shown below. The records vary in only one column (in this case, the date column). All other columns are exactly the same.

Note that I only want one row for Jane Doe, and it should be the row with the more recent date.
# Name,     ID,    Date,            Gender,   Birthday #
John Doe    111    01/02/2003       M         01/01/2001
Jane Doe    222    **03/05/2005**   F         02/02/2002
Jim Doe     333    07/08/2009       M         10/11/2012

Is doing a self-join the right way to go, and if yes, can someone please explain how?

Thanks!


A self join would make sense in a scenario where, let's say, you have an employees table and the table has a column that identifies each employee's manager. Since all employees are in the same table, you perform a self join to figure out which employees a manager has, or to get a listing of all employees with their manager's name and other info.
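
For example, here is a minimal sketch of that kind of self join (assuming a hypothetical Employees table with EmployeeID, Name, and ManagerID columns, where ManagerID points back at another row's EmployeeID):

-- list each employee alongside their manager's name
SELECT e.Name AS EmployeeName,
       m.Name AS ManagerName
FROM Employees e
     LEFT JOIN Employees m          -- same table under a second alias
         ON e.ManagerID = m.EmployeeID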

In your example rows, the results are already unique: each row differs from the others in at least one column, so no two rows are exact duplicates.

Modifying this query to match your table and column names should remove the duplicates using a self-join:

SELECT T.name, T.ID, T.[date], T.gender, T.birthday
FROM <tbl> T
        INNER JOIN (SELECT ID, MAX([date]) AS MaxDate   -- latest date per ID
                    FROM <tbl>
                    GROUP BY ID) T2
            ON T2.MaxDate = T.[date]
            AND T2.ID = T.ID

It appears you're querying against data that is modeled like a type 2 slowly changing dimension. In that case, yes, a self-join is your best bet for retrieving the most relevant record.

commented: This worked perfectly!

Using cutepinkbunnies' response, I am now dumping all the data, including the "duplicates", into a temp table and then querying against that to get the "unique" entries. The reason I am using a temp table is that the original dataset is the result of 5 inner joins on tables that have thousands of records, and the lack of proper indexing on those source tables slows the response down significantly.

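Roughly what that looks like (just a sketch; the <tbl>-style placeholder stands in for the real 5-table join):

-- dump the raw joined rows, duplicates included, into a temp table once
SELECT name, ID, [date], gender, birthday
INTO #joined
FROM <the original 5-table join>

-- then run the MAX([date]) self-join from the earlier reply against #joined
-- instead of hitting the slow, unindexed source tables repeatedly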

As another solution, you could create a persisted table (or any table that can carry a unique index on the ID column) and use the IGNORE_DUP_KEY option (set to ON). Order the data coming from the source query most-relevant-first (ORDER BY ID, [Date] DESC) so the row you want to keep is inserted first, and you get the results you want from the source without all the extra querying. The unique key will reject subsequent duplicates of the ID column, so if the rows are ordered properly, the table will retain only the relevant records.
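
A rough sketch of that idea (assuming SQL Server; the table and index names here are made up for illustration):

-- staging table with a unique index that silently discards duplicate IDs
CREATE TABLE #latest
(
    name     varchar(100),
    ID       int,
    [date]   datetime,
    gender   char(1),
    birthday datetime
)

CREATE UNIQUE INDEX IX_latest_ID ON #latest (ID) WITH (IGNORE_DUP_KEY = ON)

-- insert the most relevant row per ID first; later rows with the same ID are ignored
INSERT INTO #latest (name, ID, [date], gender, birthday)
SELECT name, ID, [date], gender, birthday
FROM <tbl>
ORDER BY ID, [date] DESC

As the reply above notes, this relies on the rows arriving most-relevant-first, so the ORDER BY matters.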

Just another trick to remove duplicates, especially when you don't want to go messing with indexes for whatever reason. Give it a try if you like; you might find it easier or quicker to implement that logic.

Well, here is a simpler way:

-- delete the older row(s), keeping only the row with the latest date for each ID
DELETE t1
FROM <tbl_> t1, <tbl_> t2
WHERE t1.ID = t2.ID
  AND t1.[Date] < t2.[Date]
