I have a bunch of records that I get from doing multiple joins.

# Name,     ID,    Date,            Gender,   Birthday #
John Doe    111    01/02/2003       M         01/01/2001
Jane Doe    222    03/04/2005       F         02/02/2002
Jane Doe    222    **03/05/2005**   F         02/02/2002
Jim Doe     333    07/08/2009       M         10/11/2012

What I am trying to get is shown below. The records vary in only one column (in this case, the date column). All other columns are exactly the same.

Note that I only want one row for Jane Doe, and it should be the row with the more recent date.
# Name,     ID,    Date,            Gender,   Birthday #
John Doe    111    01/02/2003       M         01/01/2001
Jane Doe    222    **03/05/2005**   F         02/02/2002
Jim Doe     333    07/08/2009       M         10/11/2012

Is doing a self-join the right way to go, and if yes, can someone please explain how?

Thanks!


A self join would make sense in a scenario where, let's say, you have an employees table and the table has a column that identifies each employee's manager. Since all employees are in the same table, you perform a self join to figure out which employees a manager has, or to get a listing of all employees with their manager's name and other info.
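
For example, here is a minimal sketch of that kind of self join (assuming a hypothetical Employees table with EmployeeID, Name, and ManagerID columns, where ManagerID points back at another row's EmployeeID):

-- list each employee alongside their manager's name
SELECT e.Name AS EmployeeName,
       m.Name AS ManagerName
FROM Employees e
     LEFT JOIN Employees m          -- same table under a second alias
         ON e.ManagerID = m.EmployeeID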

In your example rows, the results are already unique: each row differs from the others in at least one column, so no two rows are exact duplicates.

Modifying this query to match your table and column names should remove the duplicates using a self-join:

SELECT T.name, T.ID, T.[date], T.gender, T.birthday
FROM <tbl> T
        INNER JOIN (SELECT ID, MAX([date]) AS MaxDate   -- latest date per ID
                    FROM <tbl>
                    GROUP BY ID) T2
            ON T2.MaxDate = T.[date]
            AND T2.ID = T.ID

It appears you're querying against data that is modeled like a type 2 slowly changing dimension. In that case, yes, a self-join is your best bet for retrieving the most relevant record.

commented: This worked perfectly!

Using cutepinkbunnies' response, I am now dumping all the data, including the "duplicates", into a temp table and then querying against that to get the "unique" entries. The reason I am using a temp table is that the original dataset is the result of 5 inner joins on tables that have thousands of records, and the lack of proper indexing on those source tables slows the response down significantly.

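Roughly what that looks like (just a sketch; the <tbl>-style placeholder stands in for the real 5-table join):

-- dump the raw joined rows, duplicates included, into a temp table once
SELECT name, ID, [date], gender, birthday
INTO #joined
FROM <the original 5-table join>

-- then run the MAX([date]) self-join from the earlier reply against #joined
-- instead of hitting the slow, unindexed source tables repeatedly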

As another solution, you could create a persisted table (or any table that can carry a unique index on the ID column) and use the IGNORE_DUP_KEY option (set to ON). Order the data coming from the source query most-relevant-first (ORDER BY ID, [Date] DESC) so the row you want to keep is inserted first, and you get the results you want from the source without all the extra querying. The unique key will reject subsequent duplicates of the ID column, so if the rows are ordered properly, the table will retain only the relevant records.
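
A rough sketch of that idea (assuming SQL Server; the table and index names here are made up for illustration):

-- staging table with a unique index that silently discards duplicate IDs
CREATE TABLE #latest
(
    name     varchar(100),
    ID       int,
    [date]   datetime,
    gender   char(1),
    birthday datetime
)

CREATE UNIQUE INDEX IX_latest_ID ON #latest (ID) WITH (IGNORE_DUP_KEY = ON)

-- insert the most relevant row per ID first; later rows with the same ID are ignored
INSERT INTO #latest (name, ID, [date], gender, birthday)
SELECT name, ID, [date], gender, birthday
FROM <tbl>
ORDER BY ID, [date] DESC

As the reply above notes, this relies on the rows arriving most-relevant-first, so the ORDER BY matters.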

Just another trick to remove duplicates, especially when you don't want to go messing with indexes for whatever reason. Give it a try if you like; you might find it easier or quicker to implement that logic.

Well, here is a simpler way:

-- delete the older row(s), keeping only the row with the latest date for each ID
DELETE t1
FROM <tbl_> t1, <tbl_> t2
WHERE t1.ID = t2.ID
  AND t1.[Date] < t2.[Date]
