Hey there,

I have a code to remove the duplicates in one of my table but I cant seem to be able to ORDER the way mysql select the values.

Here is my initial table: http://img196.imageshack.us/img196/2421/tablesw.jpg

mysql_query("CREATE TABLE new_table as SELECT * FROM test GROUP BY url") 
mysql_query("DROP TABLE test") 
mysql_query("RENAME TABLE new_table TO test")

I want to remove the duplicated URL and make sure that it keeps only the value with the higher PR. I tried using the ORDER BY attribute but it doesnt seem to work. The first 3 values with a PR of 0 are always selected and placed in the new table...

The good result should be that the 3 values (a,b,c) with a PR of 6 are found in the final table.

I tried this and it doesnt work:

mysql_query("CREATE TABLE new_table as SELECT * FROM test GROUP BY url ORDER BY pr DESC")

I am looking for a fast way to achieve this because my table have over 200k entries.


Member Avatar

If you want to insert all rows with PR >= 0 in the new table then follow this query (or alter the condition as required):

create table new_table as select url,pr from tbl where pr >= 0 order by pr desc;

Actually the order by is not required, it is just to put the rows with higher rank on the top. It doesn't matter, you can always extract the thing using parameters in select query.

if i understand you correctly, you would like to delete all duplicate urls and only the url with the maximum pr value should remain.

As for your example after deleting the duplicates the result set should then be:
url  pr
a    6
b    6
c    6

The below code functions well on ms sqlserver and sybase:
delete from mytable x from mytable x, mytable y
  where x.url = y.url and x.pr < y.pr;

In my case I had had some more columns to check whether duplicates exist. I took this old code and reduced the recursive references to only x and y to meet your requirement where only column URL exists containing duplicates. I hope MySQL has similar syntax and it will work. (You may also try on MySQL this: delete x from mytable x ... Furthermore, instead of second FROM clause MySQL possibly supports USING clause, so look at SQL Reference version 5.1...)

-- tesu

Addition: Delete statement must be finalized with COMMIT

-- tesu

Member Avatar

Addition: Delete statement must be finalized with COMMIT

-- tesu

No its not required. COMMIT will produce an error if a transaction is not running and need not be used unless START TRANSACTION is issued before doing any Insert/Update/Delete queries.

Oh yes itech7, you are in the right!

Today there are database systems which require that a single logical unit of work be surrounded by START TRANSACTION and COMMIT (or sometimes ROLLBACK). However, there are highly professional DBMS where START TRANSACTION is not mandatory, for example ORACLE, except for MySQL.

Nevertheless, the COMMIT (or sometimes ROLLBACK) after the above-mentioned INSERT statement is unambiguously required for any INSERT must always be part of a single logical unit of work unless you are a real gambler always playing Russian Roulette.

One should also think of that data inserted by INSERT without a final COMMIT will not be persistently stored in database (again, unless you are a real gambler always playing Russian Roulette).

-- tesu

Sorry, in my above posting INSERT should be replaced by DELETE to refer to the original post! My given considerations concern to INSERT, UPDATE and DELETE though.

-- tesu

Member Avatar

I have used Oracle 9.1 only which required those COMMIT statments after Insert/Update/Delete. It hasn't changed its behavior in the recent versions too ?
PostgreSQL (often called Opensource Oracle), behaves like MySQL in the transaction part. If you issue a START TRANSACTION then you must COMMIT it, otherwise its not needed.