## resmi sanker

anyone know how genetic programming works on finding duplicate records in database?

## dmanw100 36

How would you structure such an algorithm? A genetic algorithm won't be better than an algorithm specifically designed for solving your problem. Genetic algorithms are generally useful for problems where there isn't a known solution or that solution is not very efficient.

Remember that in your problem every item must be compared to every other item. At best an algorithm for n items will make a total of n + (n - 1) + (n - 2) + ... comparisons. A genetic algorithm can take thousands of generations to become useful. I'm not an expert, but I wouldn't recommend using genetic algorithms to look for duplicates.

## pyTony 888

Duplicate finding usually is simplest by just doing sorted copy and removing the duplicates (sequentiailly) from the ordered data.

If you would use genetic programming, you would make it to find a algorithm better than this.

## James_6

----------- Simplistic approach --------------

If the problem is comparing field by field to check for mismatch, then evolving the field by field ordering to maximize the short circuit mismatches would be accomplished via a genetic algorithm. The genome would be the field order. This is similar to a traveling salesman problem of generating an optimal order list.

The fitness would be the elapsed time for running against the database. This could take into account the expense of a comparing large text fields later versus earlier in the ordering. If 5 quick compare fields find the same frequency of mismatch as 1 slow campare field, then the 5 might be better placed first.

-------------------- Alternately -----------------

I know next to nothing about how duplicates are detected. Here are some thoughts.

It would seem using a hash of the records and sorting these to find duplicates would be viable. Optimizing the hashing algorithm to minimize duplicates could also be an evolved solution. The fitness being the duplicate count. The assumption is the hash is a stored value and not computed at duplicate detection time.

One could then uses multiple hashes, 2 or more, to avoid having the do a field by field compare. Ultimately a complete compare may be needed and an evolved field by field check would ultimately detect a duplicate.