anyone know how genetic programming works on finding duplicate records in database?

Recommended Answers

All 4 Replies

How would you structure such an algorithm? A genetic algorithm won't be better than an algorithm specifically designed for solving your problem. Genetic algorithms are generally useful for problems where there isn't a known solution or that solution is not very efficient.

Remember that in your problem every item must be compared to every other item. At best an algorithm for n items will make a total of n + (n - 1) + (n - 2) + ... comparisons. A genetic algorithm can take thousands of generations to become useful. I'm not an expert, but I wouldn't recommend using genetic algorithms to look for duplicates.

Duplicate finding usually is simplest by just doing sorted copy and removing the duplicates (sequentiailly) from the ordered data.

If you would use genetic programming, you would make it to find a algorithm better than this.

----------- Simplistic approach --------------

If the problem is comparing field by field to check for mismatch, then evolving the field by field ordering to maximize the short circuit mismatches would be accomplished via a genetic algorithm. The genome would be the field order. This is similar to a traveling salesman problem of generating an optimal order list.

The fitness would be the elapsed time for running against the database. This could take into account the expense of a comparing large text fields later versus earlier in the ordering. If 5 quick compare fields find the same frequency of mismatch as 1 slow campare field, then the 5 might be better placed first.

-------------------- Alternately -----------------

I know next to nothing about how duplicates are detected. Here are some thoughts.

It would seem using a hash of the records and sorting these to find duplicates would be viable. Optimizing the hashing algorithm to minimize duplicates could also be an evolved solution. The fitness being the duplicate count. The assumption is the hash is a stored value and not computed at duplicate detection time.

One could then uses multiple hashes, 2 or more, to avoid having the do a field by field compare. Ultimately a complete compare may be needed and an evolved field by field check would ultimately detect a duplicate.

--------------------- Addtional thought -------------------

One could imagine expensive compare fields having an associated hash stored in the database. These hash field's postion in the field by field ordering would evolve toward the beginning of the order list as it would be more efficient at short-circuiting the compare.

If one is evolving a field by field ordered list for the compare then as mentioned this is very similar to the traveling salesperson problem. Visit each city once with the shortest distance.

There have been a number of representation approaches to the genome that ensure the next generation always represents a valid path. Crossover would typically create invalid offspring as some cities would be lost and others represented twice in the 2 children.

So looking at representation approaches used in TSP GA might be beneficial.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.