Suppose we have ten computers on a subnet, each of which is also connected to the internet.
Each PC holds a lot of documents, and users can also browse the internet and save web pages as documents.
How can we scan the subnet to find all duplicate documents and their locations?
A brute-force approach would be to compare document names and sizes, but the same content might be stored under a different name, and even the size can vary a little.
Has anyone come across this problem before? Is C++ the best language for solving it?

Since the documents are HTML pages, you would need to remove the HTML formatting tags and all whitespace characters (i.e. just get the plain text of the document). This is required so that differences in fonts, character sizes and colours, as well as differences in the encoding of newlines etc., do not influence your check. Now compute the MD5 checksum and the SHA256 checksum of that plain text. If two documents have identical checksums for both SHA256 and MD5, conclude that they are the same document.
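
The steps above could be sketched roughly as follows. This is only an illustration in Python (chosen because its standard library ships both an HTML parser and the two hash functions), not a definitive implementation. The names `fingerprint` and `find_duplicates`, and the idea of passing the documents in as a dictionary mapping a location string to the page's HTML, are assumptions made for the example. Skipping the contents of script and style elements is also an assumption, since they are not visible text. Collecting the files from each PC on the subnet is left out.

```python
import hashlib
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Assumption: script/style bodies are not part of the visible text.
        if self.skip_depth == 0:
            self.chunks.append(data)


def fingerprint(html):
    """Return (MD5, SHA256) hex digests of the tag-free, whitespace-free text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    plain = re.sub(r"\s+", "", "".join(parser.chunks))  # drop all whitespace
    data = plain.encode("utf-8")
    return hashlib.md5(data).hexdigest(), hashlib.sha256(data).hexdigest()


def find_duplicates(documents):
    """Group locations whose documents share both checksums.

    `documents` maps a location string to the HTML text stored there.
    """
    groups = defaultdict(list)
    for location, html in documents.items():
        groups[fingerprint(html)].append(location)
    return [sorted(locs) for locs in groups.values() if len(locs) > 1]


if __name__ == "__main__":
    # Hypothetical locations and contents, for illustration only.
    docs = {
        "pc1/saved/page.html": "<html><body><h1>Hello</h1>\n<p>World</p></body></html>",
        "pc2/docs/copy.htm": "<HTML><BODY><b>Hello</b>\r\n  <i>World</i></BODY></HTML>",
        "pc3/docs/other.html": "<p>Goodbye</p>",
    }
    for group in find_duplicates(docs):
        print("Duplicates:", ", ".join(group))
```

Grouping by checksum means each document is hashed once and never compared pairwise, so the work grows with the number of documents rather than with the number of pairs.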
