Hi,

Is it possible to compare files by hash code? If so, how do I do this? I am attempting to locate duplicates. Simply calling file.hashCode() generates a hash code, but when a duplicate file is located its hash code is different, depending on its name and location. Naturally you cannot have the exact same file name in the same directory. Is it a case of reading the files, converting their contents to a hash code, and then making the comparison?


many thanks.

With the hashCode() you compare objects. It's no different than using equals()

So if you use File.hashCode() or File.equals() you will compare two File objects. I don't know the exact rules these methods follow, but I presume they compare the paths of the files the objects refer to.
So by doing this:

File f1 = new File("fileName.txt");
File f2 = new File("fileName.txt");

Then equals() will return true.
But:

File f1 = new File("fileName1.txt");
File f2 = new File("fileName2.txt");

I think that equals() will NOT return true, and the hash codes will differ, even if the two files have the same content (but check it out anyway).
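To check this out quickly, here is a minimal sketch. The class and method names (FileEqualityDemo, samePath) are made up for illustration, and neither file needs to exist on disk, because only the path strings are compared.

```java
import java.io.File;

final class FileEqualityDemo {
    // True when the two File objects are equal according to File.equals(),
    // which looks at the abstract pathname and never opens the file.
    static boolean samePath(String a, String b) {
        return new File(a).equals(new File(b));
    }

    public static void main(String[] args) {
        // Same path string: the two File objects are equal.
        System.out.println(samePath("fileName.txt", "fileName.txt"));
        // Different path strings: not equal, whatever the files contain.
        System.out.println(samePath("fileName1.txt", "fileName2.txt"));
    }
}
```

Equal File objects also return the same hashCode(), so neither method tells you anything about what is inside the files.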


If you have two different files in the file system and want to see whether they have the same content, then try this:
Use FileInputStream to read their bytes into arrays and compare them.

hashCode() and equals() compare the File objects, not the actual files that these File objects refer to.

You need to actually read the files and compare them byte for byte until a difference is found (although I would check the sizes first, as that is fast, and if they are not the same size there is no reason to compare them).
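A minimal sketch of that approach, assuming Java 7 or later for java.nio.file. The names ContentCompare and sameContent are hypothetical, chosen only for this example.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class ContentCompare {
    // Returns true only when both files hold exactly the same bytes.
    static boolean sameContent(Path a, Path b) throws IOException {
        // Cheap check first: files of different sizes cannot be identical.
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        try (InputStream inA = new BufferedInputStream(Files.newInputStream(a));
             InputStream inB = new BufferedInputStream(Files.newInputStream(b))) {
            int byteA;
            while ((byteA = inA.read()) != -1) {
                // Stop at the first differing byte.
                if (byteA != inB.read()) {
                    return false;
                }
            }
            // Guard against the second file growing while we were reading.
            return inB.read() == -1;
        }
    }
}
```

On Java 12 and later, Files.mismatch(a, b) does much the same job and returns -1 when the contents match.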

AFAIK, the implementation of the hashCode() and equals() methods of the File class depends on the underlying file system. On Windows file systems, the hashCode() method of the File class is based on the hash code of the result of getPath() after converting it to lowercase. The equals() method compares the result of getPath(), ignoring case. Since path names on *nix systems are case sensitive, the implementation there presumably compares them as they are.

How this problem can be approached depends on what kind of comparison you are looking for. If you actually want to compare the file contents, you are stuck with a byte-by-byte comparison [as suggested by masijade] or with computing and comparing the MD5 checksums of both files.
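For the checksum route, a rough sketch using the standard java.security.MessageDigest class could look like this. Md5Checksum and md5Hex are illustrative names, not anything from the JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Checksum {
    // Reads the file in chunks and returns its MD5 digest as lowercase hex.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Two files with identical contents give the same string regardless of name or location. Matching digests make identical contents very likely rather than certain, so a byte-by-byte check on the matches is a reasonable last step.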

If Java is not a mandate, I am sure there are better platform specific ways of doing this.

"computing and comparing the md5 checksum" This would mean that any change in the file no matter how small would change the md5 checksum. I do not think this would do. The comparison needs to be so robust that even if duplicate files exist with slight changes, I would need for them to be picked up. I suppose the reading byte by byte may be the solution, if more than half the bytes are identical one may assume the file is the same. I suppose there is no really robust way of identifying duplicate files.


Do you know what the hashCode() method of the file class computes on a Mac?

Many thanks

> The comparison needs to be so robust that even if duplicate files exist with slight changes

You use the word `robust' in the wrong context here; the implementation would be robust if it picked up even the smallest of changes in the file. What you are looking for here is approximation-based comparison.

> Do you know what the hashCode() method of the file class computes on a Mac?

Most probably the hashCode of the return value of File#getPath XORed with a fixed constant. But that's beside the point, since the hashCode computation has nothing to do with the file contents.

> I suppose the reading byte by byte may be the solution, if more than half the bytes are
> identical one may assume the file is the same.

This really doesn't make any sense. What exactly are you trying to achieve here?

What I am attempting to do is identify duplicate files. When I say duplicate I mean both 100% identical and those that are not necessarily 100% identical e.g. one file may have a different lastModified() date and its contents may vary slightly but are essentially the same.

Some files may have different names but identical contents and, as such, are duplicates.

What I would like to know is by what means I can do this. file.getName() can find files of the same name, but these are not necessarily duplicates, e.g. license.txt files for recently installed applications come up as duplicates in my program when using file.getName() for comparison. Likewise, file.length() proves to be flawed, as many files are of the same size but are not duplicates.

I hope this makes it clearer.

Many thanks

> its contents may vary slightly but are essentially the same.

Define *essentially the same* in an unambiguous manner. To show how vague the above statement is, consider three files:

There is a lot of ambiguity in what you say.

There i a lot of ambiguity in what you say.

There is a lot of ambiguity in what you think.

Do you think they are *slightly* different? How would you go about actually proving it?

> license.txt files for recently installed applications come ups as duplicates on my program
> when using the file.getName() for comparison

If your aim is content comparison, checking the file names / path is kind of pointless since a directory can't have two files with exactly the same name.

> So CRC32, Adler32, or MD5 checksums will not work for you?

Given his definition of comparison they won't, since they aim for an exact match of contents.
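For the exact-match half of the problem, the checksum idea can be sketched like this, assuming the standard java.util.zip.CRC32 class. ChecksumGroups, crc32 and groupByChecksum are hypothetical names for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

final class ChecksumGroups {
    // Computes the CRC32 checksum of a file's contents.
    static long crc32(Path file) throws IOException {
        CRC32 crc = new CRC32();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                crc.update(buffer, 0, n);
            }
        }
        return crc.getValue();
    }

    // Groups files by checksum; any group with more than one entry
    // holds candidate duplicates, whatever their names or folders.
    static Map<Long, List<Path>> groupByChecksum(List<Path> files) throws IOException {
        Map<Long, List<Path>> groups = new HashMap<>();
        for (Path f : files) {
            groups.computeIfAbsent(crc32(f), k -> new ArrayList<>()).add(f);
        }
        return groups;
    }
}
```

CRC32 is only 32 bits wide, so unrelated files can occasionally share a checksum. Treat each group as a list of candidates and confirm with a byte-by-byte comparison. None of this helps with the "slightly different" files.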

OK.

Say I have File1.txt with contents being "ABC".
Then I have File2.txt with contents also being "ABC"; only the name is different. Both are in dir A.

Then I have File1.txt in another dir, say B, copied from the original File1.txt in dir A above, with the contents slightly adjusted to "ABCD". Now you have three files: two are identical save in name, while the third is very similar but was copied from the original and adjusted slightly.

This is common when people send you reports by e-mail. You download a report and they realise that they need to make some slight adjustments. They do so and send another, which you download again; this time they name it "final copy", allowing you to save it in the same dir. Now you have the old copy and the new copy. This is a subtle form of duplication, no? I am attempting to identify these approximately similar files as well as blatant duplicate files.

I hope that is clearer.

Many thanks

The thing you fail to realize here is that what seems to be a *slight adjustment* or *subtle* to you isn't that *subtle* to the program. Maybe what you are looking for is fuzzy comparison, which IMO doesn't make sense when comparing binary files; it would work pretty well for text files. You need to apply techniques like finding the distance between two string sequences; read this. Be warned though, the implementation for this definitely won't be a trivial one.
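As one example of such a distance, here is a sketch of the Levenshtein (edit) distance. Picking Levenshtein is my assumption about the kind of technique meant above, and the names EditDistance, levenshtein and similarity are made up for this example.

```java
final class EditDistance {
    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn s into t (classic dynamic programming,
    // keeping only two rows of the table in memory).
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of t takes j inserts
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // turning the first i chars of s into "" takes i deletes
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Normalised score in [0, 1]; 1.0 means the strings are identical.
    static double similarity(String s, String t) {
        int longest = Math.max(s.length(), t.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(s, t) / longest;
    }
}
```

With the "ABC" and "ABCD" files from the earlier post, levenshtein("ABC", "ABCD") is 1 and similarity is 0.75. You still have to pick a threshold for "essentially the same" yourself, and the table costs time proportional to the product of the two lengths, so it gets slow on large files.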

HTH,
sos.
