> it's designed to be a secure hash function, and trades
> performance for that security
Maybe you got that the other way around, since a lot of vulnerabilities and staged attacks have been found with MD5, making it an unsafe choice for security applications.
Well, that's also true, but as a design principle, MD5 (and other "secure" hash functions) trades performance for security.
> a good-quality 64-bit hash function should be ample
Doesn't seem like a wise choice, given that the MD5 algorithm, which uses 128 bits, has been compromised.
As a secure hash function, it would be totally useless. But so what? What the poster wants to do is find duplicate files on a machine. Unless they're trying to guard against the attack whereby somebody deliberately plants a file on the machine to fool the file scanner, they don't actually need a secure hash function; they just need a reasonable-quality one. If they are trying to protect against that attack, then they need to weigh up the (at the moment minimal) security risk of MD5 against other hashes that are more secure but potentially more expensive and memory-consuming.
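To make the "reasonable quality, not secure" idea concrete, here is a minimal sketch of grouping files by a 64-bit non-cryptographic hash. It uses FNV-1a only because it fits in a few lines; that choice, and the names `DuplicateSketch`, `hashBytes`, `hashFile` and `groupByHash`, are my own assumptions for illustration, not something the poster specified. A stronger 64-bit hash could be dropped in the same way.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class: illustrative only, not the poster's code.
final class DuplicateSketch {
    // Standard 64-bit FNV-1a constants.
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    // Folds the first 'len' bytes of 'buf' into a running FNV-1a hash.
    static long update(long hash, byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            hash ^= (buf[i] & 0xff);
            hash *= FNV_PRIME; // overflow wraps modulo 2^64, as FNV requires
        }
        return hash;
    }

    // Hashes an in-memory byte array.
    static long hashBytes(byte[] data) {
        return update(FNV_OFFSET_BASIS, data, data.length);
    }

    // Streams a file once through a fixed-size buffer, so memory use
    // stays constant regardless of file size.
    static long hashFile(Path file) throws IOException {
        long hash = FNV_OFFSET_BASIS;
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                hash = update(hash, buffer, n);
            }
        }
        return hash;
    }

    // Groups files by hash; any group with more than one entry is a
    // candidate set of duplicates (still worth a byte-by-byte check).
    static Map<Long, List<Path>> groupByHash(List<Path> files) throws IOException {
        Map<Long, List<Path>> groups = new HashMap<>();
        for (Path file : files) {
            groups.computeIfAbsent(hashFile(file), k -> new ArrayList<>()).add(file);
        }
        return groups;
    }
}
```

Note that equal hashes only mark files as candidates. If an accidental collision matters to you, compare the candidate files byte by byte before treating them as duplicates.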
> in Java is that Java provides no way to signal to the operating
> system to read files without going via the file cache
File cache?
Yep, the OS will generally cache data read from a file automatically. This is usually helpful: on average there's a reasonable chance you'll read enough of that data again to make it worthwhile. But in a few minority cases such as the file scanner (other examples might include, say, processing a database transaction log), you know in advance that you'll only read the data once, so having the OS cache that data will have an overall negative impact on performance. Generally, operating systems provide a means to say "please read this data without caching". But Java does not expose that facility.