I have a collection of (potentially) several hundred thousand image files that I need to generate hash digests for, and I'm unsure of the best algorithm to use. I'll be keying them all to a database based on their hash, so I need the best algorithm for avoiding accidental collisions (where different files end up generating the same hash).

CRC32 is out of the question because I know its collision resistance is fairly low, so I'm thinking either MD5 or SHA1, but I don't know which one is better for my purposes or whether there's an even better algorithm. Most of what I've found after searching recommends SHA1 as being 'more secure', but in my case I'm only concerned with accidental collisions rather than intentional malicious ones. Would SHA1 still be the preferable choice for my purposes?


All of these algorithms (MD5, SHA-1, SHA-256, ...) are good at avoiding accidental collisions. If running time is an issue for you, MD5 is faster than SHA-1 by a factor of, say, 2, and a similar ratio exists between SHA-1 and SHA-256. That ratio is based on reading one website and on running openssl speed md5 and openssl speed sha1 on my computer.

So use MD5 for speed, SHA-1 for extra paranoia (it's what Git uses for this purpose, after all), and SHA-256 for an algorithm for which no collisions have been found yet, period :-)
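If it helps, here is a minimal sketch of what the digesting itself could look like in Python, assuming the standard hashlib module; the function name and chunk size are only illustrative. Reading the file in chunks keeps memory use flat even for large images, and switching between md5, sha1, and sha256 is a one-word change:

    import hashlib

    def file_digest(path, algorithm="sha1", chunk_size=65536):
        """Compute the hex digest of a file, reading it in chunks so
        that even very large images never have to fit in memory."""
        h = hashlib.new(algorithm)   # accepts "md5", "sha1", "sha256", ...
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()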


No hashing algorithm will guarantee you that there won't be collisions.
All you can do to reduce the occurrence of collisions is to increase the length of the hash key (and of the hash itself) and/or increase the amount of data used to calculate the hash.

Of course either will dramatically increase the amount of time needed to calculate the hash, which may or may not be acceptable.
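To put numbers on "the length of the hash", the usual choices differ quite a bit; a quick check with Python's hashlib (assuming that environment) shows the digest sizes in bits:

    import hashlib

    # Digest length in bits for each algorithm; a longer digest leaves far
    # more room before accidental collisions become even remotely plausible.
    for name in ("md5", "sha1", "sha256"):
        print(name, hashlib.new(name).digest_size * 8)   # 128, 160, 256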

So the question becomes: how unique do you want your hash to be and why?

Well, like I said, I'm generating digests for potentially several hundred thousand image files of varying formats, and using the digests to detect and prevent duplicates. I think SHA1 will work; SHA256 seems like overkill.

Of course if I do encounter more collisions than I expected, it would be easy enough to switch over to SHA256 and just redigest all the files.
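For what it's worth, the duplicate-detection side might look something like the sketch below. It reuses the hypothetical file_digest helper from the earlier post and simply keys a dictionary on the digest; the function name and the directory walking are illustrative, not a finished tool:

    import os

    def find_duplicates(root, algorithm="sha1"):
        """Map each digest to the files that produced it; any digest with
        more than one path is an (almost certain) duplicate."""
        seen = {}
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                digest = file_digest(path, algorithm)   # chunked-hash sketch above
                seen.setdefault(digest, []).append(path)
        return {d: paths for d, paths in seen.items() if len(paths) > 1}

Switching to SHA256 later would just mean calling it with algorithm="sha256" and re-running over the files.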

I think SHA1 will work; SHA256 seems like overkill.

Of course if I do encounter more collisions than I expected,

Ha ha ha ha ha. If you unexpectedly encounter a single collision while hashing different images, I will pay you $1000000.

you can't use hashes to detect duplicates, because as I said they're never guaranteed to be different for different data.

Yes you can.
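Both positions can be reconciled: a hash match is, for all practical purposes, proof of a duplicate, and anyone still worried about the theoretical collision can confirm a match with a byte-for-byte comparison, which only ever has to run on the rare hash matches. A minimal sketch using Python's standard filecmp module (the function name is just illustrative):

    import filecmp

    def is_true_duplicate(path_a, path_b):
        """Called only when two files already share a digest; a full
        byte-for-byte comparison removes even the theoretical doubt."""
        return filecmp.cmp(path_a, path_b, shallow=False)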
