I have a collection of (potentially) several hundred thousand image files that I need to generate hash digests for, and I'm unsure of the best algorithm to use. I'll be keying them all to a database based on their hash, so I need the best algorithm for avoiding accidental collisions (where different files end up generating the same hash).

CRC32 is out of the question because I know its collision resistance is fairly low, so I'm thinking either MD5 or SHA1, but I don't know which one is better for my purposes or whether there's an even better algorithm. Most of what I've found after searching recommends SHA1 as being 'more secure', but in my case I'm only concerned with accidental collisions rather than intentional malicious ones. Would SHA1 still be the preferable choice for my purposes?


All of these algorithms (MD5, SHA-1, SHA-256, ...) are good at avoiding accidental collisions. If running time is an issue for you, MD5 is faster than SHA-1 by a factor of, say, 2, and a similar ratio exists between SHA-1 and SHA-256. That ratio is based on reading one website and on running openssl speed md5 and openssl speed sha1 on my computer.

So use MD5 for speed, SHA-1 for extra paranoia (it's what Git uses for this purpose, after all), and SHA-256 for an algorithm for which no collisions have been found yet, period :-)
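If it helps, here is a minimal sketch of what the digesting itself could look like in Python, assuming the standard hashlib module; the function name and chunk size are only illustrative. Reading the file in chunks keeps memory use flat even for large images, and switching between md5, sha1, and sha256 is a one-word change:

    import hashlib

    def file_digest(path, algorithm="sha1", chunk_size=65536):
        """Compute the hex digest of a file, reading it in chunks so
        that even very large images never have to fit in memory."""
        h = hashlib.new(algorithm)   # accepts "md5", "sha1", "sha256", ...
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()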


No hashing algorithm will guarantee you that there won't be collisions.
All you can do to reduce the occurrence of collisions is to increase the length of the hash key (and of the hash itself) and/or increase the amount of data used to calculate the hash.

Of course either will dramatically increase the amount of time needed to calculate the hash, which may or may not be acceptable.
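To put numbers on "the length of the hash", the usual choices differ quite a bit; a quick check with Python's hashlib (assuming that environment) shows the digest sizes in bits:

    import hashlib

    # Digest length in bits for each algorithm; a longer digest leaves far
    # more room before accidental collisions become even remotely plausible.
    for name in ("md5", "sha1", "sha256"):
        print(name, hashlib.new(name).digest_size * 8)   # 128, 160, 256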

So the question becomes: how unique do you want your hash to be and why?

Well, like I said, I'm generating digests for potentially several hundred thousand image files of varying formats, and using the digests to detect and prevent duplicates. I think SHA1 will work; SHA256 seems like overkill.

Of course if I do encounter more collisions than I expected, it would be easy enough to switch over to SHA256 and just redigest all the files.
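For what it's worth, the duplicate-detection side might look something like the sketch below. It reuses the hypothetical file_digest helper from the earlier post and simply keys a dictionary on the digest; the function name and the directory walking are illustrative, not a finished tool:

    import os

    def find_duplicates(root, algorithm="sha1"):
        """Map each digest to the files that produced it; any digest with
        more than one path is an (almost certain) duplicate."""
        seen = {}
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                digest = file_digest(path, algorithm)   # chunked-hash sketch above
                seen.setdefault(digest, []).append(path)
        return {d: paths for d, paths in seen.items() if len(paths) > 1}

Switching to SHA256 later would just mean calling it with algorithm="sha256" and re-running over the files.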

I think SHA1 will work; SHA256 seems like overkill.

Of course if I do encounter more collisions than I expected,

Ha ha ha ha ha. If you unexpectedly encounter a single collision while hashing different images, I will pay you $1000000.

you can't use hashes to detect duplicates, because as I said they're never guaranteed to be different for different data.

Yes you can.
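Both positions can be reconciled: a hash match is, for all practical purposes, proof of a duplicate, and anyone still worried about the theoretical collision can confirm a match with a byte-for-byte comparison, which only ever has to run on the rare hash matches. A minimal sketch using Python's standard filecmp module (the function name is just illustrative):

    import filecmp

    def is_true_duplicate(path_a, path_b):
        """Called only when two files already share a digest; a full
        byte-for-byte comparison removes even the theoretical doubt."""
        return filecmp.cmp(path_a, path_b, shallow=False)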
