Is Java the wrong language for a duplicate file scanner?

caps_lock
Light Poster
#1
Nov 14th, 2008
So I've downloaded NetBeans and had a play around.

But I'm wondering whether Java is the wrong language for designing a duplicate file scanner.

In case it isn't obvious, the program should at least identify any files that exist on a computer more than once (copies).

I'm also going to try to make it available online, a bit like online virus-scanning websites.

I've got the website up in HTML, but I'm willing to scrap that if the duplicate file scanner can't be added to the website code as a plugin / object etc.

Is Java the wrong language? Is NetBeans not the best IDE? Should I consider AJAX?

BestJewSinceJC
Posting Virtuoso
#2
Nov 14th, 2008
Personally I would write this in C rather than Java. Partly that's because Java is object oriented and the problem you have posed isn't (well, sort of), but mostly it's because C will probably be much faster than Java for this project.

Also, this topic has been posted on DaniWeb before, so you might want to try to find the thread. I'm not sure how you plan to implement it, but in any case, be sure to consider something: does the file have to match exactly in order to be considered a match? Or will it be based on what % of the file matches? In either case, be sure to stop scanning the file as soon as it's determined that it doesn't match. You might want to consider using Tries. See Trie based cheat checker.
Last edited by BestJewSinceJC; Nov 14th, 2008 at 9:24 pm.
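
A minimal sketch of the "stop as soon as it doesn't match" idea for the exact-match case. The class and method names (`ExactMatch`, `sameContent`) are made up for illustration, and it assumes Java 7 or later for `java.nio.file`:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExactMatch {
    // Returns true only if both files hold exactly the same bytes.
    static boolean sameContent(Path a, Path b) throws IOException {
        // Cheapest check first: files of different sizes can never match.
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        try (InputStream inA = new BufferedInputStream(Files.newInputStream(a));
             InputStream inB = new BufferedInputStream(Files.newInputStream(b))) {
            int byteA;
            while ((byteA = inA.read()) != -1) {
                // Stop scanning at the first byte that differs.
                if (byteA != inB.read()) {
                    return false;
                }
            }
            // A has ended; B must have ended too for a full match.
            return inB.read() == -1;
        }
    }
}
```

Comparing sizes first means most non-duplicates are rejected without reading any file content at all.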

Alex Edwards
Posting Shark
#3
Nov 20th, 2008
If there is already a class available that can search through all of the files on your system, why not make a Decorator class, or a class that maps files (or file names) to a count? If a file is encountered again, replace its value with the incremented count, then return the map.

-Alex
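
One possible reading of the map idea above, as a rough sketch. It assumes the "class that can search through all of the files" is `Files.walk` from Java 8, and it keys on the file name only; `NameCounter` and `countNames` are hypothetical names:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

final class NameCounter {
    // Maps each file name found under root to the number of times it was seen.
    static Map<String, Integer> countNames(Path root) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile)
                 // First sighting stores 1; later sightings add 1 to the stored value.
                 .forEach(p -> counts.merge(p.getFileName().toString(), 1, Integer::sum));
        }
        return counts;
    }
}
```

Any entry with a count above 1 is a candidate duplicate. Two files with the same name can still differ in content, so the key could just as well be a checksum of the content.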

caps_lock
Light Poster
#4
Jan 4th, 2009
Thanks for the replies, guys,

but...

I don't want to use C. I'm not bothered about the speed, loading time, or scan duration of the intended application,

and I find the post before mine a bit confusing.

If anyone here has information, or a link to information, that will help me incorporate an MD5 checksum into a GUI, please let me know.

Nothing complicated, because I want to be able to understand how it and the specific code work!

caps_lock
Light Poster
#5
Jan 4th, 2009
Also, the application is standalone (well, running in a Java Runtime Environment).

sillyboy
Practically a Master Poster
#6
Jan 4th, 2009
I am not sure you will find a specific article relating MD5 to a Java GUI, but it isn't really any different from using MD5 in a command-line environment. If you can get your checksums generating, just attach that code to a GUI, by events or whatever your requirement may be.
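
A minimal sketch of the "get your checksums generating" step, with no GUI involved. It uses the standard `java.security.MessageDigest` class; the names `Md5Checksum` and `md5Hex` are made up for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Checksum {
    // Reads the file in chunks and returns its MD5 digest as 32 lowercase hex digits.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                // Feed only the bytes actually read into the digest.
                md.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

In a Swing GUI, a button's `ActionListener` could call `md5Hex` and show the result in a label. For large files, running it off the event thread (for example with `SwingWorker`) would keep the window responsive.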

neilcoffey
Junior Poster in Training
#7
Jan 5th, 2009
I can't see a problem with writing this in Java. For the checksums, you could look at the MessageDigest class, or just use some other fairly strong hash function. MD5 isn't necessarily the best choice: it's designed to be a secure hash function, and trades performance for that security. IMO, what you need for duplicate recognition is just a "fairly strong" hash function: a good-quality 64-bit hash function should be ample. The advantage of MD5 is simply that you get it "out of the box". You could also try using a large buffer for your I/O and doing it in a separate thread to the hash calculation-- it may give a gain on some systems (multiprocessor and uniprocessor).

A very sliiight downside to writing this program in Java is that Java provides no way to signal to the operating system to read files without going via the file cache-- so you get slightly slower one-off reads than are in principle possible on most OSs.
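
As a rough illustration of a non-cryptographic 64-bit hash read through a large buffer, here is a sketch using FNV-1a. FNV-1a is a stand-in chosen for brevity, not something named in this thread, and whether it counts as "good quality" for a given file set is an assumption. The 1 MiB buffer size is a guess as well:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class Fnv1a64 {
    // Standard 64-bit FNV-1a constants.
    private static final long OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long PRIME = 0x100000001b3L;

    // Returns the 64-bit FNV-1a hash of the file's content.
    static long hashFile(Path file) throws IOException {
        long hash = OFFSET_BASIS;
        byte[] buffer = new byte[1 << 20]; // 1 MiB read buffer (arbitrary choice)
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                for (int i = 0; i < read; i++) {
                    hash ^= (buffer[i] & 0xff); // mix in one unsigned byte
                    hash *= PRIME;              // multiplication wraps modulo 2^64
                }
            }
        }
        return hash;
    }
}
```

Files whose hashes differ cannot be duplicates; files whose hashes match can then be confirmed with a byte-by-byte comparison.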

~s.o.s~
Super Moderator, Featured Poster, Failure as a human
#8
Jan 8th, 2009
> it's designed to be a secure hash function, and trades
> performance for that security

Maybe you got that the other way around, since a lot of vulnerabilities or staged attacks have been found with MD5, making it an unsafe choice for security applications.

> a good-quality 64-bit hash function should be ample

Doesn't seem like a wise choice given that the MD5 algorithm, which uses 128 bits, has been compromised.

IMO, since many sites still use the MD5 checksum to check the integrity of downloaded files, the situation with MD5 isn't as bad as it seems.

> in Java is that Java provides no way to signal to the operating
> system to read files without going via the file cache

File cache?
I don't accept change; I don't deserve to live.

Jo Tujhe Jagaaye, Nindein Teri Udaaye Khwaab Hai Sachcha Wahi.
Nindon Mein Jo Aaye Jise To Bhul Jaaye Khawab Woh Sachcha Nahi.
Khwaab Ko Raag De, Nind Ko Aag De

neilcoffey
Junior Poster in Training
#9
Jan 8th, 2009
Originally Posted by ~s.o.s~
> > it's designed to be a secure hash function, and trades
> > performance for that security
>
> Maybe you got that the other way around, since a lot of vulnerabilities or staged attacks have been found with MD5, making it an unsafe choice for security applications.

Well, that's also true, but as a design principle, MD5 (and other "secure" hash functions) trades performance for security.

Originally Posted by ~s.o.s~
> > a good-quality 64-bit hash function should be ample
>
> Doesn't seem like a wise choice given that the MD5 algorithm, which uses 128 bits, has been compromised.

As a secure hash function, it would be totally useless. But so what? What the poster wants to do is find duplicate files on a machine. Unless they're trying to guard against the attack whereby somebody deliberately plants a file on the machine to fool the file scanner, they don't actually need a secure hash function-- they just need a reasonable quality one. If they are trying to protect against that attack, then they need to weigh up the (at the moment minimal) security risk of MD5 vs other more secure-- but potentially more expensive and memory-consuming-- hashes.
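
A back-of-the-envelope estimate of why 64 bits can be enough when nobody is deliberately crafting collisions. It assumes the hash behaves like a uniformly random function; the symbols $n$ (number of files) and $b$ (hash width in bits) and the figure of one million files are illustrative choices, not numbers from this thread:

```latex
% Birthday-bound approximation for at least one accidental collision
P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{b}} \approx \frac{n^{2}}{2^{b+1}}

% Example: b = 64 bits, n = 10^{6} files
P(\text{collision}) \approx \frac{(10^{6})^{2}}{2^{65}}
                    = \frac{10^{12}}{3.69 \times 10^{19}}
                    \approx 2.7 \times 10^{-8}
```

Under those assumptions an accidental collision among a million files is very unlikely, and a byte-by-byte check of any matching pair removes even that risk.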

Originally Posted by ~s.o.s~
> > in Java is that Java provides no way to signal to the operating
> > system to read files without going via the file cache
>
> File cache?

Yep-- the OS will generally automatically cache data read from file. This is generally useful-- on average there's a reasonable chance you'll read enough of that data again to make it worthwhile. But in a few minority cases such as the file scanner (other examples might include, say, processing a database transaction log), you know in advance that you only read the data once, so it will overall have a negative impact on performance for the OS to cache that data. Generally, OSs provide a means to say "please read this data without caching". But Java does not expose that facility.

~s.o.s~
Super Moderator, Featured Poster, Failure as a human
#10
Jan 8th, 2009
Re MD5: I guess we are talking about two different things here: security and file uniqueness, hence the confusion. I would personally use MD5 hashing since it seems to be a widely used technique for testing file uniqueness and optimize if and only if required. Also, coming up with a good hash solution [if that is what you were suggesting to the OP] is far from a walk in the park, though it seems to be a good exercise in learning more about hash functions.

> Yep-- the OS will generally automatically cache data read from file.

If you are talking about kernel file caching which results from reliance on system calls to do I/O, memory mapping the file solves that issue.
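
A sketch of what memory mapping a file for hashing might look like in Java, using `FileChannel.map` and `MessageDigest.update(ByteBuffer)`. It only shows the mapping mechanics; how much it changes the caching behaviour discussed above depends on the OS and is not something this sketch demonstrates. `MappedMd5` and `md5` are hypothetical names:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class MappedMd5 {
    // Computes the MD5 digest of a file by mapping it into memory region by region.
    static byte[] md5(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            long chunk = Integer.MAX_VALUE; // a single mapping cannot be larger than this
            for (long pos = 0; pos < size; pos += chunk) {
                long len = Math.min(chunk, size - pos);
                MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                md.update(region); // consumes the mapped bytes directly
            }
        }
        return md.digest();
    }
}
```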

©2003 - 2009 DaniWeb® LLC