Find the most frequently used words across multiple files

Question

Anamika15 0 Newbie Poster

10 Years Ago

Program is to find the most frequently used words across all the input files, where each word must appear at least once in each file.
How can this be achieved? I am new to java and yet to dig things deep. So trying through such programs.

java

Edited 10 Years Ago by Anamika15

7 Contributors
18 Replies
760 Views
5 Days Discussion Span
Latest Post 10 Years Ago Latest Post by JamesCherrill

All 18 Replies

Slavi 94 Master Poster

10 Years Ago

What do you have so far?

JamesCherrill 4,733 Most Valuable Poster

10 Years Ago

the algorithm requires reading each file twice

?

Read first file and count words in a master map<word, count>.
for every other file:
   read file & count words in a temp map
   for each entry in master map:
      if temp has the same word, add its count into master
      else delete it from master
// master now has total count for words that appear at least once in every file

Taywin 312 Posting Virtuoso

10 Years Ago

Sorry, each file content twice.

No, that algorithm is not correct. For example, a word 'blah' found in the first iteration, so it is added. The second file doesn't have it, so it is removed. Then the third file has it, so it is added again. Therefore, it will count words wrong.

Or if you said the master map contains all words already, then you have to read all files once to build the master map data, which is what I said -- read file [content] twice. ;)

The requirement needs to simutaneous look at all files. In other words, all words info must be stored in the memory to counted and compared. My algorithm read them through once and collect info when each word is read. It may not be much faster if there are not many files and each file is small, but it should show much difference when the problem scales. Of course, you trade speed with space in this case.

Edited 10 Years Ago by Taywin

Taywin 312 Posting Virtuoso

10 Years Ago

Why do you think anything has to be read twice unless you mean to say that reading the hashmap created out of the file is the same as "reading the data twice".

Yes, you could say that too :) It is just less data after the whole data is processed once. :)

Taywin 312 Posting Virtuoso

10 Years Ago

Ah ok, got your algorithm. Sorry. :)

Edited 10 Years Ago by Taywin

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

JamesCherrill 4,733 Most Valuable Poster Team Colleague Featured Poster · Answer 1 · 2014-07-31T07:28:10+00:00

There are lots of people here who will freely give their time to help you become the best Java programmer you can be. There's nobody here who is interested in helping you cheat or doing your homework for you.

DaniWeb Member Rules (which you agreed to when you signed up) include:
"Do provide evidence of having done some work yourself if posting questions from school or work assignments"
http://www.daniweb.com/community/rules

Post what you have done so far and someone will help you from there.

jwenting 1,905 duckman Team Colleague · Answer 2 · 2014-07-31T10:20:44+00:00

load file
count occurance of each word
repeat for all files
win

JamesCherrill 4,733 Most Valuable Poster Team Colleague Featured Poster · Answer 3 · 2014-07-31T11:53:21+00:00

JamesCherrill 4,733 Most Valuable Poster

10 Years Ago

Using Java 8, or an inferior older version of Java?

Anamika15 0 Newbie Poster · Answer 4 · 2014-08-01T21:06:12+00:00

I forgot to post the work I did. Apologies for that.

a)Create a HashMap<String,Integer> which would store each word and its occurrences across all files.
b) Fill this map with the occurrences per word in first file.
c) Initiate threads (1 per file) which will read all the contents of the file first to another map and then loop over the HashMap to identify which word is not present. Whichever is not present, we remove from the HashMap thereby ensuring that only the ones that exist in all files are retained.

The problem I am facing is to get the final map after all the threads are executed. Assume I have to print the output to the cmd prompt. Then I have to get the final map and print them. As of now I have put the print stmt in the run method itself. So its printing the content everytime a thread executes. But I want final map and print only once. So how to get the final map after all the threads are executed?

I know to achieve this my traditional method like load each file,count occurrence of each word, check for the word and do necessary operations. But this isn't efficient right? So thought of multithreading.

Please let me know if any better approach can be used to achieve the solution for the problem.

iamthwee · Answer 5 · 2014-08-01T22:42:04+00:00

Multithreading isn't necessary, you could probably do it without multi-threading. Is there a requirement for multithreading?

JamesCherrill 4,733 Most Valuable Poster Team Colleague Featured Poster · Answer 6 · 2014-08-02T07:27:54+00:00

Using threads for parallel processing of a lot of files sounds perfectly sensible to me.

Have a look at Thread's join() method - you can use this to allow your initial thread to wait until another thread has finished before it prints the final result.

However, join() doesn't scale in a tidy way to lots of threads, so you may prefer:
Keep a counter of the number of threads you have started. When each thread has finished it decrements the counter and calls notifyAll().
In your main thread use a while(counter > 0) wait(); loop to wait until all the threads have finished.
(Use an AtomicInteger for the counter to avoid concurrency problems with updating the counter)

However, starting n threads for n files may not scale sensibly for many many files, in which case you could use a ThreadPoolExecutor to process the files using a limited number of threads. You can create a pool easily with Executors.newFixedThreadPool(int) When all the files have been scheduled for processing you can use the awaitTermination method to wait for them all to be finished.

So there you have 3 possible approaches, in increasing order of learning curve, but in my opinion the last one is by far the best, and well worth the investment of effort.

ps Your main HashMap will be accessed from multiple concurrent threads, so you need to think about thread safety. A ConcurrentHashMap may make that easier.

JamesCherrill 4,733 Most Valuable Poster Team Colleague Featured Poster · Answer 7 · 2014-08-03T15:21:59+00:00

Afterthought:
Although the ThreadPoolExcutor approach, with all its options, is daunting, in its simplest form it's really simple. Here's a tiny runnable demo:

First, a Runnable that is the class of the tasks you want to perform in multiple threads. This demo has tasks that wait "n" seconds, then prints a message and terminate.

   class Task implements Runnable {

      int waitSecs;

      Task(int waitSecs) {
         this.waitSecs = waitSecs;
      }

      @Override
      public void run() {
         try {
            Thread.sleep(waitSecs*1000);
         } catch (InterruptedException ex) {
         }
         System.out.println("task " + waitSecs + " done");
      }

   }

Now the real stuff - create thread pool executor, add some tasks to it, tell it "that's all", and wait for them all to finish:

      System.out.println("Starting");
      ExecutorService x = Executors.newFixedThreadPool(10); // max 10 threads

      x.execute(new Task(3));
      x.execute(new Task(2));
      x.execute(new Task(1));

      x.shutdown();
      try {
         x.awaitTermination(1, TimeUnit.MINUTES);
      } catch (InterruptedException ex) {
         ex.printStackTrace();
      }
      System.out.println("all done");

it really is that simple, and now gets my 100% recommendation for the way to go.
J

Taywin 312 Posting Virtuoso · Answer 8 · 2014-08-04T15:40:30+00:00

If your problem requires that each word in the result must occur at least once in each file, I doubt that your Hash<String, Integer> is good enough to meet the requirement. The reason is that you can only collect what words and how often the word occurs, but you cannot verify that the word occurs at least once in each read file.

What you may need to add to the stucture is an array of file names you have looked through. So the HashMap will need String as key (for words) and a class (Hash<String, YourCustomizedClass>). The class contains an array of file names that have the word and the total count of word found. When a word is found, add the file name to the list if not exists, and increment the total number of word found. When search for the most occurence word in the result, pick only words that the size of file name array is equal to the total number of files read.

I think that should be enough...

JamesCherrill 4,733 Most Valuable Poster Team Colleague Featured Poster · Answer 9 · 2014-08-04T15:48:42+00:00

JamesCherrill 4,733 Most Valuable Poster

10 Years Ago

I think Anamika's original algorithm is a lot simpler (see step c).

Edited 10 Years Ago by JamesCherrill

Taywin 312 Posting Virtuoso · Answer 10 · 2014-08-04T19:16:58+00:00

c) Initiate threads (1 per file) which will read all the contents of the file first to another map and then loop over the HashMap to identify which word is not present. Whichever is not present, we remove from the HashMap thereby ensuring that only the ones that exist in all files are retained.

Yes, it is easier but the algorithm requires reading each file twice. :) Though, the explanation of the step didn't say that (or there is no way to compared the counted word).

~s.o.s~ 2,560 Failure as a human Team Colleague Featured Poster · Answer 11 · 2014-08-05T11:33:47+00:00

This is a typical map-reduce problem. You map the files you read and create intermediate data structures (dicts in this case). And after all the files have been read, you "reduce" the intermediate data structures to create a final data structure. Why do you think anything has to be read twice unless you mean to say that reading the hashmap created out of the file is the same as "reading the data twice". :)

JamesCherrill 4,733 Most Valuable Poster Team Colleague Featured Poster · Answer 12 · 2014-08-05T12:07:34+00:00

a word 'blah' found in the first iteration, so it is added. The second file doesn't have it, so it is removed. Then the third file has it, so it is added again.

No, that's not the algorithm I posted. You never add any words to the master except when reading the very first file. After that, once a word is removed it stays removed. Ie: As soon as you find one file without that word you know it can't meet the criteria, so you delete it and that's that.

The requirement needs to simutaneous look at all files

No, just pick and one as the "master" and run the rest against that one at a time. There's no need ever to build a map of all the words in all the files, nor to run every file against every other file.

JamesCherrill 4,733 Most Valuable Poster Team Colleague Featured Poster · Answer 13 · 2014-08-05T17:10:26+00:00

JamesCherrill 4,733 Most Valuable Poster

10 Years Ago

Not mine, Anamika's. Credit where credit is due...
J

Find the most frequently used words across multiple files

Recommended Answers Collapse Answers

All 18 Replies

Recommended Answers