Jun 6th, 2007

tokenization of file input

Hi,
I would like some help with a problem I have.

I want to write a program that tokenizes the text of an input file and creates a new file with all the words (one word per line).

The input file contains numbers, HTML tags like <p id=#>, and Roman numerals like I. II. III., and I would like to keep all of these out of the output file.

In my code I have implemented the FileReader and FileWriter.
I also know that I may have to use StringTokenizer, but I don't know how to continue . . . :-(

Could anyone help me?


    public static void main(String arg[])
    {
        new TestStreamTokenizer().testInOut(arg[0], arg[1]);
    }

    private void createReadWriteStreams(String inFName, String outFName)
    {
        _fileReader = new FileReader(inFName);
        _fileWriter = new FileWriter(outFName);
        _printWriter = new PrintWriter(_fileWriter);
    }

    public void testInOut(String inFName, String outFName)
    {
        createReadWriteStreams(inFName, outFName);
        StreamTokenizer tokenizer = new StreamTokenizer(_fileReader);
        tokenizer.eolIsSignificant(true);

        int nextTok = tokenizer.nextToken();

        while (StreamTokenizer.TT_EOF != nextTok)
        {
            // ........................
            // I don't know how can I do it???

        }
    }
Thanks a lot


P.S.: My input file is attached (input.txt, 183.7 KB).
Posted by katerinaaa
Jun 6th, 2007

Re: tokenization of file input

Actually, you should use the split() method of String instead of StringTokenizer, and use regular expressions to remove the text you do not want to include. split() breaks your string on whatever delimiter you specify and returns the parts as a String array. Regular expressions let you specify patterns that match the pieces you don't want to include. Sun has a tutorial on regular expressions here: http://java.sun.com/docs/books/tutor...sential/regex/
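
As a rough illustration of the split() suggestion, here is a minimal sketch. The class and method names (SplitDemo, words) are made up for this example, and splitting on the regular expression "\\s+" (any run of whitespace) is an assumption about what the delimiter should be.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits a line on runs of whitespace.
    // Returns an empty array for a blank line, since "".split(...) would
    // otherwise return a single empty string.
    static String[] words(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // prints [the, quick, brown, fox]
        System.out.println(Arrays.toString(words("  the quick   brown fox ")));
    }
}
```

Trimming first matters because split() keeps a leading empty string when the input starts with the delimiter.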
Posted by Ezzaral
Jun 6th, 2007

Re: tokenization of file input

Thanks a lot for your answer.

I would like to ask you something more about the split parameter.

How can I make a regular expression that deletes words like I. II. III. IV. .... and <P ID=#> <P>?

Do I have to call split many times, or can I do it differently?

Thanks a lot again!
Posted by katerinaaa
Jun 6th, 2007

Re: tokenization of file input

Quote originally posted by katerinaaa:
Thanks a lot for your answer.

I would like to ask you something more about the split parameter.

How can I make a regular expression that deletes words like I. II. III. IV. .... and <P ID=#> <P>?
Well, you will have to work a little on the regular expressions to match your content. The expression "<P ID=\d+>" would match your "<P ID=#>" tags, if they are always of that form. "<P>" by itself will match "<P>", so there is not much to that one. The Roman numerals are a little trickier, since they are merely a sequence of certain capital letters followed by a period (in your example at least). You might get away with the pattern "[IVXLCDM]+\." for those, but there is a slight chance you might accidentally match some of your text (pretty unlikely, I would say, though).
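
To make the risk of an accidental match concrete, here is a small sketch. The LOOSE pattern is the one from the post; the BOUNDED variant with a leading \b (word boundary) is an assumed refinement that is not from this thread, and the class name and sample strings are invented for illustration.

```java
import java.util.regex.Pattern;

class RomanPatternDemo {
    // Pattern suggested in the post above.
    static final Pattern LOOSE = Pattern.compile("[IVXLCDM]+\\.");
    // Assumed refinement (not from the thread): the numeral must start a word.
    static final Pattern BOUNDED = Pattern.compile("\\b[IVXLCDM]+\\.");

    // Removes every match of the given pattern from the text.
    static String strip(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // The numeral is removed: prints " Chapter three"
        System.out.println(strip(LOOSE, "III. Chapter three"));
        // False positive, "XI." is cut out of TAXI: prints "Call a TA"
        System.out.println(strip(LOOSE, "Call a TAXI."));
        // The word boundary avoids that case: prints "Call a TAXI."
        System.out.println(strip(BOUNDED, "Call a TAXI."));
    }
}
```

Even the bounded version would still remove a whole capitalized word made only of those letters at the end of a sentence (for example "DVD."), so it reduces the risk rather than eliminating it.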

Quote:
Do I have to call split many times, or can I do it differently?
You can first use the regular expressions to strip the things you do not want to capture. If you are reading a line at a time into a String variable, you can strip things out by calling replaceAll() with your regular expression and an empty string "" as the replacement. After stripping out the unwanted content, call split(" ") to split on spaces and get your array of words to write out to the file.

    BufferedReader reader = new BufferedReader(new FileReader("foo.in"));
    String inputString = reader.readLine();

    // strip out the stuff you don't want
    String cleanString = inputString.replaceAll("<P ID=\\d+>", "");
    cleanString = cleanString.replaceAll("<P>", "");
    cleanString = cleanString.replaceAll("[IVXLCDM]+\\.", "");

    // get the remaining words into an array
    String[] words = cleanString.split(" ");

    // loop words and write them, then read next line, etc.
This is just one way that might work for you. I would imagine someone who does a lot of file parsing with regular expressions could present a more efficient way, but this might give you a start.
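
For anyone who wants the whole loop spelled out, the following is one possible sketch, not the only way to do it. Several things here are assumptions rather than part of the thread: the class name WordPerLine, taking the input and output file names from the command line, replacing with a space instead of an empty string (so that neighboring words do not merge), splitting on "\\s+" instead of " ", and the try-with-resources syntax, which needs Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

class WordPerLine {
    // Strips the tags and numerals discussed above, then splits on whitespace.
    static List<String> wordsIn(String line) {
        String clean = line.replaceAll("<P ID=\\d+>", " ")
                           .replaceAll("<P>", " ")
                           .replaceAll("[IVXLCDM]+\\.", " ");
        List<String> words = new ArrayList<String>();
        for (String word : clean.trim().split("\\s+")) {
            if (!word.isEmpty()) {   // a blank line yields one empty string
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java WordPerLine <input file> <output file>");
            return;
        }
        // Both streams are closed automatically, even if an exception is thrown.
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]));
             PrintWriter writer = new PrintWriter(new FileWriter(args[1]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : wordsIn(line)) {
                    writer.println(word);   // one word per line
                }
            }
        }
    }
}
```

Keeping the cleaning step in its own method makes it easy to try the patterns on a few sample lines before running the program on the full input file.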
Posted by Ezzaral
Jun 7th, 2007

Re: tokenization of file input

Thanks a lot, Ezzaral, for your help.
The program works nearly perfectly.

The only problem I have is that I can't replace the '.' character.
I read that the full stop is a special character, so I have to call the function like this:

cleanString.replaceAll("\\.", " ");

But when I use it, I have a problem with the Roman numerals (they are printed in the output file).

Any ideas?

And to close the thread, I would like to ask whether I could use only one expression.
For example, replaceAll("\\d" "\"" "\\?" ":"," ")

Is there something like that?

Thanks a lot!!!

I promise that I won't ask again!
Posted by katerinaaa
Jun 7th, 2007

Re: tokenization of file input

You should be able to just strip the roman numerals first and then remove the remaining "." occurrences.
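
A short sketch of why the order matters, assuming the periods are replaced with a space as in the question. The class and method names are invented for this example. The numeral pattern needs the trailing period to match, so if the periods are removed first, the numerals no longer match and stay in the output.

```java
class StripOrderDemo {
    // Numerals first, then any remaining periods.
    static String numeralsThenDots(String text) {
        return text.replaceAll("[IVXLCDM]+\\.", " ").replaceAll("\\.", " ");
    }

    // Periods first: the numeral pattern can no longer find its trailing period.
    static String dotsThenNumerals(String text) {
        return text.replaceAll("\\.", " ").replaceAll("[IVXLCDM]+\\.", " ");
    }

    public static void main(String[] args) {
        String line = "II. The end.";
        // prints [The end]
        System.out.println("[" + numeralsThenDots(line).trim() + "]");
        // prints [II  The end], the numeral survives
        System.out.println("[" + dotsThenNumerals(line).trim() + "]");
    }
}
```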

On your other question about combining: yes, you can combine some of them, but not all. If you put characters inside [ ] brackets, it becomes an OR comparison over single characters, so "[\\d\"\\?:]" would strip all of those characters. Don't combine it with the others though, which need to match a specific sequence. If you add those expressions between the brackets, it will strip any of those characters (such as P) even if the whole sequence does not match.
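
Here is a minimal sketch of the difference between the two kinds of expression. The character class is the one described above. Joining the sequence patterns with | (regex alternation) is an addition that was not suggested in this thread, and the class and method names are made up for the example.

```java
class CombineDemo {
    // Character class: matches any ONE of digit, double quote, '?' or ':'.
    static String stripChars(String text) {
        return text.replaceAll("[\\d\"?:]", " ");
    }

    // Whole sequences can be joined with | (alternation), not with brackets.
    static String stripSequences(String text) {
        return text.replaceAll("<P ID=\\d+>|<P>|[IVXLCDM]+\\.", " ");
    }

    // What goes wrong when a sequence is put inside brackets:
    // '<', 'P' and '>' are each removed wherever they occur.
    static String stripWrong(String text) {
        return text.replaceAll("[<P>]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripChars("a1b?c:d\"e"));                 // prints a b c d e
        System.out.println(stripSequences("<P ID=3>IV. Paris").trim()); // prints Paris
        System.out.println(stripWrong("<P>Paris"));                   // prints aris
    }
}
```

The last call shows the problem mentioned above: the P of "Paris" is stripped even though it is not part of a tag.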
Posted by Ezzaral
