Actually, you should use the split() method of String instead of StringTokenizer and use regular expressions to remove text that you do not wish to include. Split will split your string by whatever delimiter you specify and return the parts as a string array. Regular expressions will allow you to specify patterns to match the pieces you don't want to include. Sun has a tutorial on regular expressions here: http://java.sun.com/docs/books/tutorial/essential/regex/
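To make that concrete, here is a rough sketch of the two calls working together. The class name SplitDemo, the helper cleanAndSplit, and the sample text are made up for illustration; swap in whatever pattern fits your own file.

```java
import java.util.Arrays;

public class SplitDemo {
    // Removes every match of the regex, then splits what is left on whitespace.
    static String[] cleanAndSplit(String line, String unwantedRegex) {
        String cleaned = line.replaceAll(unwantedRegex, "");
        // trim() first so leading whitespace does not produce an empty first entry
        return cleaned.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // "\\d+" (one or more digits) stands in for whatever you want to drop.
        String[] words = cleanAndSplit("alpha 42 beta 7 gamma", "\\d+");
        System.out.println(Arrays.toString(words));
    }
}
```

Both replaceAll() and split() take a regular expression, so the same pattern syntax applies to each.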
Ezzaral
Posting Genius
15,985 posts since May 2007
Reputation Points: 3,250
Solved Threads: 847
Thanks a lot for your answer.
I would like to ask you something more about the split parameter.
How can I write a regular expression that deletes tokens like I. II. III. IV. ... and the <P> tags?
Well, you will have to work a little on the regular expressions to match your content. The expression "<P ID=\\d+>" would match your numbered "<P ID=...>" tags, if they are always of that form. "<P>" by itself will match "<P>", so not much to that one. The roman numerals will be a little trickier, since they are merely a sequence of certain capital letters followed by a period (in your example at least). You might get away with the pattern "[IVXLCDM]+\\." for those (the backslash is doubled because it sits inside a Java string literal), but there is a slight chance you might accidentally match some of your ordinary text (pretty unlikely, I would say, though).
Do I have to call split many times, or can I do it differently?
You can first use the regular expressions to strip the things you do not want to capture. If you are reading a line at a time into a String variable, you can strip things out by calling replaceAll() with your regular expression and an empty string "" as the replacement. After stripping out the unwanted content, call split(" ") to split on spaces and get your array of words to write out to file.
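// needs java.io.BufferedReader and java.io.FileReader; the enclosing method must handle IOException
BufferedReader reader = new BufferedReader(new FileReader("foo.in"));
String inputString;
while ((inputString = reader.readLine()) != null) {
    // strip out the stuff you don't want
    String cleanString = inputString.replaceAll("<P ID=\\d+>", "");
    cleanString = cleanString.replaceAll("<P>", "");
    cleanString = cleanString.replaceAll("[IVXLCDM]+\\.", "");
    // get the remaining words into an array; splitting on runs of whitespace
    // avoids the empty entries that split(" ") leaves where text was stripped
    String[] words = cleanString.trim().split("\\s+");
    // loop over words and write them out, then continue with the next line
}
reader.close();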
This is just one way that might work for you. I would imagine someone who does a lot of file parsing with regular expressions could present a more efficient way, but this might give you a start.
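For what it is worth, one possible tightening is to compile a single pattern once and reuse it for every line. The sketch below is only that: the class name LineCleaner, the helper wordsOf, and the sample line are invented for the example, and the alternation simply glues together the three expressions used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class LineCleaner {
    // One alternation covering the three patterns from the snippet above:
    // <P ID=digits>, <P>, and a run of roman-numeral letters followed by a period.
    private static final Pattern UNWANTED =
            Pattern.compile("<P(?: ID=\\d+)?>|[IVXLCDM]+\\.");

    // Strips the unwanted pieces from one line and returns the remaining words.
    static List<String> wordsOf(String line) {
        String cleaned = UNWANTED.matcher(line).replaceAll("");
        List<String> words = new ArrayList<>();
        for (String word : cleaned.trim().split("\\s+")) {
            if (!word.isEmpty()) {   // an empty line splits to a single empty string
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(wordsOf("<P ID=12> IV. The quick brown fox"));
    }
}
```

Calling replaceAll() on a String recompiles the expression each time, so a precompiled Pattern mainly pays off when the file has many lines.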
Ezzaral
You should be able to just strip the roman numerals first and then remove the remaining "." occurrences.
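A small sketch of that order, in case it helps. The class and method names and the sample line are made up, and the leading \b (word boundary) is my own addition to cut down on the accidental matches mentioned earlier; drop it if it does not suit your data.

```java
public class NumeralThenPeriodDemo {
    // Step 1: numerals first, while the period is still there to end the match.
    // The leading \b is an assumption added for this sketch: it keeps the
    // pattern from biting the tail off a word such as "FBI.".
    static String stripNumerals(String line) {
        return line.replaceAll("\\b[IVXLCDM]+\\.", "");
    }

    // Step 2: remove whatever periods are left; "\\." is a literal period.
    static String stripPeriods(String line) {
        return line.replaceAll("\\.", "");
    }

    public static void main(String[] args) {
        String line = "III. Ask the FBI. The end.";
        System.out.println(stripPeriods(stripNumerals(line)));
    }
}
```

Doing it the other way round would not work: once the periods are gone, the numeral pattern has no period left to match, so "III" stays in the text.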
On your other question about combining: yes, you can combine some of them, but not all. If you put characters inside [ ] brackets, the pattern matches any one of them (an OR comparison), so "[\\d\"\\?:]" would strip all of those characters: digits, double quotes, question marks, and colons. Don't combine it with the others though, which need to match a specific sequence. If you put those expressions between the brackets, it will strip any of their characters (such as P) even where the whole sequence does not appear.
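Here is a quick sketch of the difference, using made-up sample strings and a made-up class name (BracketDemo). The first method is the bracket expression, the second matches the tag as a sequence, and the third shows what goes wrong if the tag is moved inside the brackets.

```java
public class BracketDemo {
    // Character class: removes any single digit, double quote, question mark, or colon.
    // The quote takes one backslash (a Java escape); \\d and \\? are regex escapes.
    static String stripChars(String line) {
        return line.replaceAll("[\\d\"\\?:]", "");
    }

    // Sequence: removes only the exact three characters <P> in that order.
    static String stripTag(String line) {
        return line.replaceAll("<P>", "");
    }

    // The mistake: inside brackets, <, P and > are treated as separate characters.
    static String stripTagWrongly(String line) {
        return line.replaceAll("[<P>]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripChars("Who said \"hi\" at 10:30?"));
        System.out.println(stripTag("<P>Peter Piper"));
        System.out.println(stripTagWrongly("<P>Peter Piper"));
    }
}
```

The last call loses the P from both names, which is the problem with putting a sequence between the brackets.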
Ezzaral