Hi,
I would like some help with a problem I have.

I want to write a program that tokenizes the text of an input file and creates a new file containing all the words (one word per line).

The input file also contains numbers, HTML tags like <p id=#>, and roman numerals like I. II. III., and I would like to keep these out of the output file.


In my code I have implemented the FileReader and FileWriter.
I also know that I may have to use StringTokenizer, but I don't know how to continue . . . :-(

Could anyone help me?

public static void main(String arg[])
{
    new TestStreamTokenizer().testInOut(arg[0], arg[1]);
}


private void createReadWriteStreams(String inFName, String outFName)
{
    _fileReader = new FileReader(inFName);
    _fileWriter = new FileWriter(outFName);
    _printWriter = new PrintWriter(_fileWriter);
}

public void testInOut(String inFName, String outFName)
{
    createReadWriteStreams(inFName, outFName);
    StreamTokenizer tokenizer = new StreamTokenizer(_fileReader);
    tokenizer.eolIsSignificant(true);

    int nextTok = tokenizer.nextToken();

    while (StreamTokenizer.TT_EOF != nextTok)
    {
        // ........................
        // I don't know how can I do it???

    }
}

Thanks a lot


P.S. :
--------------------------
My input file is attached
----------------------------

All 5 Replies

Actually, you should use the split() method of String instead of StringTokenizer and use regular expressions to remove text that you do not wish to include. Split will split your string by whatever delimiter you specify and return the parts as a string array. Regular expressions will allow you to specify patterns to match the pieces you don't want to include. Sun has a tutorial on regular expressions here: http://java.sun.com/docs/books/tutorial/essential/regex/
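
As a minimal sketch of the split() idea (the class and method names here are made up for illustration), splitting on the regular expression "\\s+" treats any run of whitespace as a single delimiter:

```java
import java.util.Arrays;

public class SplitSketch {
    // Hypothetical helper: splits a line into words on runs of whitespace.
    static String[] words(String line) {
        // trim() avoids a leading empty element when the line starts with spaces
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // prints [the, quick, brown, fox]
        System.out.println(Arrays.toString(words("the quick  brown fox")));
    }
}
```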

Thanks a lot for your answer.

I would like to ask you something more about the split parameter.

How can I write a regular expression that deletes words like I. II. III. IV. .... and <P ID=#> <P>?

Do I have to call split many times, or can I do it differently?

Thanks a lot again!

How can I write a regular expression that deletes words like I. II. III. IV. .... and <P ID=#> <P>?

Well, you will have to work a little on the regular expressions to match your content. The expression "<P ID=\d+>" would match your "<P ID=#>" tags, if they are always of that form. "<P>" by itself will match "<P>", so there is not much to that one. The roman numerals will be a little trickier, since they are merely a sequence of certain capital letters followed by a period (in your example at least). You might get away with the pattern "[IVXLCDM]+\." for those, but there is a slight chance you might accidentally match some of your text (pretty unlikely, I would say, though).

Do I have to call split many times, or can I do it differently?

You can first use the regular expressions to strip things you do not want to capture. If you are reading a line at a time into a String variable, you can strip things out by calling replaceAll() with your regular expression and an empty string "" as the replacement. After stripping out the unwanted content, call split(" ") to split on spaces and get your array of words to write out to the file.

BufferedReader reader = new BufferedReader(new FileReader("foo.in"));
String inputString = reader.readLine();

// strip out the stuff you don't want
String cleanString = inputString.replaceAll("<P ID=\\d+>", "");
cleanString = cleanString.replaceAll("<P>", "");
cleanString = cleanString.replaceAll("[IVXLCDM]+\\.", "");

// get the remaining words into an array
String[] words = cleanString.split(" ");

// loop words and write them, then read next line, etc.

This is just one way that might work for you. I would imagine someone who does a lot of file parsing with regular expressions could present a more efficient way, but this might give you a start.
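
For reference, here is a rough sketch of how the whole loop might fit together. The class name WordPerLine and the helper extractWords are invented for this example, and it makes two small changes to the snippet above: it replaces the unwanted pieces with a space rather than an empty string, so that neighbouring words do not get fused together, and it splits on "\\s+" and skips empty strings so that blank entries are not written out.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class WordPerLine {
    // Hypothetical helper: strips the tags and roman-numeral markers from one
    // line, then returns the words that remain.
    static List<String> extractWords(String line) {
        String clean = line.replaceAll("<P ID=\\d+>", " ")
                           .replaceAll("<P>", " ")
                           .replaceAll("[IVXLCDM]+\\.", " ");
        List<String> words = new ArrayList<>();
        for (String w : clean.split("\\s+")) {
            if (!w.isEmpty()) {   // split can yield an empty first element
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java WordPerLine <input file> <output file>");
            return;
        }
        // try-with-resources closes both files even if an exception is thrown
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]));
             PrintWriter writer = new PrintWriter(new FileWriter(args[1]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : extractWords(line)) {
                    writer.println(word);   // one word per line
                }
            }
        }
    }
}
```

Keep in mind that the roman-numeral pattern will also eat things like a sentence that ends in "I." or the "V." of "TV.", so check the output against your real input file.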

Thanks a lot, Ezzaral, for your help.
The program works almost perfectly.

The only problem I have is that I can't replace the '.' character.
I read that the full stop is a special character, so I have to call the function like this:

cleanString.replaceAll("\\.", " ");

But when I use it I have a problem with the roman numerals (they are printed in the output file).

Any ideas?

And to close the thread, I would like to ask whether I could combine everything into a single expression.
For example, replaceAll("\\d" "\"" "\\?" ":"," ")

Is there something like that?

Thanks a lot!!!

I promise that I won't ask again!

You should be able to just strip the roman numerals first and then remove the remaining "." occurrences.
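
A small sketch of why the order matters (the class and method names are made up for the example): once the periods are gone, "[IVXLCDM]+\\." has nothing left to match, so the numerals survive. Also note that replaceAll() returns a new String and leaves the original untouched, so the result has to be assigned or chained.

```java
public class StripOrder {
    // Numerals first: "II." is removed while its period is still there.
    static String numeralsThenPeriods(String s) {
        return s.replaceAll("[IVXLCDM]+\\.", "").replaceAll("\\.", " ");
    }

    // Periods first: "II." becomes "II ", which the numeral pattern no longer matches.
    static String periodsThenNumerals(String s) {
        return s.replaceAll("\\.", " ").replaceAll("[IVXLCDM]+\\.", "");
    }

    public static void main(String[] args) {
        String line = "II. End of line.";
        System.out.println("[" + numeralsThenPeriods(line) + "]");
        System.out.println("[" + periodsThenNumerals(line) + "]");
    }
}
```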

On your other question about combining: yes, you can combine some of them, but not all. If you put characters inside [ ] brackets, it becomes an OR comparison, so "[\\d\"\\?:]" would strip all of those characters. Don't combine it with the others though, which need to match a specific sequence. If you put those expressions between the brackets, it will strip any of those characters (such as P) even if the whole sequence does not match.
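
To illustrate the difference, here is a short sketch (names invented for the example). The character class strips each listed character wherever it appears, while the "<P>" sequence only matches when all three characters occur together; inside the brackets the ? does not actually need escaping, so it is written plainly here.

```java
public class CharClassSketch {
    // Replaces any single digit, double quote, question mark, or colon with a space.
    static String stripChars(String s) {
        return s.replaceAll("[\\d\"?:]", " ");
    }

    // Matches only the exact three-character sequence <P>.
    static String stripTag(String s) {
        return s.replaceAll("<P>", "");
    }

    // Wrong for tags: removes every '<', 'P', and '>' individually.
    static String stripTagChars(String s) {
        return s.replaceAll("[<P>]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripChars("a1b\"c?d:e"));   // prints a b c d e
        System.out.println(stripTag("<P>Paul"));        // prints Paul
        System.out.println(stripTagChars("<P>Paul"));   // prints aul
    }
}
```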
