
tokenization of file input

Hi,
I would like some help with a problem I have.

I want to write a program that tokenizes the text of an input file and creates a new file containing all the words (one word per line).

The input file contains numbers, HTML tags like <P>, and roman numerals like I. II. III., and I would like none of these to appear in the output file.


In my code I have implemented the FileReader and FileWriter.
I also know that I may have to use StringTokenizer, but I don't know how to continue . . . :-(

Could anyone help me?

public static void main(String[] arg) throws IOException
{
    new TestStreamTokenizer().testInOut(arg[0], arg[1]);
}


private void createReadWriteStreams(String inFName, String outFName) throws IOException
{
    _fileReader = new FileReader(inFName);
    _fileWriter = new FileWriter(outFName);
    _printWriter = new PrintWriter(_fileWriter);
}

public void testInOut(String inFName, String outFName) throws IOException
{
    createReadWriteStreams(inFName, outFName);
    StreamTokenizer tokenizer = new StreamTokenizer(_fileReader);
    tokenizer.eolIsSignificant(true);

    int nextTok = tokenizer.nextToken();

    while (StreamTokenizer.TT_EOF != nextTok)
    {
        // ........................
        // I don't know how to do this part???

        nextTok = tokenizer.nextToken(); // advance, otherwise the loop never ends
    }
}


Thanks a lot


P.S. :
--------------------------
My input file is attached
----------------------------

Attachments input.txt (183.67KB)
katerinaaa
Newbie Poster
11 posts since May 2007
Reputation Points: 10
Solved Threads: 0
 

Actually, you should use the split() method of String instead of StringTokenizer and use regular expressions to remove text that you do not wish to include. Split will split your string by whatever delimiter you specify and return the parts as a string array. Regular expressions will allow you to specify patterns to match the pieces you don't want to include. Sun has a tutorial on regular expressions here: http://java.sun.com/docs/books/tutorial/essential/regex/
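As a rough illustration of the split() suggestion, here is a minimal sketch. The class name SplitSketch, the helper splitWords, and the "\\s+" delimiter (any run of whitespace) are assumptions made for the example, not something from the original post.

```java
import java.util.Arrays;

public class SplitSketch {
    // Splits a line into words on runs of whitespace.
    // The "\\s+" delimiter is an assumption; any regular expression can be used.
    static String[] splitWords(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] words = splitWords("one  two\tthree");
        System.out.println(Arrays.toString(words));
    }
}
```

Note that splitting an empty string this way still returns an array with one empty element, so a caller would probably want to skip empty words.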

Ezzaral
Posting Genius
Moderator
15,985 posts since May 2007
Reputation Points: 3,250
Solved Threads: 847
 

Thanks a lot for your answer.

I would like to ask you something more about the split parameter.

How can I make a regular expression that deletes words like I. II. III. IV. .... and <P>?

Do I have to call split many times, or can I do it differently?

Thanks a lot again!

katerinaaa
 

How can I make a regular expression that deletes words like I. II. III. IV. .... and <P>?


Well, you will have to work a little on the regular expressions to match your content. The expression "<P ID=\d+>" would match your <P> tags that carry a numeric ID, if they are always of that form. "<P>" by itself will match "<P>", so there is not much to that one. The roman numerals will be a little trickier, since they are merely a sequence of certain capital letters followed by a period (in your example at least). You might get away with the pattern "[IVXLCDM]+\." for those, but there is a slight chance you might accidentally match some of your text (pretty unlikely, I would say, though).

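To make the risk of accidental matches concrete, here is a small sketch comparing the roman numeral pattern with a variant that adds a word boundary (\b) in front. The class name, the constants LOOSE and ANCHORED, and the sample text are assumptions for illustration only.

```java
public class RomanNumeralSketch {
    // Pattern as suggested in the thread: roman-numeral letters followed by a period.
    static final String LOOSE = "[IVXLCDM]+\\.";
    // Assumed variant: only match when the letters start at a word boundary.
    static final String ANCHORED = "\\b[IVXLCDM]+\\.";

    static String strip(String line, String regex) {
        return line.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String line = "III. THE WILD.";
        // The loose pattern also eats the tail of "WILD."
        System.out.println("[" + strip(line, LOOSE) + "]");
        // The anchored pattern leaves "WILD." alone
        System.out.println("[" + strip(line, ANCHORED) + "]");
    }
}
```

Even the anchored variant would still remove a sentence-ending "I." or an all-caps word such as "MIX.", so whether it is good enough depends on the input text.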
Do I have to call split many times, or can I do it differently?


You can first use the regular expressions to strip the things you do not want to capture. If you are reading a line at a time into a string variable, you can strip things out by calling replaceAll() with your regular expression and an empty string "" as the replacement. After stripping out the unwanted content, call split(" ") to split on spaces and get your array of words to write out to the file.

BufferedReader reader = new BufferedReader(new FileReader("foo.in"));
String inputString = reader.readLine();

// strip out the stuff you don't want
String cleanString = inputString.replaceAll("<P ID=\\d+>", "");
cleanString = cleanString.replaceAll("<P>", "");
cleanString = cleanString.replaceAll("[IVXLCDM]+\\.", "");

// get the remaining words into an array
String[] words = cleanString.split(" ");

// loop words and write them, then read next line, etc.



This is just one way that might work for you. I would imagine someone who does a lot of file parsing with regular expressions could present a more efficient way, but this might give you a start.
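For reference, the steps above could be put together into a complete loop roughly as follows. This is only a sketch: the class and method names are made up, it splits on "\\s+" rather than a single space so that repeated spaces do not produce empty words, and it works on a Reader and Writer so it can be tried without files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class WordPerLineSketch {
    // Reads text line by line, strips the unwanted patterns, writes one word per line.
    static void tokenize(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        PrintWriter writer = new PrintWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            String clean = line.replaceAll("<P ID=\\d+>", "")
                               .replaceAll("<P>", "")
                               .replaceAll("[IVXLCDM]+\\.", "");
            for (String word : clean.trim().split("\\s+")) {
                if (!word.isEmpty()) {   // a blank line yields one empty word
                    writer.println(word);
                }
            }
        }
        writer.flush();
    }

    // Hypothetical convenience wrapper for trying the method on an in-memory string.
    static String tokenizeToString(String text) {
        StringWriter out = new StringWriter();
        try {
            tokenize(new StringReader(text), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(tokenizeToString("<P ID=12> II. Hello world\n<P>\nsecond line"));
    }
}
```

To use it on real files, one would pass a FileReader and a FileWriter to tokenize(), ideally inside a try-with-resources block so both are closed.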

Ezzaral
 

Thanks a lot Ezzaral for your help.
The program works nearly perfectly.

The only problem I have is that I can't replace the '.' character.
I read that the full stop is a special character, so I have to call the function like this:

cleanString.replaceAll("\\.", " ");

But when I use it, I have a problem with the roman numerals (they are printed in the output file).

Any idea ?

And to close the thread, I would like to ask whether I could use only one expression.
For example, replaceAll("\\d" "\"" "\\?" ":"," ")

Is there something like that ?

Thanks a lot!!!

I promise that I won't ask again!

katerinaaa
 

You should be able to just strip the roman numerals first and then remove the remaining "." occurrences.
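A small sketch of why the order matters, with made-up class and method names: if the periods are replaced first, the roman numeral pattern no longer finds the trailing "." it needs, so the numerals stay in the text. Also keep in mind that replaceAll() returns a new string, so its result has to be assigned to something, otherwise the change is lost.

```java
public class StripOrderSketch {
    // Roman numerals first, then the remaining periods.
    static String numeralsThenPeriods(String line) {
        String clean = line.replaceAll("[IVXLCDM]+\\.", "");
        return clean.replaceAll("\\.", " ");
    }

    // Periods first: the numeral pattern can no longer match, so "IV" survives.
    static String periodsThenNumerals(String line) {
        String clean = line.replaceAll("\\.", " ");
        return clean.replaceAll("[IVXLCDM]+\\.", "");
    }

    public static void main(String[] args) {
        String line = "IV. The end.";
        System.out.println("[" + numeralsThenPeriods(line) + "]");
        System.out.println("[" + periodsThenNumerals(line) + "]");
    }
}
```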

On your other question about combining: yes, you can combine some of them, but not all. If you put characters inside [ ] brackets, it becomes an OR comparison, so "[\\d\"\\?:]" would strip all of those characters. Don't combine it with the others though, which need to match a specific sequence. If you put those expressions between the brackets, it will strip any of those characters (such as P) even if the whole sequence does not match.
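Here is a short sketch of the difference between a character class and a sequence pattern. The class and method names and the sample strings are assumptions for the example. Inside the brackets, ? and : do not need escaping, and the double quote only needs a single backslash because of Java string syntax.

```java
public class CharClassSketch {
    // Replaces every digit, double quote, question mark and colon with a space.
    static String stripChars(String line) {
        return line.replaceAll("[\\d\"?:]", " ");
    }

    // Sequence pattern: removes the literal tag <P> only.
    static String stripTag(String line) {
        return line.replaceAll("<P>", "");
    }

    // Mistaken merge: as a character class this removes every '<', 'P' and '>'.
    static String stripTagAsClass(String line) {
        return line.replaceAll("[<P>]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripChars("a1b\"c?d:e"));
        System.out.println(stripTag("<P>Paris"));
        System.out.println(stripTagAsClass("<P>Paris"));
    }
}
```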

Ezzaral
 
