tokenization of file input
Hi,
I would like some help with a problem I have.
I want to write a program that tokenizes the text of an input file and creates a new file containing all the words (one word per line).
The input file also contains numbers, HTML tags like <p id=#>, and Roman numerals like I. II. III., and I would like to keep these out of the output file.
In my code I have implemented the FileReader and FileWriter.
I also know that I may have to use StringTokenizer, but I don't know how to continue . . . :-(
Could anyone help me?
Thanks a lot
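```java
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StreamTokenizer;

public class TestStreamTokenizer {
    private FileReader _fileReader;
    private FileWriter _fileWriter;
    private PrintWriter _printWriter;

    public static void main(String[] arg) throws IOException {
        new TestStreamTokenizer().testInOut(arg[0], arg[1]);
    }

    private void createReadWriteStreams(String inFName, String outFName) throws IOException {
        _fileReader = new FileReader(inFName);
        _fileWriter = new FileWriter(outFName);
        _printWriter = new PrintWriter(_fileWriter);
    }

    public void testInOut(String inFName, String outFName) throws IOException {
        createReadWriteStreams(inFName, outFName);
        StreamTokenizer tokenizer = new StreamTokenizer(_fileReader);
        tokenizer.eolIsSignificant(true);
        int nextTok = tokenizer.nextToken();
        while (StreamTokenizer.TT_EOF != nextTok) {
            // ........................
            // I don't know how to do this part.
            nextTok = tokenizer.nextToken(); // advance, otherwise the loop never ends
        }
        _printWriter.close();
        _fileReader.close();
    }
}
```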
P.S.: My input file is attached.
Actually, you should use the split() method of String instead of StringTokenizer and use regular expressions to remove text that you do not wish to include. Split will split your string by whatever delimiter you specify and return the parts as a string array. Regular expressions will allow you to specify patterns to match the pieces you don't want to include. Sun has a tutorial on regular expressions here: http://java.sun.com/docs/books/tutor...sential/regex/
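To make the split() suggestion concrete, here is a minimal sketch, assuming words are separated by whitespace. The class name `SplitSketch`, the method `words`, and the sample line are invented for illustration and are not taken from the attached input file.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    // Splits a line on runs of whitespace and drops any empty pieces.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        for (String piece : line.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample line.
        for (String word : words("  the quick   brown fox ")) {
            System.out.println(word);
        }
    }
}
```

Splitting on "\\s+" rather than a single space treats tabs and repeated spaces as one delimiter, which tends to avoid empty entries in the resulting array.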
Thanks a lot for your answer.
I would like to ask something more about the split parameter.
How can I write a regular expression that deletes tokens like I. II. III. IV. ... and <P ID=#> <P>?
Do I have to call split many times, or can I do it differently?
```java
BufferedReader reader = new BufferedReader(new FileReader("foo.in"));
String inputString = reader.readLine();

// strip out the stuff you don't want
String cleanString = inputString.replaceAll("<P ID=\\d+>", "");
cleanString = cleanString.replaceAll("<P>", "");
cleanString = cleanString.replaceAll("[IVXLCDM]+\\.", "");

// get the remaining words into an array
String[] words = cleanString.split(" ");

// loop words and write them, then read next line, etc.
```
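The snippet above handles a single line and leaves the loop as a comment. One possible way to fill in that loop is sketched below, using the same three replacements. The class name `WordPerLine` and the helpers `clean`, `wordsOf`, and `copyWords` are hypothetical, and the input and output file names are assumed to arrive as command-line arguments.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

class WordPerLine {
    // Applies the same three replacements as the snippet above.
    static String clean(String line) {
        String s = line.replaceAll("<P ID=\\d+>", "");
        s = s.replaceAll("<P>", "");
        return s.replaceAll("[IVXLCDM]+\\.", "");
    }

    // Returns the non-empty, whitespace-separated words of a cleaned line.
    static List<String> wordsOf(String line) {
        List<String> words = new ArrayList<>();
        for (String word : clean(line).split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Reads every line from 'in' and writes one word per line to 'out'.
    static void copyWords(BufferedReader in, PrintWriter out) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : wordsOf(line)) {
                out.println(word);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] = input file, args[1] = output file (assumed to be supplied by the caller)
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
            copyWords(in, out);
        }
    }
}
```

The try-with-resources block closes both files even if reading fails partway through. Note that the numeral pattern also removes any run of the capital letters I, V, X, L, C, D, M that is directly followed by a period, so a sentence ending in "I." would lose that word as well.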
Thanks a lot, Ezzaral, for your help.
The program works nearly perfectly.
The only problem I have is that I can't replace the character '.'.
I read that the full stop is a special character, so I have to call the function like this:
cleanString.replaceAll("\\.", " ");
But when I use it I have a problem with the Roman numerals (they are printed in the output file).
Any idea?
And to close the thread, I would like to ask whether I could use only one expression.
For example, replaceAll("\\d" "\"" "\\?" ":"," ")
Is there something like that?
Thanks a lot!!!
I promise that I won't ask again!
You should be able to just strip the roman numerals first and then remove the remaining "." occurrences.
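A small sketch of why the order matters: the numeral pattern "[IVXLCDM]+\\." needs the trailing period to match, so if the periods are replaced first the numerals no longer match and stay in the text. The class and method names below are made up for the comparison. Also keep in mind that replaceAll returns a new String, so the result has to be assigned back (for example `cleanString = cleanString.replaceAll(...)`); calling it without using the return value changes nothing.

```java
class StripOrder {
    // Numerals first, then the remaining periods: both disappear.
    static String numeralsThenDots(String s) {
        String result = s.replaceAll("[IVXLCDM]+\\.", "");
        return result.replaceAll("\\.", " ");
    }

    // Periods first: the numeral pattern then finds nothing to match.
    static String dotsThenNumerals(String s) {
        String result = s.replaceAll("\\.", " ");
        return result.replaceAll("[IVXLCDM]+\\.", "");
    }

    public static void main(String[] args) {
        String sample = "III. The end."; // hypothetical sample text
        System.out.println("[" + numeralsThenDots(sample) + "]");
        System.out.println("[" + dotsThenNumerals(sample) + "]");
    }
}
```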
On your other question about combining: yes, you can combine some of them, but not all. Putting characters inside [ ] brackets creates a character class that matches any one of the listed characters, so "[\\d\"?:]" would strip all of those characters. Don't fold the other expressions into it, though, because they need to match a specific sequence. If you put those expressions between the brackets, it will strip any of those characters (such as P) even where the whole sequence does not match.
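The difference between a character class and a sequence can be seen in a short sketch. The class name `CharClassSketch`, its methods, and the sample strings are invented for illustration.

```java
class CharClassSketch {
    // Replaces any single digit, double quote, question mark, or colon with a space.
    static String stripChars(String s) {
        return s.replaceAll("[\\d\"?:]", " ");
    }

    // What NOT to do: inside brackets the tag text is no longer a sequence,
    // so every '<', 'P', ' ', 'I', 'D', '=', digit, '+', and '>' is removed on its own.
    static String tagInsideBrackets(String s) {
        return s.replaceAll("[<P ID=\\d+>]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripChars("a1b?c:d\"e"));
        System.out.println(tagInsideBrackets("<P ID=3>PAID Price"));
    }
}
```

In the second call the words after the tag are damaged too, because letters such as P, I, and D are stripped wherever they occur, which is the effect the reply warns about.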