tokenization of file input
Hi,
I would like some help with a problem I have.
I want to write a program that tokenizes the text of an input file and creates a new file containing all the words (one word per line).
The input file contains numbers, HTML tags like <p id=#>, and Roman numerals like I. II. III., and I do not want any of these to appear in the output file.
In my code I have implemented the FileReader and FileWriter.
I also know that I may have to use StringTokenizer, but I don't know how to continue. :-(
Could anyone help me?
Thanks a lot
public static void main(String arg[]) {
    new TestStreamTokenizer().testInOut(arg[0], arg[1]);
}

private void createReadWriteStreams(String inFName, String outFName) {
    _fileReader = new FileReader(inFName);
    _fileWriter = new FileWriter(outFName);
    _printWriter = new PrintWriter(_fileWriter);
}

public void testInOut(String inFName, String outFName) {
    createReadWriteStreams(inFName, outFName);
    StreamTokenizer tokenizer = new StreamTokenizer(_fileReader);
    tokenizer.eolIsSignificant(true);
    int nextTok = tokenizer.nextToken();
    while (StreamTokenizer.TT_EOF != nextTok) {
        // ........................
        // I don't know how can I do it???
    }
}
P.S.: My input file is attached.
Actually, you should use the split() method of String instead of StringTokenizer and use regular expressions to remove text that you do not wish to include. Split will split your string by whatever delimiter you specify and return the parts as a string array. Regular expressions will allow you to specify patterns to match the pieces you don't want to include. Sun has a tutorial on regular expressions here: http://java.sun.com/docs/books/tutor...sential/regex/
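As a rough illustration of the split-plus-regex approach described above, here is a minimal sketch. The class name SplitSketch, the method name words, and the tag pattern are assumptions made for this example and do not come from the thread. It removes anything that looks like a tag and then splits the rest on whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name and the sample pattern are illustrative only.
class SplitSketch {

    // Replaces anything that looks like a tag with a space, then splits on whitespace.
    static List<String> words(String line) {
        // "<[^>]*>" matches '<', any run of characters other than '>', then '>'.
        String cleaned = line.replaceAll("<[^>]*>", " ");
        List<String> result = new ArrayList<>();
        // "\\s+" treats any run of spaces or tabs as a single delimiter.
        for (String part : cleaned.trim().split("\\s+")) {
            if (!part.isEmpty()) {   // split returns one empty string for an empty line
                result.add(part);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("<P ID=1> Hello  world <P>"));   // prints [Hello, world]
    }
}
```

Splitting on "\\s+" rather than a single space avoids empty array entries where a removed tag leaves two spaces in a row.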
Thanks a lot for your answer.
I would like to ask something more about the split parameter.
How can I write a regular expression that deletes tokens like I. II. III. IV. ... and tags like <P ID=#> and <P>?
Do I have to call split several times, or can I do it differently?
BufferedReader reader = new BufferedReader(new FileReader("foo.in"));
String inputString = reader.readLine();
// strip out the stuff you don't want
String cleanString = inputString.replaceAll("<P ID=\\d+>", "");
cleanString = cleanString.replaceAll("<P>", "");
// \\b anchors the numeral at the start of a word, so the "D." in "END." is left alone
cleanString = cleanString.replaceAll("\\b[IVXLCDM]+\\.", "");
// get the remaining words into an array (split on runs of whitespace)
String[] words = cleanString.trim().split("\\s+");
// loop words and write them, then read next line, etc.
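The closing comment in the snippet above ("loop words and write them, then read next line, etc.") could be filled in along the following lines. This is only a sketch under stated assumptions: the class name WordPerLine and the methods wordsIn and copyWords are hypothetical, the tag pattern assumes upper-case <P> tags exactly as written in the thread, and the \\b word boundary in the numeral pattern is an addition so that a capital letter at the end of a sentence-final word is not stripped.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; not from the thread.
class WordPerLine {

    // Cleans one line and returns the words that remain.
    static List<String> wordsIn(String line) {
        String clean = line
                .replaceAll("<P( ID=\\d+)?>", " ")     // <P> and <P ID=n> tags
                .replaceAll("\\b[IVXLCDM]+\\.", " ");  // numerals such as "IV."
        List<String> words = new ArrayList<>();
        for (String word : clean.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Reads every line from in and writes one word per line to out.
    static void copyWords(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        PrintWriter writer = new PrintWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            for (String word : wordsIn(line)) {
                writer.println(word);
            }
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java WordPerLine <input> <output>");
            return;
        }
        // try-with-resources closes both files even if reading fails
        try (Reader in = new FileReader(args[0]);
             Writer out = new FileWriter(args[1])) {
            copyWords(in, out);
        }
    }
}
```

A sentence that ends in a one-word capital such as "I." would still be treated as a numeral; telling those apart needs more context than a single regular expression provides.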
Thanks a lot, Ezzaral, for your help.
The program works nearly perfectly.
The only problem I have is that I can't replace the '.' character.
I read that the full stop is a special character, so I have to call the function like this:
cleanString.replaceAll("\\.", " ");
But when I use it, I have a problem with the Roman numerals (they are printed in the output file).
Any ideas?
And to close the thread, I would like to ask whether I could use only one expression.
For example, replaceAll("\\d" "\"" "\\?" ":"," ")
Is there something like that?
Thanks a lot!!!
I promise that I won't ask again!
You should be able to just strip the roman numerals first and then remove the remaining "." occurrences.
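The ordering matters because the numeral pattern relies on the trailing "." to recognise a numeral. A small sketch of both orders follows; the class and method names are hypothetical, and the \\b word boundary is an assumption added here so that only whole tokens match.

```java
// Hypothetical demo class; names are illustrative only.
class ReplaceOrder {

    // Numerals first: "II." is still intact when the numeral pattern runs.
    static String numeralsThenDots(String s) {
        return s.replaceAll("\\b[IVXLCDM]+\\.", " ")
                .replaceAll("\\.", " ");
    }

    // Dots first: the "." is already gone, so the numeral pattern finds nothing.
    static String dotsThenNumerals(String s) {
        return s.replaceAll("\\.", " ")
                .replaceAll("\\b[IVXLCDM]+\\.", " ");
    }

    public static void main(String[] args) {
        String line = "II. The end.";
        System.out.println("[" + numeralsThenDots(line).trim() + "]");  // prints [The end]
        System.out.println("[" + dotsThenNumerals(line).trim() + "]");  // "II" survives
    }
}
```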
On your other question about combining: yes, you can combine some of them, but not all. Putting characters inside [ ] brackets creates a character class that matches any one of them (an OR comparison), so "[\\d\"\\?:]" would strip all of those characters. Don't combine it with the other patterns, though, since those need to match a specific sequence. If you put those expressions between the brackets, replaceAll will strip each of their characters individually (such as P), even where the whole sequence does not match.
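To make the difference between a character class and a sequence concrete, here is a short sketch. The class and method names are made up for the example. One method strips single characters with a class, one strips the literal sequence <P>, and one shows what goes wrong when that sequence is placed inside brackets.

```java
// Hypothetical demo class; names are illustrative only.
class CharClassDemo {

    // Character class: removes every digit, double quote, '?' and ':'.
    static String stripChars(String s) {
        return s.replaceAll("[\\d\"?:]", "");
    }

    // Sequence: removes only the exact three-character text "<P>".
    static String stripTagSequence(String s) {
        return s.replaceAll("<P>", "");
    }

    // Mistake: inside brackets, '<', 'P' and '>' are matched one at a time.
    static String stripTagAsClass(String s) {
        return s.replaceAll("[<P>]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripChars("Who said: \"42\"?"));   // prints "Who said " (with a trailing space)
        System.out.println(stripTagSequence("<P>Paris"));      // prints Paris
        System.out.println(stripTagAsClass("<P>Paris"));       // prints aris
    }
}
```

Inside a character class the '?' needs no backslash, although escaping it as in "[\\d\"\\?:]" is also accepted.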