6 Years
Discussion Span
Last Post by jon.kiparsky

What have you gotten so far for your code?
You have the fundamentals of what you need to do, so now you can search the Java API for it. Searching the API for tokenizing will lead you to StringTokenizer. However, you could instead read the file with a BufferedReader, which takes a FileReader as its argument, which in turn takes a File. Then you can read each line of the file by calling the readLine() method on the BufferedReader, and split each line into words by invoking the split() method on the String.
The split method takes a delimiter (a regular expression) as its argument and returns an array of the pieces of the string that the delimiter separates.
Then you can iterate through each word and do what you please with it.
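
As a rough sketch of the steps above: the class name WordReader, the method readWords and the choice of whitespace as the delimiter are my own assumptions, not anything fixed by the question.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here only for illustration.
class WordReader {

    // Reads every line from the source and splits it on runs of whitespace.
    static List<String> readWords(Reader source) {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // split() takes a regular expression; "\\s+" matches one or more whitespace characters
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) { // a blank line splits into one empty string
                        words.add(word);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WordReader <file>");
            return;
        }
        // args[0] is assumed to be the path of a readable text file
        for (String word : readWords(new FileReader(args[0]))) {
            System.out.println(word);
        }
    }
}
```

Taking a Reader rather than a file name keeps the method easy to try out with a StringReader before pointing it at a real file.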

I hope that all made sense to you. If not, I can elaborate.

Edited by yasuodancez: n/a


I'm using a Scanner to read the text from a file. But if my input file is an HTML page and I have to read the contents within the <text></text> tag, how do I do it? This is what I've done so far:

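package test;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class Main {

  // Opens the file and walks through it one token at a time.
  private static void readFile(String fileName) {
    try {
      File file = new File(fileName);
      Scanner scanner = new Scanner(file);
      while (scanner.hasNext()) {
        // Loop body was missing from the post; printing each token is a placeholder.
        System.out.println(scanner.next());
      }
      scanner.close();
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    }
  }

  public static void main(String[] args) {
    if (args.length != 1) {
      System.err.println("usage: java TextScanner1 "
        + "file location");
      System.exit(1);
    }
    readFile(args[0]);
  }
}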

Edited by mike_2000_17: Fixed formatting


Try to read line by line. Use the Scanner methods hasNextLine and nextLine, and save each line into a String.
Once you have the String, use the method indexOf(String) and find where the "<text>" and "</text>" are found. Then use the substring method.
All the above methods can be found at the java.lang.String API.
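
Something along these lines, as a sketch only. It assumes every line contains exactly one <text>...</text> pair, and the names LineExtractor and extractAll are made up for the example. Note that the Scanner method that returns a line is nextLine().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical class name, used only for this illustration.
class LineExtractor {

    // Assumption: each line holds exactly one <text>...</text> pair.
    static List<String> extractAll(Scanner scanner) {
        List<String> results = new ArrayList<>();
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            // Skip past the opening tag itself to reach the first character of the content
            int start = line.indexOf("<text>") + "<text>".length();
            int end = line.indexOf("</text>");
            results.add(line.substring(start, end));
        }
        return results;
    }

    public static void main(String[] args) {
        // A Scanner over a String stands in for a Scanner over a File here
        Scanner scanner = new Scanner("<text>aaaa</text>\nxx <text>hello</text> yy");
        System.out.println(extractAll(scanner)); // prints [aaaa, hello]
    }
}
```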


For a line such as <text>aaaa</text>, the indexOf method will return 0 when you search for "<text>" and 10 when you search for "</text>", so in order to get the "aaaa", you will need to call substring(0+6, 10).
Here 0 and 10 are the values that the indexOf method returns, and 6 is the length of "<text>".


Of course, you don't know where the tag starts or how long the text is, so

s.substring(s.indexOf("<text>") + "<text>".length(), s.indexOf("</text>"));

would be closer to it. If that's difficult to read, take it apart piece by piece:

"<text>" is a String, so it has the String methods, so "<text>".length() returns 6. Why do we use the longer form? Because it makes it clear what it is we're measuring, and because you're likely to want to generalize in the future: "<text>" might become a String variable called, say, tag, and you'd have tag.length(), and it would still work.

s.indexOf(arg) returns the index at which the String arg first begins within the String s - in this case, arg is our friend "<text>". Again, if you generalized it, you might find that you used tag in place of the explicit String.
The second argument is just indexOf() again, which you know. So this works out to
"get me the String composed of the characters starting just after the first instance of "<text>" in my string, right up to the first instance of "</text>"".

(by the way, this is not a great parsing method, though it works for simple input, and when nobody's trying to break it - see if you can come up with a few ways to break it)
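
To make the generalization concrete, here is a minimal sketch in which the opening and closing tags are passed in as String variables. The names TagText and between are hypothetical, and it still assumes both tags are present in s.

```java
// Hypothetical helper, named only for this example.
class TagText {

    // Returns the characters between the first openTag and the first closeTag.
    // Assumption: both tags occur in s, with openTag first.
    static String between(String s, String openTag, String closeTag) {
        int start = s.indexOf(openTag) + openTag.length(); // just after the opening tag
        int end = s.indexOf(closeTag);                     // where the closing tag begins
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(between("blah <text>hi</text> blah", "<text>", "</text>")); // prints hi
    }
}
```

Because the length comes from openTag.length() rather than a hard-coded 6, the same method works unchanged for a tag such as "<b>".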


Another good thing would be to check if the line you are trying to parse has that tag.
First call the indexOf method. If it returns -1 then the line doesn't have that tag <text> , so continue with the next line.

And in case the line has more than one tag:
Line = <text>aa</text><text>bbb</text>
You can put that in a loop and search for the next occurrence of the tag each time. There is a version of indexOf that also takes an int argument indicating where you want to start searching (indexOf("string", int)).
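
A sketch of that loop, assuming you simply want every complete pair on the line collected into a list (MultiTag and allBetween are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, named only for this example.
class MultiTag {

    // Collects the text of every complete openTag...closeTag pair in the line.
    static List<String> allBetween(String line, String openTag, String closeTag) {
        List<String> found = new ArrayList<>();
        int from = 0; // index from which the next search starts
        while (true) {
            int open = line.indexOf(openTag, from);
            if (open == -1) {
                break; // no further opening tag
            }
            int start = open + openTag.length();
            int close = line.indexOf(closeTag, start);
            if (close == -1) {
                break; // opening tag without a closing tag
            }
            found.add(line.substring(start, close));
            from = close + closeTag.length(); // continue after this pair
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "<text>aa</text><text>bbb</text>";
        System.out.println(allBetween(line, "<text>", "</text>")); // prints [aa, bbb]
    }
}
```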

But better leave that for last and handle simple cases first.

Edited by javaAddict: n/a


Checking the indexOf() value is probably a good idea. If you don't, you'll be asking for the substring from (-1 + 6 = 5) to (-1) on any line that has neither tag, and substring will throw a StringIndexOutOfBoundsException.
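
For instance, a guarded version might look like this. It is only a sketch: returning null for "no tag on this line" is my own choice, and SafeExtract and textOrNull are hypothetical names.

```java
// Hypothetical helper, named only for this example.
class SafeExtract {

    // Returns the text between <text> and </text>, or null if the line lacks a complete pair.
    static String textOrNull(String line) {
        int open = line.indexOf("<text>");
        if (open == -1) {
            return null; // no opening tag, so skip this line
        }
        int start = open + "<text>".length();
        // Search for the closing tag only after the opening one
        int close = line.indexOf("</text>", start);
        if (close == -1) {
            return null; // opening tag without a closing tag
        }
        return line.substring(start, close);
    }

    public static void main(String[] args) {
        System.out.println(textOrNull("just a plain line"));       // prints null
        System.out.println(textOrNull("a <text>found</text> b"));  // prints found
    }
}
```

The caller can then test the result for null and move on to the next line instead of catching an exception.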

I wouldn't worry too much about the looping, though. If you're really trying for a robust parsing method, you want to get into stacks and regular expressions. If you're not, you have to decide just what cases you want to be able to handle.
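
If you do want to try the regular expression route, a minimal sketch with java.util.regex could look like the following. It is still not a real HTML parser (nested or malformed tags will confuse it), and the names RegexExtract and findAll are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named only for this example.
class RegexExtract {

    // The reluctant (.*?) stops at the first closing tag;
    // DOTALL lets the content run across line breaks.
    private static final Pattern TEXT_TAG =
        Pattern.compile("<text>(.*?)</text>", Pattern.DOTALL);

    // Returns the content of every <text>...</text> pair in the input.
    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = TEXT_TAG.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group(1)); // group 1 is the part between the tags
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("<text>aa</text><text>bbb</text>")); // prints [aa, bbb]
    }
}
```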

My suggestion is, get the substring stuff running for the simplest case, something like:

blah blah <text> Here is the text to return </text> blah blah

and then determine what else you need to handle.


Thank you, one and all... :-) I completed it. I'm overwhelmed by the response.
