function that tokenizes the text from a file

Question

sushant5252 0 Newbie Poster

13 Years Ago

Hi, I am a beginner in Java and I need help completing this program...

java


All 9 Replies

javaAddict 900 Nearly a Senior Poster

13 Years Ago

Try to read line by line. Use the Scanner methods hasNextLine and nextLine, and save each line into a String.
Once you have the String, use the method indexOf(String) to find where "<text>" and "</text>" occur. Then use the substring method.
The indexOf and substring methods are documented in the java.lang.String API.

Remember:
<text>aaaa</text>
012345678910

The indexOf method will return 0 when you search for "<text>" and 10 when you search for "</text>", so in order to get "aaaa" you will need to call substring(0+6, 10),
where 0 and 10 are the values that indexOf returns and 6 is the length of "<text>".
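The arithmetic above can be checked with a minimal sketch. The class name TextTagDemo and the method name extract are made up for illustration, and the code assumes the line contains both tags with the opening tag first.

```java
class TextTagDemo {
    // Returns the characters between "<text>" and "</text>" in the line.
    // Assumption: both tags are present and the opening tag comes first.
    static String extract(String line) {
        int open = line.indexOf("<text>");   // 0 for "<text>aaaa</text>"
        int close = line.indexOf("</text>"); // 10 for the same input
        // 6 is the length of "<text>", so the text starts at open + 6
        return line.substring(open + 6, close);
    }

    public static void main(String[] args) {
        System.out.println(extract("<text>aaaa</text>")); // prints aaaa
    }
}
```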

NormR1 563 Posting Sage

13 Years Ago

substring(0+6, 10)
or
substring(0+"<text>".length(), 10)

jon.kiparsky 326 Posting Virtuoso

13 Years Ago

Of course, you don't know where the tag starts or how long the text is, so

s.substring(s.indexOf("<text>") + "<text>".length(), s.indexOf("</text>"));

would be closer to it. If that's difficult to read, take it apart piece by piece:

"<text>" is a String, so it has the String methods, so "<text>".length() returns 6. Why do we use the longer form? Because it makes it clear what it is we're measuring, and because you're likely to want to generalize in the future, so "<text>" might become a String variable called, say, tag, and you'd have tag.length(), which would still work.

s.indexOf(arg) returns the initial index of the String arg within the String s. In this case, arg is our friend "<text>". Again, if you generalized it, you might find that you used tag in place of the explicit String.
The second argument is just indexOf() again, which you know. So this works out to
"get me the String composed of the characters starting just after the first instance of <text> in my string, right up to the first instance of </text>".

(By the way, this is not a great parsing method, though it works for simple input when nobody's trying to break it. See if you can come up with a few ways to break it.)
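One way the generalization might look is sketched below. The helper name between and the choice to pass the bare tag name ("text") rather than the full "<text>" string are assumptions made for this example, not something from the posts above. The second call shows one input that breaks the approach: the closing tag comes before the opening one.

```java
class TagBetween {
    // Hypothetical helper: the text between the first <tag> and the first </tag>.
    // No validation: it assumes both tags exist and appear in that order.
    static String between(String s, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        return s.substring(s.indexOf(open) + open.length(), s.indexOf(close));
    }

    public static void main(String[] args) {
        System.out.println(between("blah <text>Here is the text</text> blah", "text"));

        // Breaking input: the end index (0) is smaller than the start index,
        // so substring throws.
        try {
            between("</text>oops<text>", "text");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("broken input: " + e.getMessage());
        }
    }
}
```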


yasuodancez -3 Newbie Poster · Answer 1 · 2010-09-02T09:34:03+00:00

What have you got so far for your code?
You have the fundamentals of what you need to do, so now you can search the Java API for it. Searching the API for tokenizing will lead you to StringTokenizer. However, you could instead read the file with a BufferedReader that wraps a FileReader, which in turn takes the file as its argument. Then you can read each line of the file by calling the readLine() method on the BufferedReader, and split each line into words by invoking the split() method on the String.
The split method takes a delimiter as its argument and returns an array of the pieces of the String separated by that delimiter.
Then you can iterate through the words and do what you please with them.

I hope that all made sense to you. If not, I can elaborate.
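The BufferedReader approach described above could be sketched roughly as follows. The class and method names (FileTokenizer, tokenize, splitLine) are invented for this example, the delimiter is assumed to be whitespace, and the temporary file in main exists only so the sketch runs on its own.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

class FileTokenizer {
    // Splits one line on runs of whitespace, skipping empty pieces.
    static List<String> splitLine(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Reads the file line by line and collects the words of every line.
    static List<String> tokenize(File file) throws IOException {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = reader.readLine()) != null) {
                words.addAll(splitLine(line));
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Temporary sample file, used only to make the sketch self-contained.
        File sample = File.createTempFile("sample", ".txt");
        sample.deleteOnExit();
        Files.write(sample.toPath(), "hello world\nfrom a file\n".getBytes());
        System.out.println(tokenize(sample)); // prints [hello, world, from, a, file]
    }
}
```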

sushant5252 0 Newbie Poster · Answer 2 · 2010-09-02T10:11:37+00:00

I'm using a Scanner to read the text from a file. But if my input file is an HTML page and I have to read the contents within the tag <text></text>, how do I do it? This is what I've done so far:

package test;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class Main {

  // Prints every whitespace-delimited token in the given file, one per line.
  private static void readFile(String fileName) {
    // try-with-resources closes the scanner even if reading fails
    try (Scanner scanner = new Scanner(new File(fileName))) {
      while (scanner.hasNext()) {
        System.out.println(scanner.next());
      }
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    }
  }

  public static void main(String[] args) {
    if (args.length != 1) {
      System.err.println("usage: java test.Main <file location>");
      System.exit(1); // non-zero exit status signals the usage error
    }
    readFile(args[0]);
  }
}

javaAddict 900 Nearly a Senior Poster Team Colleague Featured Poster · Answer 3 · 2010-09-02T20:02:12+00:00

Another good thing would be to check whether the line you are trying to parse has that tag.
First call the indexOf method. If it returns -1, the line doesn't have the tag <text>, so continue with the next line.

And in case the line has more than one tag:
Line = <text>aa</text><text>bbb</text>
You can put that in a loop and look for the next occurrence of the tag each time. There is a version of the method that also takes an int argument indicating where to start searching (indexOf("string", int)).

But better leave that for last and handle simple cases first.
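For when you do get to the multi-tag case, the loop with indexOf(String, int) might look something like this sketch. The names MultiTagDemo and extractAll are made up for illustration, and the loop simply stops at the first tag that has no matching closing tag.

```java
import java.util.ArrayList;
import java.util.List;

class MultiTagDemo {
    // Collects the text of every <text>...</text> pair on one line.
    static List<String> extractAll(String line) {
        final String open = "<text>";
        final String close = "</text>";
        List<String> found = new ArrayList<>();
        int from = 0; // where the next search starts
        while (true) {
            int start = line.indexOf(open, from);
            if (start == -1) {
                break; // no more opening tags
            }
            int end = line.indexOf(close, start + open.length());
            if (end == -1) {
                break; // opening tag without a closing tag
            }
            found.add(line.substring(start + open.length(), end));
            from = end + close.length(); // continue after this pair
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extractAll("<text>aa</text><text>bbb</text>")); // prints [aa, bbb]
    }
}
```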

jon.kiparsky 326 Posting Virtuoso · Answer 4 · 2010-09-02T20:20:18+00:00

Checking the indexOf() value is probably a good idea. If you don't, then on any line that has neither tag you'll be asking for the substring from (-1 + 6 = 5) to (-1), which throws a StringIndexOutOfBoundsException.

I wouldn't worry too much about the looping, though. If you're really trying for a robust parsing method, you want to get into stacks and regular expressions. If you're not, you have to decide just what cases you want to be able to handle.

My suggestion is to get the substring stuff running for the simplest case, something like:

blah blah <text> Here is the text to return </text> blah blah

and then determine what else you need to handle.
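A minimal sketch of that simplest case, with the indexOf() checks in place, is shown below. The names SafeExtract and extractOrNull, and the choice to return null when a tag is missing, are assumptions for this example only.

```java
class SafeExtract {
    // Returns the text between <text> and </text>, or null when the line
    // does not contain a complete pair.
    static String extractOrNull(String line) {
        int open = line.indexOf("<text>");
        if (open == -1) {
            return null; // no opening tag on this line
        }
        int start = open + "<text>".length();
        int close = line.indexOf("</text>", start);
        if (close == -1) {
            return null; // opening tag was never closed
        }
        return line.substring(start, close);
    }

    public static void main(String[] args) {
        String line = "blah blah <text> Here is the text to return </text> blah blah";
        System.out.println(extractOrNull(line));
        System.out.println(extractOrNull("a line with no tags")); // prints null
    }
}
```

The returned text keeps the spaces that sit inside the tags; call trim() on the result if they are not wanted.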

sushant5252 0 Newbie Poster · Answer 5 · 2010-09-09T09:41:15+00:00

Thank you, one and all. :-) I completed it. I'm overwhelmed by the response.

jon.kiparsky 326 Posting Virtuoso · Answer 6 · 2010-09-09T10:17:46+00:00


Glad to help. Hope it was fun to write.
