Fast searching through a html page using InputStreamReader

Question

Kriogenic 0 Newbie Poster

15 Years Ago

Hey everyone,
I seem to be in a spot of trouble.
I am using the following code to read an HTML page and loop through it line by line. The problem is that the page is around 7300 lines long and the loop takes about 20 seconds to finish.

I was wondering if anyone could help me find a way to speed it up a little?

String urltext = "http://www.kriogenic.com/javadoc/org/rsbot/script/Methods.html";

URL url = new URL(urltext);

BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine = "";
String temp;
int i = 10;
while ((temp = in.readLine()) != null) {
    inputLine += temp;
    if (temp.contains("<!-- ============ METHOD DETAIL ========== -->")) {
        i = 0;
    }
    if (i == 0 && temp.contains(messages[2])) {
        int io = temp.indexOf(messages[2]);
        String il = temp.substring(io, (temp.indexOf(")", io) + 1));
        String link = urltext + "#" + il;
        sendMessage(sender, link);
        sendMessage(channel, "Took " + (System.currentTimeMillis() - mt) + " Milliseconds");
        i = 10;
    }
}
in.close();

html-css java

All 9 Replies

adams161 21 Posting Whiz in Training

15 Years Ago

You are checking whether the line contains that string on every line you read. If the page has 1000 lines and you find it on line 963, you ran contains 962 times and got a negative result each time.

If the page is huge, it can make sense to check for the string before reaching the end. But if the page has 1000 lines and you find it on your 355th attempt, on line 355, would it have hurt to read a few more lines and run contains only at intervals? For example, you could run the contains check every 10 or every 100 lines, and do one final check at the end to be sure nothing is missed.

If your goal is to stop searching before the end, you want to find a balance between two extremes: running contains only at intervals, which lets some extra lines be read, and running it on every line.

I'm not sure what the cost of reading a line is, but running contains hundreds of extra times may not be worth it just to avoid reading a few extra lines.

Mike

adams161 21 Posting Whiz in Training

15 Years Ago

Also, if you find what you are looking for, maybe break from the while loop. As it is, it looks like you search to the end of the page every time. If you are going to do that, maybe just check once at the end.
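A minimal sketch of that early exit, assuming the search is pulled into a helper that takes any BufferedReader. The class name AnchorSearch, the method findAnchor, and the sample page in main are made up for illustration; only the marker comment comes from the question. Unlike the original loop, this version skips a line with no closing ")" instead of throwing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class AnchorSearch {
    // Marker comment taken from the question's code.
    static final String MARKER = "<!-- ============ METHOD DETAIL ========== -->";

    /**
     * Returns the text from the first occurrence of name up to and including
     * the next ')' on the same line, looking only at lines from MARKER onward.
     * Returns null if nothing matches.
     */
    static String findAnchor(BufferedReader in, String name) throws IOException {
        boolean inDetail = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (!inDetail && line.contains(MARKER)) {
                inDetail = true;
            }
            if (inDetail) {
                int start = line.indexOf(name);
                if (start >= 0) {
                    int end = line.indexOf(')', start);
                    if (end >= 0) {
                        // Early exit: returning here stops reading the rest of the page.
                        return line.substring(start, end + 1);
                    }
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the downloaded page. The real code would wrap
        // new InputStreamReader(url.openStream()) instead of a StringReader.
        String page = "summary mentions getName()\n"
                + MARKER + "\n"
                + "<a name=\"getName()\">\n"
                + "more text\n";
        try (BufferedReader in = new BufferedReader(new StringReader(page))) {
            System.out.println(findAnchor(in, "getName"));
        }
    }
}
```

Nothing is concatenated here, so the per-line cost stays proportional to the length of that line.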

adams161 21 Posting Whiz in Training

15 Years Ago

Edited: It looks like you are not using inputLine, so that is one more thing you may not need to do. Also, since it holds the whole page, calling it inputLine may be a misnomer.

adams161 21 Posting Whiz in Training · Answer 1 · 2009-12-11T12:05:36+00:00

I see you are only searching temp, the current line, and never use inputLine. The interval idea I started with might work fine. Sorry for the multiple responses. You might create a variable, lineblock, that holds 100 lines at a time. Then on 9700 lines you search 97 blocks of 100. This might be faster. If you need to know the exact lines once you find it, say it's on lines 45, 54, and 75, pass lineblock to another method that searches that more interesting block of 100 lines in more detail.
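The lineblock idea could be sketched roughly as follows. This is only an illustration: the names BlockSearch and firstBlockContaining are invented, and it uses a StringBuilder that is emptied after each block rather than a growing String. Whether it beats a per-line check would have to be measured.

```java
import java.io.BufferedReader;
import java.io.IOException;

class BlockSearch {
    /**
     * Reads lines in blocks of blockSize and searches once per block.
     * Returns the zero-based index of the first block containing needle,
     * or -1 if no block contains it.
     */
    static int firstBlockContaining(BufferedReader in, String needle, int blockSize)
            throws IOException {
        StringBuilder block = new StringBuilder();
        int linesInBlock = 0;
        int blockIndex = 0;
        String line;
        while ((line = in.readLine()) != null) {
            // The newline keeps adjacent lines from fusing into a false match.
            block.append(line).append('\n');
            linesInBlock++;
            if (linesInBlock == blockSize) {
                if (block.indexOf(needle) >= 0) {
                    return blockIndex;
                }
                block.setLength(0); // empty the block, keep the buffer
                linesInBlock = 0;
                blockIndex++;
            }
        }
        // Check the final, partially filled block.
        if (linesInBlock > 0 && block.indexOf(needle) >= 0) {
            return blockIndex;
        }
        return -1;
    }
}
```

Once a block index comes back, only those blockSize lines need the detailed line-by-line pass.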

Also, I don't know how the contains method really works behind the scenes, but it may take longer to find a longer string. I'm thinking about how I would write a contains method myself. If I were looking for 'cat', I'd loop until I find 'c', then check for 'a', then check for 't'. Depending on the layout of your page, you might gain some speed if you harness this idea: first look for a smaller string, then look for the full string in a second pass. If comments are few, then just <!-- may be enough to look for and cheap in terms of CPU time. If comments are everywhere, then maybe look for METHOD DETAIL, and if it's there, do the full check.

Kriogenic 0 Newbie Poster · Answer 2 · 2009-12-11T14:21:11+00:00

Basically, a variable is passed in and I have to find it in the HTML, so the first string I am looking for is always in a different spot. I look for that first, and once that's found I know the first instance of the string I am looking for AFTER that first string is the right one.

I have added a break so it leaves the while loop after finding the correct one, which speeds it up a tiny bit unless the string searched for is near the end. inputLine is never used; it was something I forgot to get rid of.

Basically, it takes roughly 20 to 30 seconds to complete that while loop, and I need to get it under 10 with the same result: finding the first occurrence of the string in messages[2] after the first and only occurrence of <!-- ============ METHOD DETAIL ========== --> in an HTML file that has roughly 7300 lines.

Any suggestions would be appreciated. Thanks adams161. I don't think I need to search line by line; I could do the search after every 100 lines. But I think line by line is faster, as I would still have to concatenate each line onto the last to build a string of 100 lines to search.

Thanks,
Kriogenic.

adams161 21 Posting Whiz in Training · Answer 3 · 2009-12-11T15:23:17+00:00

I'd still be thinking about how the Java programmers wrote the contains method. If they reduce the string to a big array of characters in memory, then "dadlaow" is 7 characters. If I want to find 'ao', I have to do:

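// string is the text as a char[]; stop one short so string[a + 1] stays in bounds
for (int a = 0; a < string.length - 1; a++)
{
    if (string[a] == 'a')
    {
        if (string[a + 1] == 'o')
        {
            // more checks if the search string is longer and you keep finding hits, i.e. you're finding the pattern
        }
    }
}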

Now, I chose 'ao' for a reason. Notice that 'ad' occurs first. That means on the first instance of 'a' it had to enter the check for 'o'. So more lines of code had to run here than if I had searched for 'lo'.
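To make that mental model concrete, here is a small runnable version of the same left-to-right scan that also counts character comparisons. It is a textbook naive search written for illustration (NaiveSearch and the comparisons counter are invented names), not the JDK's implementation: String.contains delegates to String.indexOf, whose internals vary by JVM version.

```java
class NaiveSearch {
    /**
     * Naive left-to-right substring search. comparisons[0] is incremented
     * once per character comparison so the amount of work can be inspected.
     * Returns the index of the first match, or -1.
     */
    static int indexOf(String haystack, String needle, int[] comparisons) {
        for (int i = 0; i + needle.length() <= haystack.length(); i++) {
            int j = 0;
            while (j < needle.length()) {
                comparisons[0]++;
                if (haystack.charAt(i + j) != needle.charAt(j)) {
                    break; // mismatch: move the window one position right
                }
                j++;
            }
            if (j == needle.length()) {
                return i; // every character of needle matched
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] count = new int[1];
        int pos = indexOf("dadlaow", "ao", count);
        System.out.println("found at " + pos + " after " + count[0] + " comparisons");
    }
}
```

The 'ad' at index 1 is the false start: it costs one extra comparison before the scan moves on and eventually finds 'ao' at index 4.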

The first character you search for, '<', is very common in HTML documents. Every time contains encounters '<', it has to check more characters. Furthermore, every comment, if there are lots of comments, triggers even more checks. The more false hits it gets on the first character, and the longer it has to keep checking additional characters, the greater the load.

METHOD DETAIL consists of capital letters, which may be less common on the page. Maybe if you did:

if (temp.contains("HO"))
{
    // now do your second check for what you are really looking for
}

If capital "H" occurs much less frequently on the page than "<", you should gain speed if contains works this way.

Furthermore, from my example you can see that every time contains() is called, there is some setup the Java method has to do. You have the cost of concatenating, but you were already doing that with inputLine, so if removing inputLine didn't change the speed much, then concatenating might be inexpensive. Setting up the contains call 9700 times might be more expensive than concatenating. You still run contains over the same number of characters or lines, but you don't have to load a method into memory and set up the for loop it's probably using and the if statements that compare.

I'd say try it and see if it helps. I've had to write algorithms for speed at times, and you never know exactly what is going to help until you try things. Offhand, though, I'd say running contains to first look for "<" and then always doing additional work is expensive, since "<" is common on HTML pages. Ultimately, you just have to try things.

Mike

adams161 21 Posting Whiz in Training · Answer 4 · 2009-12-11T15:31:53+00:00

I've got to admit, though, I get nervous when you say breaking from the while loop only speeds it up a tiny bit unless the string is near the end. If the string is in the middle and breaking doesn't save several seconds, then your while loop may not be that expensive after all. Download speed is a factor as well. Maybe of the 19 seconds your method takes, only 2 or 3 seconds are the while loop and 16 or 17 seconds are the download of the page. You probably need to know this.
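One way to find out is to time the two phases separately: first read every line into memory, then search the lines already held. The sketch below is only an illustration; PhaseTimer and its methods are invented names, and main uses a generated in-memory page so it runs without a network connection. For the real page, the reader would wrap new InputStreamReader(url.openStream()).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class PhaseTimer {
    // Phase 1: pull every line off the reader (for a URL this includes the download).
    static List<String> readAllLines(BufferedReader in) throws IOException {
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }

    // Phase 2: pure in-memory search. Returns the zero-based line index, or -1.
    static int firstLineContaining(List<String> lines, String needle) {
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains(needle)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Generated stand-in for the page, roughly the size mentioned in the thread.
        StringBuilder page = new StringBuilder();
        for (int i = 0; i < 7300; i++) {
            page.append("<td>row ").append(i).append("</td>\n");
        }

        long t0 = System.nanoTime();
        List<String> lines;
        try (BufferedReader in = new BufferedReader(new StringReader(page.toString()))) {
            lines = readAllLines(in);
        }
        long t1 = System.nanoTime();
        int hit = firstLineContaining(lines, "row 7000<");
        long t2 = System.nanoTime();

        System.out.println("read:   " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("search: " + (t2 - t1) / 1_000_000 + " ms (line " + hit + ")");
    }
}
```

If the read phase dominates against the real URL, tuning the search loop will not help much.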

adams161 21 Posting Whiz in Training · Answer 5 · 2009-12-11T16:27:13+00:00

I wrote the following program:

public class container
{
    public static void main(String[] args)
    {
        String temp;
        String temp1 = "a";
        System.out.println("start");
        for (int a = 0; a < 9700; a++)
        {
            // one simulated line of the page
            temp = "P.S. Do you find DaniWeb helpful? If so, please consider a Member Donation. Supporting our";
            temp1 += temp; // added on second try
            // Added on third try. Before this, contains ran on every line,
            // and on the second try temp1 grew continually.
            if (a % 100 == 0)
            {
                boolean b = temp1.contains("its");
                temp1 = ""; // clear the block
            }
        }
        System.out.println("done");
    }
}

From what I gather, concatenating the whole web page onto inputLine is very expensive. What you see above is the third pass, and it runs fast.

First I just wrote the string to temp and did a contains on it.

Performance was fast.

Then I made temp1 and added temp to it:

temp1 += temp;

My string was longer, and just concatenating the whole thing made it take a minute to run, where before it took a second!

Then I did my third try, concatenating 100 lines and then clearing the string. This ran in about a second, I'd say. I think no concatenating at all is fastest.

I don't know your specs exactly, or what will best simulate what is going on for you, but that line where you add temp to the string over and over is incredibly slow on my machine. Adding temp to temp1 100 times and then clearing it is not a big deal.
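The likely reason is that Java strings are immutable, so each += builds a brand-new string and copies everything accumulated so far; over thousands of lines the total copying grows roughly with the square of the line count. If the whole page really is needed as one string, a StringBuilder avoids the repeated copying. A small comparison sketch follows (ConcatDemo and its method names are invented, and timings will vary by machine and JVM):

```java
class ConcatDemo {
    // Repeated += : every iteration copies the whole accumulated string.
    static String joinWithPlus(String line, int times) {
        String all = "";
        for (int i = 0; i < times; i++) {
            all += line;
        }
        return all;
    }

    // StringBuilder: appends into a growable buffer, one final copy in toString().
    static String joinWithBuilder(String line, int times) {
        StringBuilder all = new StringBuilder(line.length() * times);
        for (int i = 0; i < times; i++) {
            all.append(line);
        }
        return all.toString();
    }

    public static void main(String[] args) {
        String line = "P.S. Do you find DaniWeb helpful? If so, please consider a Member Donation. Supporting our";
        int times = 9700;

        long t0 = System.nanoTime();
        String a = joinWithPlus(line, times);
        long t1 = System.nanoTime();
        String b = joinWithBuilder(line, times);
        long t2 = System.nanoTime();

        System.out.println("+= took            " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("StringBuilder took " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("same result: " + a.equals(b));
    }
}
```

Both methods produce the same string; only the amount of copying differs.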

Mike

Kriogenic 0 Newbie Poster · Answer 6 · 2009-12-11T23:13:18+00:00

adams161, thanks a lot. I took your advice and it now takes 3 seconds as opposed to 20 seconds.

Thanks,
Kriogenic.
