Ok...my mind is exploding right now lol. I can't figure out a good way to parse Google results.

I am making a game tool that will look up the 5 most recent news items for a specified clan, within the past month, on a single site.

So what I have is code that first builds a custom URL. Let's use this example URL:

http://www.google.com/search?q=allintitle%3A+Gladiatorz+site%3Aforums.zybez.net&hl=en&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&num=5&lr=&ft=i&cr=&safe=images&tbs=,qdr:m

If you pop that into the URL bar, you will see that it does everything I asked it to.
It searches the domain forums.zybez.net for titles with Gladiatorz in them, and it returns 5 results that were posted within the past month.
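
For anyone following along, here is a minimal sketch of how a URL like that could be assembled. The class and method names (SearchUrlBuilder, buildUrl) are hypothetical, and only a few of the query parameters from the example URL are kept. Whether Google still honours them is an assumption, not something this thread confirms.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not code from the original post.
class SearchUrlBuilder {
    // Builds a search URL for titles containing the clan name on one site.
    static String buildUrl(String clan, String site) {
        String query = "allintitle: " + clan + " site:" + site;
        // URLEncoder turns ':' into %3A and spaces into '+'.
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        // num=5 and tbs=,qdr:m are copied from the example URL above.
        return "http://www.google.com/search?q=" + encoded + "&hl=en&num=5&tbs=,qdr:m";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("Gladiatorz", "forums.zybez.net"));
    }
}
```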

Now, my problem is that the HTML for this result is.....massive.
So I figured I would just cut off the first big chunk of the HTML, which is just Google's own stuff, but to use indexOf I need a unique string to search for. The closest thing I can find is this (around character 20348):

<h3 class="r">

However, there are quotes within it, and when I escape them with \" the backslash seems to end up as part of the string being searched for.

As you can see, this makes it very frustrating lol.

So any help in that area would be superb, OR if you have a better method of parsing this data, that would be cool too.

I will be working on this on and off ALL day, so I shall check this often.


Thanks,
-Austin

All 4 Replies

If you ask me "Search Results" would be a better String.

In any case you can always do this idiocy

char quote = '"';
String search = "<h3 class=" + quote + "r" + quote + ">";

or

char[] text = { '<', 'h', '3', ' ', 'c', 'l', 'a', 's', 's', '=', '"', 'r', '"', '>' };
String search = new String(text);

but, in any case the simple

String search = "<h3 class=\"r\">";

should work without any problem.
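
To back that up, here is a small sketch showing that the backslashes only exist in the source code and never make it into the actual String. The class name QuoteEscapeDemo and the sample html value are made up for illustration; the real page would be whatever was downloaded.

```java
// Hypothetical demo class, not code from the original post.
class QuoteEscapeDemo {
    // The marker written with escaped quotes.
    static String marker() {
        return "<h3 class=\"r\">";
    }

    public static void main(String[] args) {
        String search = marker();
        // Made-up stand-in for the downloaded page.
        String html = "<div>Google header</div><h3 class=\"r\"><a href=\"#\">Gladiatorz news</a></h3>";

        System.out.println(search);                 // shows the marker without any backslashes
        System.out.println(search.indexOf('\\'));   // -1, no backslash in the string
        System.out.println(html.indexOf(search));   // position of the '<' that starts the marker
    }
}
```

So if indexOf lands somewhere unexpected, the escaping is probably not the cause. It may simply be that the marker first appears at a different position than the one seen in the browser's source view.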

The String search = "<h3 class=\"r\">"; version didn't work. When I printed the index it found, it was around the 35k character mark, not the 20k one, and when I printed substring(search,logpuller.length()) it had none of the results in it.

I shall try the other methods

Thanks,
-Austin

aanders,
A couple of weeks ago I was hired to do some data mining, and the mechanism at the core of the task I was assigned is exactly what you are trying to accomplish.

Here is what I did, and I believe it will help you too.

I accessed the remote files with Java's URL, BufferedReader, InputStream, and InputStreamReader classes, and read the HTML source one line at a time. That way, every time I searched for a substring inside a line, the result was either -1 or something far smaller than 35k, because a single line of HTML is rarely anywhere near that long.
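
The line-by-line approach can be sketched roughly like this. The names LineScanner and linesContaining are hypothetical, and the example reads from a StringReader so it runs without a network connection. For a live page the Reader would presumably be an InputStreamReader wrapped around the stream opened from the URL.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the code I actually used on the job.
class LineScanner {
    // Reads the source one line at a time and keeps the lines containing the marker.
    static List<String> linesContaining(Reader source, String marker) {
        List<String> hits = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.indexOf(marker) != -1) {   // -1 means "not on this line"
                    hits.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up page content standing in for the downloaded HTML.
        String page = "<html>\n<h3 class=\"r\">First result</h3>\n<p>noise</p>\n<h3 class=\"r\">Second result</h3>\n</html>";
        for (String hit : linesContaining(new StringReader(page), "<h3 class=\"r\">")) {
            System.out.println(hit);
        }
    }
}
```

One caveat: if the page packs several results onto a single line, each matching line would still need to be scanned more than once.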

Additionally, as I had to do the same for various sites, I would first analyze the source code of the pages I was interested in and look for ways to reduce the part of each page that had to be parsed, and that could help you too. The simplest example would be reading only the source between <body and </body instead of the entire page.
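
As a rough illustration of narrowing the page down first, something along these lines would do. BodySlice and between are made-up names, and the fallback behaviour when a marker is missing is just one possible choice.

```java
// Hypothetical helper for cutting a page down to the part worth parsing.
class BodySlice {
    // Returns the text from startMarker up to (not including) endMarker.
    static String between(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start == -1) {
            return "";                      // start marker not found: nothing to parse
        }
        int end = html.indexOf(endMarker, start);
        if (end == -1) {
            end = html.length();            // no end marker: keep everything after the start
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>t</title></head><body class=\"a\"><p>hi</p></body></html>";
        System.out.println(between(html, "<body", "</body"));
    }
}
```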

One more point: when I looked for the index of a string, say <hr class=, and needed to extract the text coming after it, I would always add the length of the searched-for string to the returned index so that the marker itself was not part of what I extracted.

You probably know this already, but as a reminder, string.indexOf("<hr class=") returns the position of the < character, in case you forgot.
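
In code, the "add the length of the marker" trick looks roughly like this. AfterMarker and textAfter are hypothetical names, and the sample line is made up.

```java
// Hypothetical helper showing how to skip past the marker itself.
class AfterMarker {
    // Returns the text between marker and terminator, or null if marker is absent.
    static String textAfter(String line, String marker, String terminator) {
        int at = line.indexOf(marker);          // position of the marker's first character
        if (at == -1) {
            return null;
        }
        int start = at + marker.length();       // first character after the marker
        int end = line.indexOf(terminator, start);
        if (end == -1) {
            end = line.length();                // no terminator: take the rest of the line
        }
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        String line = "<h3 class=\"r\">Gladiatorz recruiting</h3>";
        System.out.println(textAfter(line, "<h3 class=\"r\">", "</h3>"));
    }
}
```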

I hope this was of some help.

Hey guys, I got it to work, and I can pretty much parse everything PERFECTLY. I have a small glitch that I am still working on, as I do not get why it is reacting the way it is, but so far the search is good.

Thanks!
-Austin
