Hi,
I am developing a web crawler in Java. So far I have written a program that parses all the hyperlinks from an entered URL, visits each link one by one, and repeats the process. Now I want to extract all the visible text from a particular web page, and this is where I am stuck. Can anyone suggest how to accomplish this? Any help will be greatly appreciated.
Thanks in advance
Rishabh jha

All 4 Replies

Since you are already retrieving the HTML code, you can parse the <a href="HTML SITE"> </a> tags in order to get the hyperlinks. What exactly do you mean by "parsing all the visible text from the particular web page"?
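
For reference, here is a minimal sketch of that href extraction using only java.util.regex. The names LinkExtractor and extractLinks are made up for illustration, and the pattern assumes attribute values are quoted. A regular expression only approximates HTML parsing, so treat this as a starting point rather than a complete solution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class LinkExtractor {
    // Matches the href value of an <a> tag, quoted with either " or '.
    // Group 1 is the quote character, group 2 is the URL.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every href value found in the given HTML, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/a\">A</a> and "
                + "<A HREF = 'b.html'>B</A></p>";
        System.out.println(extractLinks(html)); // [http://example.com/a, b.html]
    }
}
```

Relative links such as b.html still have to be resolved against the page URL before the crawler can visit them.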

Hi apines,
I have done exactly that for parsing the hyperlinks. By visible text I mean all the text content that is visible on the web page. I would be obliged if you could suggest a solution, as visible text is present in various forms in a web page (title, body, headings, etc.).

I am not sure that I fully understood. Can you please give me an example of visible text that you cannot parse?

Also, a search in DaniWeb revealed this post, which contains more information and other links to tutorials regarding web crawlers and Java. Might help you as well.

There are many instances, for example the text displayed on buttons, on links, etc. The problem is that every web page has a different layout, and I am not able to work out which portion of the page source holds the text we are viewing. If you analyse the source code of any web page, you will find that there is no specific region for the visible text.
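
One possible way around the "no specific region" problem, as a rough sketch: instead of looking for where the text lives, treat everything that is not markup as visible text. The code below removes comments and script/style blocks, drops the remaining tags, decodes a handful of common entities, and collapses whitespace. The names VisibleText and extract are hypothetical, and the regex approach is an assumption chosen to keep the example stdlib-only; a real HTML parser would be more robust on malformed pages.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class VisibleText {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // HTML comments are never rendered.
    private static final Pattern COMMENTS = Pattern.compile("<!--.*?-->", FLAGS);
    // The contents of <script> and <style> are code, not page text.
    private static final Pattern NON_VISIBLE =
            Pattern.compile("<(script|style)\\b[^>]*>.*?</\\1\\s*>", FLAGS);
    // Any remaining tag, e.g. <p>, </a>, <h1 class="x">.
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");

    // Returns the text between tags as a single whitespace-normalised string.
    static String extract(String html) {
        String text = COMMENTS.matcher(html).replaceAll(" ");
        text = NON_VISIBLE.matcher(text).replaceAll(" ");
        text = TAGS.matcher(text).replaceAll(" ");
        // Decode a few common entities; &amp; goes last to avoid double decoding.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        return text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String html = "<html><head><title>My Page</title>"
                + "<style>p { color: red; }</style></head><body><!-- hidden -->"
                + "<h1>Hello</h1><p>Tom &amp; Jerry</p>"
                + "<script>var x = '<b>no</b>';</script>"
                + "<a href=\"x.html\">Next</a><button>Go</button></body></html>";
        System.out.println(extract(html)); // My Page Hello Tom & Jerry Next Go
    }
}
```

This picks up the title, headings, link text and <button> text in one pass. It does not cover labels stored in attributes (for example the value of an <input type="submit">), and it cannot tell whether an element is hidden by CSS, so both cases would need separate handling.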
