Java html reading

Thread Solved

#1 KirkPatrick (Junior Poster), 29 Days Ago
I have never worked with web pages from Java and I'm curious how it would work. I have a specific question that perhaps someone can answer.

How would one go about pulling certain information from a forum? To make the question clearer, here is an example:

Let's say I am browsing this forum and am currently on one of its pages. Then I click View Source, which gives me the source of the page currently open.

Now let's say I want to pull out every title in there, e.g. title=""

After pulling out each title, I would want to add it to a text file or something similar so I can refer to it later. In the meantime I would probably add it to an ArrayList of some sort, because the next thing I would want the program to do is click "next page" and do the same on the following page.

Can anyone shed some light on how to do such a thing? Or perhaps let me know what I need to read up on to get where I want to go.

#2 Ezzaral (Moderator), 29 Days Ago

#3 KirkPatrick (Junior Poster), 29 Days Ago
Much appreciated response, Ezzaral. Just one quick question: when reading the text from the URL, does that automatically read the source of the page?
Last edited by KirkPatrick; 29 Days Ago at 3:17 pm.

#4 santiagozky (Newbie Poster), 29 Days Ago
I don't quite get your question. Are you asking how to get all the titles programmatically, or how to store the title strings?

If you are asking how to get the titles, I would suggest reading the page line by line and using a regular expression to extract the string you want.

Take a look at the Java Pattern class: http://java.sun.com/j2se/1.4.2/docs/...x/Pattern.html

You will probably be doing something like title=\"([a-zA-Z0-9]+)\" and then getting the string from the capturing group. (I haven't done this in a while so the example may be wrong, but that is the idea.)

I did something like this but I have the code at home. I'll dig it up tonight to show you.
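
The capturing-group idea above could be sketched roughly as follows. This is only an illustration: the class name TitleExtractor and the sample markup are made up, and the character class is widened to [^"]* (an assumption, not part of the suggestion above) so that titles containing spaces or punctuation are matched too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractor {
    // Matches title="..." and captures everything between the quotes.
    // [^"]* is an assumption: it accepts any character except a double quote.
    private static final Pattern TITLE = Pattern.compile("title=\"([^\"]*)\"");

    // Returns the captured value of every title="..." found in the text.
    static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {          // find() moves on to the next match each call
            titles.add(m.group(1)); // group(1) is the part inside the parentheses
        }
        return titles;
    }

    public static void main(String[] args) {
        // Hypothetical markup, only for demonstration.
        String sample = "<a href=\"t1\" title=\"First thread\">x</a> "
                      + "<a href=\"t2\" title=\"Second\">y</a>";
        System.out.println(extractTitles(sample)); // prints [First thread, Second]
    }
}
```

Calling find() in a loop, rather than once, is what picks up several titles that happen to sit on the same line.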

#5 KirkPatrick (Junior Poster), 29 Days Ago
I'll see if I can explain my question more clearly.

What I want to do is create a program that reads a web page's source, grabs a specific field, and writes that field's string value to a text file.

As another example, say you're viewing this forum and there is a field in the source named thread_title.

In the source you see: thread_title="title1"

A few lines down in the same source you see thread_title again, but this time it is: thread_title="title2"

Now let's say there are a hundred of these per page and I want to get all of them and write them to a text file. Once page one is done, I want to go to the next page (by clicking the next page button/link) and do the same thing.

So I would want my text file to look like this:
title1
title2
title3
title4
etc...
I hope that clears up what I want to do. If you have an example, that would be awesome; I tend to understand how things work better when I have an example in front of me.

So to answer your question, I am looking for how to store the title strings.

I'll be reading up on patterns, regular expressions, and reading a URL in the meantime. Thanks for the help so far, guys.
Last edited by KirkPatrick; 29 Days Ago at 3:39 pm.
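
The workflow described in post #5 (collect every thread_title value, then write one per line to a text file) might look something like the sketch below. The class name ThreadTitleWriter, the output file name titles.txt, and the sample page text are all assumptions made for illustration; the thread_title="..." field is taken from the example in the post.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreadTitleWriter {
    // Captures the text between the quotes of thread_title="...".
    private static final Pattern THREAD_TITLE =
            Pattern.compile("thread_title=\"([^\"]*)\"");

    // Collects every thread_title value found in the page source, in order.
    static List<String> collect(String pageSource) {
        List<String> titles = new ArrayList<>();
        Matcher m = THREAD_TITLE.matcher(pageSource);
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    // Writes the titles to the given file, one title per line.
    static void save(List<String> titles, Path out) throws IOException {
        Files.write(out, titles, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical page source standing in for a downloaded page.
        String page = "<td thread_title=\"title1\">...</td>\n"
                    + "<td thread_title=\"title2\">...</td>";
        save(collect(page), Paths.get("titles.txt")); // hypothetical file name
    }
}
```

To cover several pages, the lists returned by collect for each page could be appended to one ArrayList before calling save once at the end.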

#6 Ezzaral (Moderator), 29 Days Ago
Originally Posted by KirkPatrick View Post
Much appreciated response ezzaral, just one quick question when reading the text from the url, does that automatically read the source of the page?
Give it a quick try and you'd have a definitive answer.

#7 Ezzaral (Moderator), 29 Days Ago
(Yes, the HTML page source is just marked up text)
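
To see this for yourself, a small reader along the following lines should print the same marked-up text that View Source shows. It is a sketch only: the class name PageSourceReader is made up, and decoding the bytes as UTF-8 is an assumption that may not hold for every page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageSourceReader {
    // Reads everything the URL returns and hands it back as one string.
    // For a web page, that is the raw HTML source.
    static String readSource(URL url) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the stream even if reading fails.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java PageSourceReader <url>");
            return;
        }
        System.out.print(readSource(new URL(args[0])));
    }
}
```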

#8 santiagozky (Newbie Poster), 29 Days Ago
This is some code I wrote for a custom crawler that gets images from a site. Just change the pattern and the method that stores the string (Bajador is a class that downloads the image in my case), and put everything in a loop that goes through all the URLs you wish to examine.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Compile the pattern once, outside the read loop.
    // The dot before "jpg" is escaped so it only matches a literal dot.
    Pattern pattern = Pattern.compile("<img src=\"/(\\w*)\\.jpg\" alt=\"Picture\"/></div>");
    try {
        URL url = new URL(baseURL + "/index.php?id=" + par);
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher matcher = pattern.matcher(line.trim());
                boolean matchFound = matcher.find();
                if (matchFound) {
                    // group(1) is the text matched inside the parentheses: the file name.
                    String nombre = matcher.group(1);
                    // Full URL of the image (not used below).
                    String urlI = baseURL + "/file/" + nombre + ".jpg";
                    // Bajador is my own class that downloads the image.
                    Bajador baj = new Bajador(baseURL, nombre, par);
                    baj.start();
                    // baja(baseURL, nombre);
                }
            }
        }
    } catch (MalformedURLException ex) {
        Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
    } catch (IOException e) {
        Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, e);
    }

Hope this helps you.
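
The "put everything in a loop" part might be sketched like this. It assumes the pages are numbered by consecutive integer ids, which post #8 does not say; the class name PageUrls and the host example.com are placeholders, while the /index.php?id= shape is taken from the code above.

```java
import java.util.ArrayList;
import java.util.List;

class PageUrls {
    // Builds one URL per page id, following the baseURL + "/index.php?id=" + par
    // shape used in the crawler above. Consecutive integer ids are an assumption.
    static List<String> pageUrls(String baseURL, int firstId, int lastId) {
        List<String> urls = new ArrayList<>();
        for (int id = firstId; id <= lastId; id++) {
            urls.add(baseURL + "/index.php?id=" + id);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Each of these would be fetched and scanned in turn.
        for (String u : pageUrls("http://example.com", 1, 3)) {
            System.out.println(u);
        }
    }
}
```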

#9 adatapost (Moderator), 29 Days Ago
You could use HTML Parser, which is a Java library used to parse HTML in either a linear or nested fashion. It is an open source tool and can be found on SourceForge.
You could also use the Swing HTML parser.
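
As a rough idea of what the Swing route looks like, the sketch below uses the JDK's javax.swing.text.html parser to collect the title attribute of every tag. The class name SwingTitleCollector is made up for illustration, and this shows only one way of using that parser; the SourceForge HTML Parser library is not shown because its API is not described in this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Callback that records the title attribute of every tag the parser reports.
class SwingTitleCollector extends HTMLEditorKit.ParserCallback {
    final List<String> titles = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        collect(attrs); // tags with a closing tag, such as <a>
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        collect(attrs); // empty tags, such as <img>
    }

    private void collect(MutableAttributeSet attrs) {
        Object title = attrs.getAttribute(HTML.Attribute.TITLE);
        if (title != null) {
            titles.add(title.toString());
        }
    }

    // Parses the given HTML text and returns the title attributes found.
    static List<String> titlesOf(String html) throws IOException {
        SwingTitleCollector collector = new SwingTitleCollector();
        // The last argument (true) tells the parser to ignore charset declarations.
        new ParserDelegator().parse(new StringReader(html), collector, true);
        return collector.titles;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical markup, only for demonstration.
        String html = "<html><body><a href=\"t1\" title=\"First thread\">x</a></body></html>";
        System.out.println(titlesOf(html));
    }
}
```

Compared with a regular expression, a parser does not depend on how the attributes are spaced or ordered inside the tag.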

#10 KirkPatrick (Junior Poster), 28 Days Ago
Wow, you guys have been very helpful! I appreciate each of your posts.


Originally Posted by Ezzaral View Post
(Yes, the HTML page source is just marked up text)
Sorry, I left the other day before getting back to you about it. I had assumed it read straight from the source, but just wanted to make sure. Thanks for confirming that.
----

Originally Posted by santiagozky View Post
This is some code I wrote for a custom crawler that gets images from a site. Just change the pattern and the method that stores the string (Bajador is a class that downloads the image in my case), and put everything in a loop that goes through all the URLs you wish to examine.


That helped me quite a bit. I read up on patterns, and while they seem a bit confusing, I understood enough to believe you are correct that I only need to match a-z and 0-9.

I am left a bit confused about some of your code, so if you don't mind I'll just ask the questions here for a better understanding.

    if (matchFound) {
        String nombre = matcher.group(1);
        String urlI = baseURL + "/file/" + nombre + ".jpg";
        Bajador baj = new Bajador(baseURL, nombre, par);
        baj.start();
        // baja(baseURL, nombre);
    }

I understand that your code is for pulling images off a website (which I might be using later) and I also understand that if a match is found this is where I tell it what to do, but I'm a bit curious about your code here.

The Java docs left me a bit confused about matcher.group. Would you mind explaining the piece of code above?

Instead of Bajador, I would use a class that adds each match to an ArrayList and writes it to a file, unless there is a better option.
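
On the matcher.group question, a small sketch may help: group(0) is the whole text the pattern matched, and group(1) is only the part matched by the first pair of parentheses. The class name GroupDemo and the file name cat01 are made up; the pattern is a shortened form of the one in post #8.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns { whole match, first capturing group }, or null if nothing matched.
    static String[] groups(String line) {
        Pattern p = Pattern.compile("<img src=\"/(\\w*)\\.jpg\"");
        Matcher m = p.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        // Hypothetical line of page source.
        String[] g = groups("<div><img src=\"/cat01.jpg\" alt=\"Picture\"/></div>");
        System.out.println(g[0]); // prints <img src="/cat01.jpg"
        System.out.println(g[1]); // prints cat01
    }
}
```

In the crawler, nombre is therefore just the file name without the extension, which is why ".jpg" is appended again when the image URL is built.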
----

Originally Posted by adatapost View Post
You could use HTML Parser, which is a Java library used to parse HTML in either a linear or nested fashion. It is an open source tool and can be found on SourceForge.
You could also use the Swing HTML parser.
Thanks for pointing this out; it looks like it cuts out a lot of the work :] I'll look into this as well. However, I'm also looking to interact with the web page, such as clicking buttons, and I'm not sure if this would allow that.

----


Another question I have is whether I will be able to manipulate menus and so on with similar code. That is, I am also looking to do this through a forum's search feature, so would I be able to click buttons and select items from drop-down menus?
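
One possible angle on this, offered only as a sketch: submitting a search form from a browser often amounts to requesting a URL with the form's field values attached, so a program can sometimes build that URL itself and read it like any other page. Everything specific below is hypothetical (the class name SearchUrl, the search.php path, and the query and forum parameter names), and a real forum may use POST requests, cookies, or a login instead.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchUrl {
    // Builds the URL a GET search form might submit to.
    // "search.php", "query" and "forum" are hypothetical names.
    static String searchUrl(String baseURL, String query, int forumId)
            throws UnsupportedEncodingException {
        // URLEncoder makes the text safe for a query string (a space becomes +).
        return baseURL + "/search.php?query=" + URLEncoder.encode(query, "UTF-8")
                + "&forum=" + forumId; // forumId stands in for a drop-down choice
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(searchUrl("http://example.com", "java html", 9));
        // prints http://example.com/search.php?query=java+html&forum=9
    }
}
```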

©2003 - 2009 DaniWeb® LLC