Nov 5th, 2009

Java html reading

I've never worked with web pages through Java, and I'm curious how it would work. I have a specific question that perhaps someone can answer.

How would one go about pulling certain information from a forum? To make the question clearer, here is an example:

Let's say I am surfing this forum and am currently at: Then I click view source, which gives me the source of the page that is currently open.

Now let's say I want to pull out every title in there, e.g. title=""

After pulling out each title, I would want to add it to a text file or something similar so I can refer to it later. In the meantime I would probably add it to an ArrayList of some sort, because the next thing I would want the program to do is click next page and do the same on the following page.

Can anyone shed some light on how to do this, or let me know what I need to read up on to get where I want to go?
Posted by KirkPatrick

Nov 5th, 2009
Re: Java html reading
Posted by Ezzaral

Nov 5th, 2009
Re: Java html reading
Much appreciated response, Ezzaral. Just one quick question: when reading the text from the URL, does that automatically read the source of the page?
Last edited by KirkPatrick; Nov 5th, 2009 at 3:17 pm.
Posted by KirkPatrick

Nov 5th, 2009
Re: Java html reading
I don't quite get your question. Are you asking how to get all the titles programmatically, or how to store the title strings?

If you are asking how to get the titles, I would suggest reading the page line by line and using a regular expression to pull out the string you want.

Take a look at the Java Pattern class: http://java.sun.com/j2se/1.4.2/docs/...x/Pattern.html

You will probably be doing something like title=\"([a-zA-Z0-9])+\" and then getting the string from the capturing group. (I haven't done this in a while, so the example is probably wrong, but that is the idea.)

I did something like this, but I have it at home. I'll dig it out tonight to show you.
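
The pattern above has two likely problems: with the + outside the parentheses, the capturing group keeps only the last character it matched, and [a-zA-Z0-9] leaves out spaces and punctuation. Here is a minimal sketch of one way around both, assuming a title never contains a double quote. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleRegex {
    // Group 1 captures everything between the quotes:
    // one or more characters that are not a double quote.
    private static final Pattern TITLE = Pattern.compile("title=\"([^\"]+)\"");

    /** Returns every title="..." value found in the given line, in order. */
    static List<String> titlesIn(String line) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(line);
        while (m.find()) {            // each call to find() moves on to the next match
            titles.add(m.group(1));   // the text captured by the parentheses
        }
        return titles;
    }

    public static void main(String[] args) {
        // Made-up sample line, not taken from a real page
        String line = "<a title=\"First thread\">x</a> <a title=\"Second thread\">y</a>";
        System.out.println(titlesIn(line));
    }
}
```

Note that this pattern also matches attributes whose names merely end in title, such as thread_title="...", which may or may not be what you want.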
Posted by santiagozky

Nov 5th, 2009
Re: Java html reading
I'll see if I can make my question a bit clearer.

What I want to do is create a program that will read a web page's source, grab a specific field, and write that field's string to a text file.

So, as another example: you're viewing this forum, and let's say there is a field in the source named thread_title.

Now in the source you see: thread_title="title1"

A few lines down in the same source you see thread_title again, but this time it is: thread_title="title2"

Now let's say there are a hundred of these per page, and I want to get all of them and write them to a text file. Once page one is done, I want to go to the next page (by clicking the next page button/link) and do the same thing.

So I would want my text file to look like this:
title1
title2
title3
title4
etc...
I hope that clears up what I want to do. If you have an example, that would be awesome; I tend to understand how things work better when I have an example in front of me.

So to answer your question, I am looking for how to store the title strings.

I'll be reading up on patterns, regular expressions, and reading a URL in the meantime. Thanks for the help so far, guys.
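
For the storing part, here is a minimal sketch of writing a list of titles to a text file, one per line. It assumes the titles have already been collected into a List. The class name, method name, and the file name titles.txt are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class TitleWriter {
    /** Writes each title on its own line, replacing the file if it already exists. */
    static void writeTitles(List<String> titles, Path file) throws IOException {
        // Files.write adds a line separator after every element of the list
        Files.write(file, titles, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String> titles = new ArrayList<>();
        titles.add("title1");
        titles.add("title2");
        writeTitles(titles, Paths.get("titles.txt")); // hypothetical output file
    }
}
```

Collecting everything in the list first and writing once at the end keeps the file handling in one place. To add to an existing file after each page instead, Files.write also accepts StandardOpenOption.CREATE and StandardOpenOption.APPEND as extra arguments.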
Last edited by KirkPatrick; Nov 5th, 2009 at 3:39 pm.
Posted by KirkPatrick

Nov 5th, 2009
Re: Java html reading
Quote originally posted by KirkPatrick ...
Much appreciated response, Ezzaral. Just one quick question: when reading the text from the URL, does that automatically read the source of the page?
Give it a quick try and you'd have a definitive answer.
Posted by Ezzaral

Nov 5th, 2009
Re: Java html reading
(Yes, the HTML page source is just marked-up text.)
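
To see this for yourself, here is a minimal sketch that reads whatever a URL returns into one string. What comes back is the markup as the server sends it, so anything a browser adds later through scripts will not be in it. The class and method names are made up, and UTF-8 is an assumption about the page's encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageSource {
    /** Reads everything the URL returns and gives it back as one string. */
    static String read(URL url) throws IOException {
        StringBuilder source = new StringBuilder();
        // try-with-resources closes the stream even if reading fails
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                source.append(line).append('\n');
            }
        }
        return source.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageSource <url>");
            return;
        }
        System.out.print(read(new URL(args[0])));
    }
}
```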
Posted by Ezzaral

Nov 5th, 2009
Re: Java html reading
This is some code I wrote for a custom crawler that gets images from a site. Just change the pattern and the method that stores the string (Bajador is a class that downloads the image in my case) and put everything in a loop that goes through all the URLs you wish to examine.

Java Syntax
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Fragment: goes inside a method. baseURL, par, Bajador and Main are defined elsewhere in my program.
    // Compile the pattern once, outside the loop. The dot before jpg is escaped so it matches a literal dot.
    Pattern pattern = Pattern.compile("<img src=\"/(\\w*)\\.jpg\" alt=\"Picture\"/></div>");
    try {
        URL url = new URL(baseURL + "/index.php?id=" + par);
        // try-with-resources closes the reader even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher matcher = pattern.matcher(line.trim());
                boolean matchFound = matcher.find();
                if (matchFound) {
                    String nombre = matcher.group(1);                    // image name captured by (\w*)
                    String urlI = baseURL + "/file/" + nombre + ".jpg";  // full image URL (not used below)
                    Bajador baj = new Bajador(baseURL, nombre, par);     // my own downloader class
                    baj.start();
                    // baja(baseURL, nombre);
                }
            }
        }
    } catch (MalformedURLException ex) {
        Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
    } catch (IOException e) {
        Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, e);
    }

Hope this helps you.
Posted by santiagozky

Nov 5th, 2009
Re: Java html reading
You could use HTML Parser, which is a Java library used to parse HTML in either a linear or nested fashion. It is an open source tool and can be found on SourceForge.
You could also use the Swing HTML parser.
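
As a rough illustration of the Swing route, here is a sketch that uses the JDK's ParserDelegator with a callback to collect the title attribute of every tag that has one. The class and method names are made up, and it assumes the page is close enough to well-formed HTML for the Swing parser to cope with it.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingTitleParser {
    /** Collects the title attribute of every tag that has one, in document order. */
    static List<String> titles(Reader html) throws IOException {
        final List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs);   // tags with a closing partner, such as <a>
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs);   // stand-alone tags, such as <img>
            }

            private void collect(MutableAttributeSet attrs) {
                Object title = attrs.getAttribute(HTML.Attribute.TITLE);
                if (title != null) {
                    found.add(title.toString());
                }
            }
        };
        // true = ignore any charset declared inside the document
        new ParserDelegator().parse(html, callback, true);
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample markup
        String page = "<html><body><a href=\"p1\" title=\"First\">a</a></body></html>";
        System.out.println(titles(new StringReader(page)));
    }
}
```

Compared with a regular expression, the parser does not care how the attributes are ordered or how the tag is split across lines, at the cost of a little more setup.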
Posted by adatapost

Nov 6th, 2009
Re: Java html reading
Wow, you guys have been very helpful! I appreciate each of your posts.

Quote originally posted by Ezzaral ...
(Yes, the HTML page source is just marked up text)
Sorry, I left the other day before getting back to you about it. I had assumed it read straight from the source, but just wanted to make sure. Thanks for confirming that.
----

That helped me quite a bit. I read up on patterns, and while they seem a bit confusing, I understood enough to believe you are correct in saying I only needed to search from a-z and 0-9.

I am left a bit confused about some of your code, so if you don't mind I'll ask my questions here for a better understanding.

Java Syntax
    if (matchFound) {
        String nombre = matcher.group(1);
        String urlI = baseURL + "/file/" + nombre + ".jpg";
        Bajador baj = new Bajador(baseURL, nombre, par);
        baj.start();
        // baja(baseURL, nombre);
    }

I understand that your code is for pulling images off a website (which I might be using later), and I also understand that this is where I tell it what to do if a match is found, but I'm a bit curious about this part.

The Java docs left me a bit confused about matcher.group. Would you mind explaining the piece of code above?

Instead of Bajador, I would put a class that adds the match to an ArrayList and writes it to a file, unless there is a better option.
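
On matcher.group: after a successful find(), group(0) is the whole matched text and group(n) is the part matched by the n-th pair of parentheses in the pattern. Here is a small sketch using the image pattern from the quoted code (with the dot before jpg escaped) and a made-up input line. The class and method names are only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    private static final Pattern IMG =
            Pattern.compile("<img src=\"/(\\w*)\\.jpg\" alt=\"Picture\"/></div>");

    /** Returns the image name captured by group 1, or null if the line does not match. */
    static String imageName(String line) {
        Matcher m = IMG.matcher(line);
        if (!m.find()) {
            return null;
        }
        // m.group(0) would be the whole match, from <img through </div>
        return m.group(1);   // only the part matched by (\w*)
    }

    public static void main(String[] args) {
        String line = "<div><img src=\"/cat01.jpg\" alt=\"Picture\"/></div>";
        System.out.println(imageName(line));   // prints cat01
    }
}
```

So in the quoted code, nombre is just the file name without the slash or the .jpg extension, which is why the code can rebuild the full image URL from it.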
----

Quote originally posted by adatapost ...
You could use HTML Parser, which is a Java library used to parse HTML in either a linear or nested fashion. It is an open source tool and can be found on SourceForge.
You could also use the Swing HTML parser.
Thanks for pointing this out; it looks like it cuts out a lot of the work :] I'll look into this as well. However, I'm also looking to interact with the web page, such as clicking buttons, and I'm not sure whether this would allow that.
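
On moving to the next page: a program that only reads page source has nothing to click, but a plain next-page link is just another URL that can be requested in the same way. Here is a minimal sketch, assuming the link is marked up as <a href="..." rel="next">. That markup is an assumption, so check the real page source and adjust the pattern. The class and method names are made up.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NextPage {
    // Assumed markup: <a href="..." rel="next">. Real forums will differ.
    private static final Pattern NEXT =
            Pattern.compile("<a\\s+href=\"([^\"]+)\"\\s+rel=\"next\"");

    /** Returns the absolute URL of the next page, or null if no such link is found. */
    static URL find(URL current, String pageSource) throws MalformedURLException {
        Matcher m = NEXT.matcher(pageSource);
        if (!m.find()) {
            return null;   // probably the last page
        }
        // Resolve a relative href against the URL of the page it was found on
        return new URL(current, m.group(1));
    }

    public static void main(String[] args) throws MalformedURLException {
        URL current = new URL("http://example.com/forum/thread9.html"); // made-up address
        String source = "<a href=\"thread9-2.html\" rel=\"next\">Next</a>";
        System.out.println(find(current, source));
    }
}
```

A crawl loop would then read a page, extract what it needs, call find, and stop when it returns null.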

----


Another question: will I be able to manipulate menus and the like with similar code? I am also looking to do these things through a forum's search feature, so would I be able to click buttons and select items from drop-down menus?
Posted by KirkPatrick

This thread is solved

© 2011 DaniWeb® LLC