Java html reading
Thread Solved
Join Date: Apr 2009
Posts: 114
Reputation:
Solved Threads: 3
I have never worked with web pages through Java and I'm curious how it would work. I have a specific question that perhaps someone will be able to answer.
How would one go about pulling certain information from a forum? To make clear what I am asking, I'll give an example:
Let's say I am surfing this forum and am currently at: Then I click view source, which gives me the source of the page currently open.
Now let's say I want to pull out every title in there. Ex. title=""
After pulling out each title, I would want to add it to a text file or something of the sort so I can refer to it later. In the meantime I would probably add it to an ArrayList of some sort, because the next thing I would want the program to do is click next page and do the same on the following page.
Can anyone shed some light on how to do such a thing? Or perhaps let me know what I need to read up on to get where I want to go.
Join Date: Oct 2009
Posts: 18
Reputation:
Solved Threads: 3
#4 30 Days Ago
I don't get your question. Are you asking how to get all the titles programmatically or how to store the title strings?
If you are asking how to get the titles, I would suggest reading line by line and using a regular expression to extract the string you want.
Take a look at the Java Pattern class: http://java.sun.com/j2se/1.4.2/docs/...x/Pattern.html
You will probably be doing something like title=\"([a-zA-Z0-9]+)\" and then getting the string from the capturing group. (I haven't done this in a while, so the example is probably wrong, but that is the idea.)
I did something like this, but I have it at home. I'll dig it up tonight to show you.
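As a rough illustration of the regular-expression idea above, here is a minimal sketch that collects every title="..." value from a string of HTML. The class name TitleExtractor and the sample HTML are made up for this example. The pattern uses [^"]* instead of [a-zA-Z0-9]+, on the assumption that real titles contain spaces and punctuation, and the quantifier sits inside the parentheses so the capturing group holds the whole title rather than a single character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {
    // Matches title="..." and captures everything between the quotes.
    // Assumption: attribute values are double-quoted and contain no escaped quotes.
    private static final Pattern TITLE = Pattern.compile("title=\"([^\"]*)\"");

    /** Returns every title attribute value found in the given HTML text, in order. */
    public static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {            // find() advances to the next match each call
            titles.add(m.group(1));   // group(1) is the text inside the parentheses
        }
        return titles;
    }

    public static void main(String[] args) {
        // Hypothetical page fragment, not taken from any real forum.
        String sample = "<a href=\"t1.html\" title=\"First thread\">x</a>\n"
                      + "<a href=\"t2.html\" title=\"Second thread\">y</a>";
        System.out.println(extractTitles(sample)); // prints [First thread, Second thread]
    }
}
```

Matching over the whole page text with a find() loop also picks up several titles that happen to share one line, which a single find() per line would miss.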
Join Date: Apr 2009
Posts: 114
Reputation:
Solved Threads: 3
#5 30 Days Ago
What I want to do is create a program that reads a web page's source, grabs a specific field, and writes that field's string value to a text file.
So another example: you're viewing this forum, and let's say there is a field in the source named thread_title.
Now in the source you see: thread_title="title1"
A few lines down in the same source you see thread_title again, but this time it is: thread_title="title2"
Now let's say there are a hundred of these per page, and I want to get all of them and write them to a text file. Once page one is done, I want to go to the next page (by clicking the next page button/link) and do the same thing.
So I would want my text file to look like this:
title1
title2
title3
title4
etc...
So to answer your question, I am looking for how to store the title strings.

I'll be reading up on patterns, regular expressions, and reading a URL in the meantime. Thanks for the help so far, guys.
Last edited by KirkPatrick; 30 Days Ago at 3:39 pm.
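For the storage half of the question, here is a minimal sketch of appending a list of titles to a text file, one per line, so that each page's results can be added as it is processed. The class name TitleFile and the method appendTitles are invented for this example, and the titles are hard-coded rather than scraped.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

public class TitleFile {
    /** Appends each title as its own line, creating the file if it does not exist yet. */
    public static void appendTitles(Path file, List<String> titles) throws IOException {
        Files.write(file, titles, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the real output file.
        Path out = Files.createTempFile("titles", ".txt");

        // Pretend these lists came from page one and page two of the forum.
        appendTitles(out, Arrays.asList("title1", "title2"));
        appendTitles(out, Arrays.asList("title3", "title4"));

        // prints [title1, title2, title3, title4]
        System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
    }
}
```

Appending after every page means a failure halfway through a long run still leaves the earlier pages on disk. Keeping everything in one ArrayList and writing once at the end would work as well.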
Join Date: Oct 2009
Posts: 18
Reputation:
Solved Threads: 3
#8 29 Days Ago
This is some code I wrote for a custom crawler that gets images from a site. Just change the pattern and the method that stores the string (in my case, Bajador is a class that downloads the image), and put everything in a loop that goes through all the URLs you wish to examine.
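import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

try {
    URL url = new URL(baseURL + "/index.php?id=" + par);
    // Compile once, outside the loop; the dot before "jpg" is escaped to match a literal dot.
    Pattern pattern = Pattern.compile("<img src=\"/(\\w*)\\.jpg\" alt=\"Picture\"/></div>");
    // try-with-resources closes the reader even if reading fails.
    try (BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String line;
        while ((line = br.readLine()) != null) {
            Matcher matcher = pattern.matcher(line.trim());
            if (matcher.find()) {
                // group(1) is the text captured by the parentheses: the file name.
                String nombre = matcher.group(1);
                String urlI = baseURL + "/file/" + nombre + ".jpg";
                Bajador baj = new Bajador(baseURL, nombre, par);
                baj.start();
                // baja(baseURL, nombre);
            }
        }
    }
} catch (MalformedURLException ex) {
    Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
} catch (IOException e) {
    Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, e);
}

Hope this helps you.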
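The suggestion to put everything in a loop that goes through all the URLs could be sketched as follows. Everything named here is an assumption for illustration: the class PageLoop, the PageFetcher interface, the "?page=N" URL scheme, and the example.invalid address. A "next page" link is normally just another URL, so requesting page 1, 2, 3 and so on can stand in for clicking it, provided the site's paging scheme is known.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageLoop {
    /** Hypothetical abstraction over "download this URL as text", so the loop can be tried offline. */
    public interface PageFetcher {
        String fetch(String url) throws IOException;
    }

    /** Real fetcher: opens the URL and reads it line by line. Assumes the page is UTF-8. */
    public static String fetchOverHttp(String url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                URI.create(url).toURL().openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    /** Visits pages firstPage..lastPage and collects group 1 of every pattern match. */
    public static List<String> crawl(String baseUrl, int firstPage, int lastPage,
                                     Pattern pattern, PageFetcher fetcher) throws IOException {
        List<String> found = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            // Assumed paging scheme; a real forum may use a different query parameter.
            String html = fetcher.fetch(baseUrl + "?page=" + page);
            Matcher m = pattern.matcher(html);
            while (m.find()) {
                found.add(m.group(1));
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        Pattern p = Pattern.compile("thread_title=\"([^\"]*)\"");
        // Offline stand-in for the network: every "page" contains one title naming its URL.
        PageFetcher fake = url -> "thread_title=\"title from " + url + "\"";
        System.out.println(crawl("http://example.invalid/forum", 1, 2, p, fake));
        // Against a real site, pass PageLoop::fetchOverHttp instead of the fake fetcher.
    }
}
```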
#9 29 Days Ago
You could use HTML Parser, which is a Java library used to parse HTML in either a linear or nested fashion. It is an open source tool and can be found on SourceForge.
You could also use the Swing HTML parser.
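For the Swing option, here is a minimal sketch using the parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator). The class name SwingTitleParser and the sample markup are made up for this example. It only looks at title attributes on start tags, and it assumes the page is simple enough for Swing's rather old HTML 3.2 based parser to handle.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingTitleParser {
    /** Parses the HTML and returns the title attribute of every start tag that has one. */
    public static List<String> titles(Reader html) throws IOException {
        final List<String> out = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // The attribute set is only valid during this call, so read it right away.
                Object title = attrs.getAttribute(HTML.Attribute.TITLE);
                if (title != null) {
                    out.add(title.toString());
                }
            }
        };
        // The last argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(html, callback, true);
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical markup, not taken from any real forum.
        String sample = "<html><body>"
                + "<a href=\"t1.html\" title=\"First thread\">one</a>"
                + "<a href=\"t2.html\" title=\"Second thread\">two</a>"
                + "</body></html>";
        System.out.println(titles(new StringReader(sample)));
    }
}
```

Compared with a regular expression, a parser does not care how the attributes are ordered or whether a tag is split across lines, at the cost of a little more setup.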
Failure is not fatal, but failure to change might be. - John Wooden
Join Date: Apr 2009
Posts: 114
Reputation:
Solved Threads: 3
#10 29 Days Ago
Wow, you guys have been very helpful! I appreciate each of your posts.
Sorry I left the other day before getting back to you about it. I had assumed it read straight from the source, but I just wanted to make sure. Thanks for confirming that.
----
That helped me quite a bit. I read up on patterns, and while they seem a bit confusing, I understood enough to believe you are correct that I only need to match a-z and 0-9.
I am still a bit confused about some of your code, so if you don't mind I'll ask my questions here for a better understanding.
I understand that your code is for pulling images off a website (which I might use later), and I also understand that the match check is where I tell it what to do, but I'm curious about this part of your code:

if (matchFound) {
    String nombre = matcher.group(1);
    String urlI = baseURL + "/file/" + nombre + ".jpg";
    Bajador baj = new Bajador(baseURL, nombre, par);
    baj.start();
    // baja(baseURL, nombre);
}

The Java docs left me a bit confused about matcher.group. Would you mind explaining the piece of code above?
Instead of Bajador, I would use a class that adds each match to an ArrayList and writes it to a file, unless there is a better option.
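On the matcher.group question, a small sketch may help. group(0) is the entire text the pattern matched, and group(1) is only the part matched by the first pair of parentheses in the pattern. The class name GroupDemo and the sample line are invented here, and the pattern is a shortened version of the image pattern from the crawler code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // One pair of parentheses, so there is exactly one capturing group: the file name.
    private static final Pattern IMG =
            Pattern.compile("<img src=\"/(\\w*)\\.jpg\" alt=\"Picture\"/>");

    /** Returns {whole match, captured file name}, or null if the line has no match. */
    public static String[] groups(String line) {
        Matcher m = IMG.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        // Hypothetical line of page source.
        String line = "<div><img src=\"/sunset01.jpg\" alt=\"Picture\"/></div>";
        String[] g = groups(line);
        System.out.println("group(0): " + g[0]); // prints group(0): <img src="/sunset01.jpg" alt="Picture"/>
        System.out.println("group(1): " + g[1]); // prints group(1): sunset01
    }
}
```

So in the crawler, nombre ends up holding just the image's base name, which is then used to build the download URL.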
----
Thanks for pointing this out; it looks like it cuts out a lot of the work :] I'll look into this as well. However, I'm also looking to interact with the web page, such as clicking buttons, and I'm not sure whether this would allow that.
----
Another question I have is whether I will be able to manipulate menus and so on with similar code. I am also looking to do such things through a forum's search feature, so would I be able to click buttons and select items from drop-down menus?