Java html reading

Question

KirkPatrick 28 Junior Poster

15 Years Ago

I haven't ever messed with any webpages through java and I'm kind of curious as to how it would work. I have a specific question that perhaps someone will be able to answer.

How would one go about pulling certain information from a forum? To better understand what I am asking, I'll give an example:

Let's say I am surfing this forum and am currently at:

http://www.daniweb.com/forums/forum9.html

Then I click view source, this gives me the source of the current page opened.

Now let's say I want to pull out every title in there. Ex. title=""

After pulling out each title I would want to add it to a text file or something of the sort so I can refer to it later. In the meantime I would probably add it to an ArrayList of some sort, because the next thing I would want it to do is click next page and do the same thing on the following page.

Can anyone shed some light on how to do such a thing? Or perhaps let me know what I need to read up on to get where I'm wanting to go.

html-css java


Ezzaral 2,714 Posting Sage

15 Years Ago

You can start with Reading Text From A URL and Regular Expressions.

KirkPatrick commented: you've always been helpful around here bud, the world needs more helpful people like you +1
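
A minimal sketch of the "reading text from a URL" step, under some assumptions: the class and method names (PageSourceReader, readLines, fetchLines) are made up for illustration, and the page is assumed to be UTF-8. The parsing is split from the fetching so it can be tried on a plain string without any network access.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PageSourceReader {

    // Reads every line from any character source into a list.
    static List<String> readLines(Reader source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Opens the address and returns the raw HTML source, line by line.
    // Assumption: the page is UTF-8; real pages declare their own charset.
    static List<String> fetchLines(String address) {
        try {
            return readLines(new InputStreamReader(
                    URI.create(address).toURL().openStream(), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // With an argument, fetch that page; otherwise use a canned snippet.
        List<String> lines = args.length > 0
                ? fetchLines(args[0])
                : readLines(new StringReader("<html>\n<a title=\"First thread\">x</a>\n</html>"));
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```

What comes back is the same marked-up text that "view source" shows in a browser, so it can be handed straight to a regular expression or a parser.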

santiagozky 1 Newbie Poster

15 Years Ago

I don't get your question. Are you asking how to get all the titles programmatically or how to store the title strings?

If you are asking how to get the titles, I would suggest reading line by line and using a regular expression to get the string you want.

Take a look at the java Pattern class: http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

You probably will be doing something like title=\"([a-zA-Z0-9])+\" and then get the string from the capturing group. (I haven't done this in a while so the example is probably wrong, but it is the idea.)

I did something like this but I have it at home. I'll rescue it at night to show you.

KirkPatrick commented: thank you for the helpful post, i'm looking forward to your example. Cheers +1
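
A rough sketch of the capturing-group idea from the post above. It is not the poster's code: the class name TitleExtractor is made up, and the character class [^"]* (anything up to the closing quote) is a swapped-in assumption so that titles containing spaces or punctuation are captured too. It loops on find() so several titles on one line are all collected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractor {

    // Group 1 captures everything between the quotes of title="...".
    private static final Pattern TITLE = Pattern.compile("title=\"([^\"]*)\"");

    // Returns every title value found in the given text, in order.
    static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        String sample = "<a title=\"Java html reading\">1</a> <a title=\"Regex help\">2</a>";
        System.out.println(extractTitles(sample)); // prints [Java html reading, Regex help]
    }
}
```

Note that this pattern also matches any attribute whose name merely ends in title, which may or may not be what you want for a given page.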

santiagozky 1 Newbie Poster

15 Years Ago

this is some code I wrote to make a custom crawler to get images from a site. just change the pattern and the method that stores the string (Bajador is a class that downloads the image in my case) and put everything in a loop that goes through all the urls you wish to examine.

try {
    URL url = new URL(baseURL + "/index.php?id=" + par);
    BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
    String line;
    while ((line = br.readLine()) != null) {
        line = line.trim();
        String pat = "<img src=\"/(\\w*).jpg\" alt=\"Picture\"/></div>";
        Pattern pattern = Pattern.compile(pat);
        Matcher matcher = pattern.matcher(line);
        boolean matchFound = matcher.find();
        if (matchFound) {
            String nombre = matcher.group(1);
            String urlI = baseURL + "/file/" + nombre + ".jpg";
            Bajador baj = new Bajador(baseURL, nombre, par);
            baj.start();
            // baja(baseURL, nombre);
        }
    }
} catch (MalformedURLException ex) {
    Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
} catch (IOException e) {
    Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, e);
}

Hope this helps you.

kvprajapati 1,826 Posting Genius

15 Years Ago

You could use HTML Parser, which is a Java library used to parse HTML in either a linear or nested fashion. It is an open source tool and can be found on SourceForge.
You could also use the Swing HTML Parser.

Ezzaral commented: Good suggestion. +9

KirkPatrick commented: thank you for pointing that out, I figured there was something of the sort created just didnt know the name. Much appreciated! +1
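
For the Swing option, here is a small sketch of how the parser that ships with the JDK could collect title attributes. The class name SwingTitleCollector and the sample markup are assumptions for illustration; ParserDelegator and HTMLEditorKit.ParserCallback are the standard javax.swing.text.html classes. Keep in mind this parser targets old HTML (3.2), so results on modern pages may vary.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingTitleCollector {

    // Parses the HTML and returns the title attribute of every tag that has one.
    static List<String> collectTitles(Reader html) {
        final List<String> titles = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                record(attrs);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                record(attrs); // tags without a closing tag, such as <img>
            }

            private void record(MutableAttributeSet attrs) {
                Object title = attrs.getAttribute(HTML.Attribute.TITLE);
                if (title != null) {
                    titles.add(title.toString());
                }
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(html, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String sample = "<html><body>"
                + "<a href=\"t1.html\" title=\"first thread\">one</a>"
                + "<a href=\"t2.html\" title=\"second thread\">two</a>"
                + "</body></html>";
        System.out.println(collectTitles(new StringReader(sample)));
    }
}
```

A parser like this reads pages; it does not click buttons or run JavaScript, so it would not cover the interaction part of the question.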

santiagozky 1 Newbie Poster

15 Years Ago

The concept of a group is quite simple. Look at my regexp:

="<img src=\"/(\\w*).jpg\" alt=\"Picture\"/></div>";

Note that \\w* is within parentheses. This tells the pattern to 'store' what is in that portion of the regular expression (the filename in my case); you can then access the stored string with matcher.group(1) (0 is the entire expression and 1 is the first group defined in the expression).

If your regular expression looks like (car([0-9]*)), then the first group is (car([0-9]*)) (car, car1, car11, car16, etc.) and the second group is ([0-9]*) (empty, 1, 3, 7, etc.).

This is because sometimes you need to find a pattern and use information from it. Groups allow you to do this.
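
To make the group numbering concrete, here is a tiny sketch using the (car([0-9]*)) expression from this post. The class name GroupNumbering and the sample inputs are made up; groups are numbered by the position of their opening parenthesis.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {

    // Returns {group 0, group 1, group 2} for the first match, or an empty array.
    static String[] groups(String input) {
        Matcher m = Pattern.compile("(car([0-9]*))").matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(0), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] g = groups("my car16 is red");
        System.out.println("group 0 (whole match): " + g[0]); // car16
        System.out.println("group 1 (outer parens): " + g[1]); // car16
        System.out.println("group 2 (inner parens): " + g[2]); // 16
    }
}
```

With the input "car" alone, group 2 is the empty string, since [0-9]* is allowed to match nothing.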

Edited 15 Years Ago by santiagozky because: corrected what I say about group numbers


KirkPatrick 28 Junior Poster · Answer 1 · 2009-11-06T01:15:41+00:00

Much appreciated response, Ezzaral. Just one quick question: when reading the text from the URL, does that automatically read the source of the page?

KirkPatrick 28 Junior Poster · Answer 2 · 2009-11-06T01:37:28+00:00

I don't get your question. Are you asking how to get all the titles programmatically or how to store the title strings?
If you are asking how to get the titles, I would suggest reading line by line and using a regular expression to get the string you want.
Take a look at the java Pattern class: http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
You probably will be doing something like title=\"([a-zA-Z0-9])+\" and then get the string from the capturing group. (I haven't done this in a while so the example is probably wrong, but it is the idea.)
I did something like this but I have it at home. I'll rescue it at night to show you.

I'll see if I can clear up my question a bit better for your understanding.

What I want to do is create a program that will read a webpage's source, grab a specific field, and write that field's string to a text file.

So another example would be, you're viewing this forum and lets say there is a field in the source named thread title.

Now in the source you see: thread_title="title1"

A few lines down in the same source you see thread_title again but this time it is: thread_title="title2"

Now lets say there are a hundred of these per page, and I am wanting to get all of them and write them to a text file, once page one is done I want to go to the next page (by clicking the next page button/link) and do the same thing.

So I would want my text file to look like this:

title1
title2
title3
title4
etc...

I hope that clears up what I am wanting to do for you, if you have an example that would be awesome. I tend to do better when reading how things work if I have an example in front of me as well.

So to answer your question, I am looking how to store the title strings :)

I'll be reading on patterns, regular expressions, and reading a url in the mean time. Thanks for the help so far guys
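
For the storing part, one possible sketch: append each page's titles to a text file, one per line, so that page two simply adds to what page one wrote. The class name TitleFileWriter, the method names and the file location are assumptions; Files.write and Files.readAllLines are standard java.nio.file calls.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

class TitleFileWriter {

    // Appends one title per line, creating the file if it does not exist yet.
    static void appendTitles(Path file, List<String> titles) {
        try {
            Files.write(file, titles, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file back as a list of lines.
    static List<String> readTitles(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical location: a fresh file in the system temp directory.
        Path file = Paths.get(System.getProperty("java.io.tmpdir"),
                "titles-" + System.nanoTime() + ".txt");
        appendTitles(file, Arrays.asList("title1", "title2")); // titles from page 1
        appendTitles(file, Arrays.asList("title3", "title4")); // titles from page 2
        System.out.println(readTitles(file));
    }
}
```

Appending per page means nothing is lost if a later page fails to load; collecting everything in one list and writing once at the end would work just as well for small runs.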

Ezzaral 2,714 Posting Sage Team Colleague Featured Poster · Answer 3 · 2009-11-06T01:49:11+00:00

Much appreciated response, Ezzaral. Just one quick question: when reading the text from the URL, does that automatically read the source of the page?

Give it a quick try and you'd have a definitive answer. :)

Ezzaral 2,714 Posting Sage Team Colleague Featured Poster · Answer 4 · 2009-11-06T02:18:29+00:00

(Yes, the HTML page source is just marked up text)

KirkPatrick 28 Junior Poster · Answer 5 · 2009-11-06T23:05:30+00:00

Wow you guys have been very helpful! I'm appreciative for each of your posts

(Yes, the HTML page source is just marked up text)

Sorry I had left the other day before getting back to you about it, I had assumed it read straight from the source, but just wanted to make sure. Thanks for confirming that.
----

this is some code I wrote to make a custom crawler to get images from a site. just change the pattern and the method that stores the string (Bajador is a class that downloads the image in my case) and put everything in a loop that goes through all the urls you wish to examine.

try {
    URL url = new URL(baseURL + "/index.php?id=" + par);
    BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
    String line;
    while ((line = br.readLine()) != null) {
        line = line.trim();
        String pat = "<img src=\"/(\\w*).jpg\" alt=\"Picture\"/></div>";
        Pattern pattern = Pattern.compile(pat);
        Matcher matcher = pattern.matcher(line);
        boolean matchFound = matcher.find();
        if (matchFound) {
            String nombre = matcher.group(1);
            String urlI = baseURL + "/file/" + nombre + ".jpg";
            Bajador baj = new Bajador(baseURL, nombre, par);
            baj.start();
            // baja(baseURL, nombre);
        }
    }
} catch (MalformedURLException ex) {
    Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
} catch (IOException e) {
    Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, e);
}

Hope this helps you.

That helped me quite a bit, I read into patterns and while it seems a bit confusing I understood enough to where I believe you are correct in saying I only needed to search from a-z and 0-9.

I am left a bit confused about some of your code, so if you don't mind I'll just ask the questions here for a better understanding.

if (matchFound) {
    String nombre = matcher.group(1);
    String urlI = baseURL + "/file/" + nombre + ".jpg";
    Bajador baj = new Bajador(baseURL, nombre, par);
    baj.start();
    // baja(baseURL, nombre);
}

I understand that your code is for pulling images off a website (which I might be using later) and I also understand that if a match is found this is where I tell it what to do, but I'm a bit curious about your code here.

The java docs left me a bit confused about the matcher.group. Would you mind explaining the piece of code above?

Instead of bajador, I would put a class that will add the finding to an arraylist and write it to file, unless there is a better option
----

You could use HTML Parser, which is a Java library used to parse HTML in either a linear or nested fashion. It is an open source tool and can be found on SourceForge.
You could also use the Swing HTML Parser.

Thanks for pointing this out, this looks like it cuts out a lot of the work :] I'll check into this as well. However, one thing is I'm also looking to interact with the web page too such as clicking buttons, I'm not sure if this would allow that.

----

Another question I have: will I be able to manipulate menus and so on with similar code? I am also looking to do this through the search feature on forums, so would I be able to click buttons and select things from drop-down menus?

KirkPatrick 28 Junior Poster · Answer 6 · 2009-11-07T00:18:22+00:00

So essentially my if(matchFound) would be much simpler, like so?

ArrayList<String> info = new ArrayList<String>();

if (matchFound) {
    String name = matcher.group(1);
    info.add(name); // just add the info from the string inside title
}
// print out info in arraylist
// write info from arraylist to file

Thank you for going into detail about the groups, that is quite helpful. Just curious, but why do you include: alt=\"Picture\"/></div> ?

santiagozky 1 Newbie Poster · Answer 7 · 2009-11-07T00:32:47+00:00

Thank you for going into detail about the groups, that is quite helpful. Just curious, but why do you include: alt=\"Picture\"/></div> ?

the page had a lot of pictures, I only wanted the ones with that attribute. The program was intended to work with an specific site.

KirkPatrick 28 Junior Poster · Answer 8 · 2009-11-07T00:34:54+00:00

the page had a lot of pictures, I only wanted the ones with that attribute

Alright, thanks bud. I will continue working on this side project of mine and I'll be sure to report back on the progress.

I will need to look into expressions a bit more as i don't fully understand them all, some seem pretty complicated.

Also, are you aware of how to interact with websites? Such as the buttons, drop menus, etc?

santiagozky 1 Newbie Poster · Answer 9 · 2009-11-07T00:41:08+00:00

Alright, thanks bud. I will continue working on this side project of mine and I'll be sure to report back on the progress.
I will need to look into expressions a bit more as i don't fully understand them all, some seem pretty complicated.
Also, are you aware of how to interact with websites? Such as the buttons, drop menus, etc?

I'm glad to help.

No, I have never tried to do anything like that. I suppose that you would need a complete embedded browser (most likely with JavaScript support) in order to use JavaScript to simulate the events. But I'm just thinking out loud.

KirkPatrick 28 Junior Poster · Answer 10 · 2009-11-07T01:02:01+00:00

Sorry, I must have missed this in my previous posts, but what did your baseUrl and par look like?

I would assume it is something like this:

baseUrl = "http://www.daniweb.com/forums/forum9"
par = "-desc-"(0-9)".html"

Or am I thinking of it wrong?

KirkPatrick 28 Junior Poster · Answer 11 · 2009-11-07T02:08:14+00:00

I was testing it out and am wondering why it only prints out 1 letter at a time?

Here is my example code:

String link = "http://www.daniweb.com/forums/forum9.html";
        ArrayList pageInfo = new ArrayList();
        
        try {
            URL url = new URL(link);
            BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
            
            String line;

            while((line = br.readLine()) != null) {
                line = line.trim();
                String pat = "<div class=\"([a-zA-Z0-9])+\""; //pattern check between a-z and 0-9 -
                Pattern pattern = Pattern.compile(pat);
                Matcher matcher = pattern.matcher(line);
                
                boolean matchFound = matcher.find();
                if (matchFound) {
                    String name = matcher.group(1);
                    pageInfo.add(name);
                    System.out.println(name);
                    
                }
            }
            
            System.out.println(pageInfo);
            
        } catch (MalformedURLException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException e) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, e);
        }
        
    }

and here is what the output is:

p
d
g
x
c
c
c
c
v
i
e
e
v
x
c
c
c
c
2
x
c
c
c
c
t
d
m
t
[p, d, g, x, c, c, c, c, v, i, e, e, v, x, c, c, c, c, 2, x, c, c, c, c, t, d, m, t]

I assume I would have to change my pattern?

santiagozky 1 Newbie Poster · Answer 12 · 2009-11-07T03:46:56+00:00

Yes, the problem is the pattern:

([a-zA-Z0-9])+

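[a-zA-Z0-9] means one character, a letter or a number.
+ means 'at least one time'.
But you are saying "at least one time the group consisting of one character" and you should say "the group consisting of at least one character". Move the + inside the parentheses: ([a-zA-Z0-9]+). As written, the group is captured again on every repetition, so group(1) only keeps the last character matched.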
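
A quick side-by-side sketch of the two patterns. The class name QuantifierPlacement and the sample line are made up; the behaviour shown is how java.util.regex treats a repeated capturing group, which keeps only its final repetition.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierPlacement {

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "<div class=\"postbody\">";

        // + outside the parentheses: the one-character group repeats.
        System.out.println(firstGroup("<div class=\"([a-zA-Z0-9])+\"", line)); // prints y

        // + inside the parentheses: one group of one or more characters.
        System.out.println(firstGroup("<div class=\"([a-zA-Z0-9]+)\"", line)); // prints postbody
    }
}
```

That matches the output posted earlier: each printed letter was the last character of a class name.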

KirkPatrick 28 Junior Poster · Answer 13 · 2009-11-12T22:40:09+00:00

I apologize that I haven't had the chance to reply back with progress. I have managed to get it to do what I was intending.

My next step is being able to click buttons and links through my java program. I believe I need to read up on http post.

Thanks for all your help, it's much appreciated.
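
As a starting point for the HTTP POST reading, here is a hedged sketch of submitting a form with the JDK's own HttpURLConnection. Everything specific in it is an assumption: the class name FormPoster, the field names query and forum, and whatever address you pass to post. A real forum search would need the field names and target URL taken from its own form markup, and possibly cookies or a login.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class FormPoster {

    // Builds an application/x-www-form-urlencoded body: name=value&name=value
    static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encode(field.getKey())).append('=').append(encode(field.getValue()));
        }
        return body.toString();
    }

    private static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Sends the fields as a POST request and returns the HTTP status code.
    static int post(String address, Map<String, String> fields) throws IOException {
        byte[] body = encodeForm(fields).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        // Hypothetical form fields; no request is sent here.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("query", "java regex");
        fields.put("forum", "9");
        System.out.println(encodeForm(fields)); // prints query=java+regex&forum=9
    }
}
```

"Clicking" a plain link is just another GET of the link's href, so following the next-page link can reuse the URL-reading code; POST is only needed for forms that submit that way.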
