Hi Everybody!

I'm trying to create my own web accelerator/browser. When you open a page, it will take all the links on that page and preload them. I have two related questions: how do you retrieve the HTML source code of a page, and how do you parse that huge string to find everything inside the quotes of an <a></a> tag?

Just so you know, to make a link in HTML, you use the following code: <a href="WHAT I WANT TO PARSE">WHAT TEXT WILL BE DISPLAYED ON THE PAGE</a>

Thanks for all your help.

You will need to fetch the complete HTML anyway, otherwise you can't render it :)

If the data is properly formatted XHTML it's as easy as 1-2-3: just create a DOM parser, look for all "a" tags, then take the href attributes from those.
If it's not properly formatted XHTML you're out of luck and will basically have to write something to do that yourself (and handle all the possible malformed variants, like uppercase tags and combinations of upper and lowercase).
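
For the well-formed XHTML case, here is a minimal sketch of the DOM approach using the parser that ships with the JDK. The class name XhtmlLinks and the method extractHrefs are made up for illustration, and the empty entity resolver is an assumption to keep the parser from downloading a DTD; input that is not well-formed XML will make parse() throw.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name is not from this thread.
final class XhtmlLinks {

    // Parses well-formed XHTML and returns the href of every <a> element that has one.
    static List<String> extractHrefs(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Assumption: resolve any external DTD to an empty stream so that
        // parsing never goes out to the network.
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        List<String> hrefs = new ArrayList<>();
        // The default parser is not namespace-aware, so lowercase "a" matches XHTML anchors.
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                hrefs.add(anchor.getAttribute("href"));
            }
        }
        return hrefs;
    }
}
```

Named anchors such as <a name="top"> are skipped because they carry no href attribute.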

Well, I actually wrote a parser for finding links inside of HTML just the other day at work.

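import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scans staticDetail for tags that contain an href and returns the last
// URL found, or null if there is none.
public static String addTarget(String staticDetail)
{
  String returnUrl = null;

  // Match the opening bracket of every tag.
  Pattern pattern = Pattern.compile("<+");
  Matcher matcher = pattern.matcher(staticDetail);

  while(matcher.find())
  {
    int lessIndex = matcher.start();
    int greatIndex = staticDetail.indexOf(">", lessIndex + 1);
    int aIndex = staticDetail.indexOf("a", lessIndex + 1);
    int hrefIndex = staticDetail.indexOf("href", aIndex + 1);
    if(aIndex != -1 && hrefIndex != -1)
    {
      // Only accept the "a" and the "href" if both sit inside this tag.
      if(aIndex < greatIndex && hrefIndex < greatIndex)
      {
        int firstQuoteIndex = staticDetail.indexOf("\"", hrefIndex + 1);
        int secondQuoteIndex = staticDetail.indexOf("\"", firstQuoteIndex + 1);
        // Skip the opening quote so only the URL itself is kept.
        if(firstQuoteIndex != -1 && secondQuoteIndex != -1)
          returnUrl = staticDetail.substring(firstQuoteIndex + 1, secondQuoteIndex);
      }
    }
  }
  return returnUrl;
}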

Note that I reworked some of the code above to better fit your needs, and I haven't tested it.

Regards,

Nate

Thanks, Nate! Does that return all of the links or just one of them? Also, how do you retrieve the HTML from a web page?

Thanks

One more thing: I made a web browser (the code is below) but it doesn't work well on some sites. For example, it won't connect to GMail. If you have any suggestions I would appreciate them.

WEB BROWSER CODE:

import java.awt.*;
import java.awt.event.*;
import javax.swing.*;
import javax.swing.event.*;
import java.io.*;
import java.net.*;

public class Main extends JFrame
{
  private JTextField enterField;
  private JButton goToURL;
  private JEditorPane contentsArea;
  private JPanel top;

  public Main ()
  {
    super("Alpha Browser");
    setSize(500,400);
    setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);

    Container container = getContentPane();

    top = new JPanel();
    enterField = new JTextField(40);
    // Use a Swing JButton rather than an AWT Button so all components are Swing.
    goToURL = new JButton("Visit");
    goToURL.addActionListener(
        new ActionListener()
        {
          public void actionPerformed (ActionEvent event)
          {
            loadPage(enterField.getText());
          }
        }
    );

    top.add(enterField);
    top.add(goToURL);

    container.add(top, BorderLayout.NORTH);

    contentsArea = new JEditorPane();
    contentsArea.setEditable(false);
    contentsArea.addHyperlinkListener(
        new HyperlinkListener()
        {
          public void hyperlinkUpdate(HyperlinkEvent event)
          {
            if(event.getEventType() == HyperlinkEvent.EventType.ACTIVATED)
              loadPage(event.getURL().toString());
          }
        }
    );

    container.add( new JScrollPane(contentsArea),
                   BorderLayout.CENTER);

    // Show the frame only after every component has been added.
    setVisible(true);
  }

  private void loadPage(String loc)
  {
    try
    {
      contentsArea.setPage(loc);
      enterField.setText(loc);
    }
    catch (IOException ioException)
    {
      JOptionPane.showMessageDialog(null,
                                    "Unable to contact URL.\n\nPossible reasons for error:\n"+
                                    "1.) Server Timeout\n2.) Mis-typed URL\n3.) Internet connection error\n"+ioException.toString(),
                                    "Error in Contacting Given URL",
                                    JOptionPane.ERROR_MESSAGE);
    }
  }

  public static void main (String [] args)
  {
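    // Build the UI on the event dispatch thread, as Swing expects.
    SwingUtilities.invokeLater(
        new Runnable()
        {
          public void run()
          {
            new Main();
          }
        }
    );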
  }
}

Yes, the posted code will find every URL in an HTML document.

I also looked at your code and ran it on my machine (Java 1.5), and it seems to connect to Gmail just fine.

Regards,

Nate

Thanks Nate.

Two questions:
First, how do I get the HTML code into a String?
Second, what do you pass through your method and what does it return?

Thanks.

Hooknc, would you mind answering my questions? Thanks.

Sure.

I actually don't know how to get the HTML. I tried about four different ways of getting it and wasn't good at any of them. (InputStreams really aren't my strong point.) I don't know what my problem was. I was hoping that the text area would return the HTML, but it removes A LOT of the HTML, so that isn't a good solution.
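
For what it's worth, one common way to get a page into a String is to open a URLConnection and read its stream to the end. This is only a sketch: the class name PageFetcher, its method names, the ten-second timeouts, and the UTF-8 decoding are all assumptions (a real page may declare a different charset).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are not from this thread.
final class PageFetcher {

    // Drains a Reader into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            text.append(buffer, 0, count);
        }
        return text.toString();
    }

    // Opens the address and returns the whole response body as text.
    static String fetchHtml(String address) throws IOException {
        URLConnection connection = URI.create(address).toURL().openConnection();
        // Assumed timeouts so a dead server cannot hang the caller forever.
        connection.setConnectTimeout(10000);
        connection.setReadTimeout(10000);
        // Assumption: the page is UTF-8 encoded.
        try (Reader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            return readAll(reader);
        }
    }
}
```

The returned String can then be handed to whatever link parser you end up with, for example fetchHtml("http://example.com/").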

The method as written needs to be reworked for your purpose. It should return a List, and at the point where returnUrl gets set, that URL should be added to the list instead.
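
A rough sketch of what that List-returning version could look like, this time written around a single regular expression rather than the index arithmetic in the method posted earlier. The class name LinkCollector and method name collectLinks are made up for illustration, and the pattern only handles href values wrapped in single or double quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are not from this thread.
final class LinkCollector {

    // Matches an <a ...> tag (any letter case) and captures the quoted href value.
    // \b after "<a" keeps tags such as <abbr> from matching; [^>]*? keeps the
    // search for href inside the same tag.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every quoted href found in the HTML, in document order.
    static List<String> collectLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2)); // group 2 is the text between the quotes
        }
        return links;
    }
}
```

Unquoted href values and links generated by scripts are not picked up by this pattern.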

Regards,

Nate

A JTextPane will use a filter to format the text. That filter will probably (I've not tried) also be applied when retrieving the text.
Try a JEditorPane instead (maybe just casting it to JEditorPane and asking for the text will be enough), or try getting the text through the model instead of directly.
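
If you go through the model, Swing's HTMLDocument can hand you the anchor tags directly, so you may not need the raw text at all. The sketch below is untested against the browser code above; the class name ModelLinks and its method names are made up, and it builds the document from a String with HTMLEditorKit only so that it runs without a window. With a JEditorPane showing an HTML page, the document would presumably come from a cast of contentsArea.getDocument() once the page has finished loading.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class; the names are not from this thread.
final class ModelLinks {

    // Walks every <a> tag in the document model and collects its href, if any.
    static List<String> linksFromDocument(HTMLDocument doc) {
        List<String> links = new ArrayList<>();
        for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A); it.isValid(); it.next()) {
            Object href = it.getAttributes().getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
        return links;
    }

    // Builds an HTMLDocument from a String using Swing's own lenient HTML parser.
    static HTMLDocument parse(String html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Avoids an exception when the page carries its own charset <meta> tag.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);
        return doc;
    }
}
```

Because this relies on Swing's parser rather than an XML parser, it does not require the page to be well-formed XHTML.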
