Extracting Page Source

Question

rEhSi_123 8 Junior Poster

13 Years Ago

Hi Guys,

Need a bit of advice. Basically I am building a webcrawler, and in order to do so
I have to extract the page source of a webpage, which I can do like this:

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click

    ' Request the URL typed into TextBox2 and read the raw HTML the server returns.
    Dim request As System.Net.HttpWebRequest = DirectCast(System.Net.WebRequest.Create(TextBox2.Text), System.Net.HttpWebRequest)

    ' The Using blocks dispose the response and the reader even if an exception is thrown.
    Using response As System.Net.HttpWebResponse = DirectCast(request.GetResponse(), System.Net.HttpWebResponse)
        Using sr As New System.IO.StreamReader(response.GetResponseStream())
            TextBox1.Text = sr.ReadToEnd()
        End Using
    End Using

End Sub

The above code works fine on most of the websites I have tried, but for some reason it fails to extract the HTML contents of a few websites, where the message posted by a user is nowhere to be seen. The webpage in question is this: http://www.vbforums.com/showthread.php?t=654378

Is there something I have missed, or is it some kind of forum protection that is preventing the VB application from extracting the whole page source?

Please advise.

vb.net


All 6 Replies

codeorder 197 Nearly a Posting Virtuoso

13 Years Ago

Mine is also returning a "The remote server returned an error: (404) Not Found." error.

In such cases, a hidden WebBrowser could do the trick by extracting the .OuterText.

Public Class Form1

    Private Sub Form1_Load(sender As System.Object, e As System.EventArgs) Handles MyBase.Load
        Me.Cursor = Cursors.WaitCursor
        WebBrowser1.Navigate("http://www.vbforums.com/showthread.php?t=654378")
    End Sub

    Private Sub WebBrowser1_DocumentCompleted(sender As System.Object, e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
        TextBox1.Text = WebBrowser1.Document.Body.OuterText
        Me.Cursor = Cursors.Default
    End Sub
End Class

Locating the "Originally Posted by..." line, you know that the following line(s) contain the post. The signature line of "______" could stop the content extraction, although in some cases there might not be a signature.
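
The marker-based idea above can be sketched roughly as follows. This is only a minimal sketch, not tested against the actual forum page: the module name `PostTextExtractor` and the function `ExtractAfterMarker` are made up for illustration, and it assumes the page text arrives as plain lines (for example from `.OuterText`) and that a signature, when present, starts with a run of at least five underscores.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module PostTextExtractor

    ' Hypothetical helper: returns the lines that follow the first line starting
    ' with startMarker, stopping at a signature line (a run of underscores)
    ' or at the end of the text if no signature is found.
    Function ExtractAfterMarker(ByVal pageText As String, ByVal startMarker As String) As String
        Dim lines As String() = pageText.Replace(vbCrLf, vbLf).Split(ControlChars.Lf)
        Dim collected As New List(Of String)
        Dim inPost As Boolean = False

        For Each line As String In lines
            Dim trimmed As String = line.Trim()
            If Not inPost Then
                ' Skip everything up to and including the marker line.
                If trimmed.StartsWith(startMarker, StringComparison.OrdinalIgnoreCase) Then inPost = True
                Continue For
            End If
            ' Assumed signature separator: five or more underscores.
            If trimmed.StartsWith("_____") Then Exit For
            collected.Add(line)
        Next

        Return String.Join(vbLf, collected.ToArray()).Trim()
    End Function

    Sub Main()
        ' Made-up sample text standing in for WebBrowser1.Document.Body.OuterText.
        Dim sample As String = "Originally Posted by someone" & vbLf &
                               "This is the post." & vbLf &
                               "__________" & vbLf &
                               "My signature"
        Console.WriteLine(ExtractAfterMarker(sample, "Originally Posted by"))
    End Sub

End Module
```

When there is no signature line, this sketch keeps collecting until the end of the text, so a real crawler would probably need a second stop marker as well.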

Hope this helps and good luck.


codeorder 197 Nearly a Posting Virtuoso

13 Years Ago

Glad I could help. :)

You can also extract other stuff from the WebBrowser:

With WebBrowser1.Document.Body
    TextBox1.Text = .OuterText
    'TextBox1.Text = .InnerText
    'TextBox1.Text = .InnerHtml
    'TextBox1.Text = .OuterHtml
End With

I tested the .InnerHtml and it did not return the posts in it. Usually, .InnerHtml is the source code you need when extracting data.


rEhSi_123 8 Junior Poster · Answer 1 · 2011-07-06T16:32:28+00:00


codeorder, thanks. You might have saved me from writing additional code for parsing the data lol

Anyhow I made the following changes according to my needs:

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    WebBrowser1.Navigate(TextBox2.Text)
End Sub

Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
    TextBox1.Text = WebBrowser1.Document.Body.OuterText
End Sub

One thing I would like to add is that it seems to return only the text from the webpage without any HTML, as shown below. Is that how the WebBrowser works? At the end of the day my application will be a forum webcrawler that is only interested in the forum posts, not the HTML.

Please advise.

 IT Professionals
 IT Management
 CIO Update
 Security
 Storage
 Server
 Networking
 Small Business
 Enterprise Applications
 Database
 Mobile
 IT News
 Developers
 Architect / Enterprise
 General Development / OS
 Java Development / OS
 Microsoft Technology
 PHP Development
 Web Development
 Solutions
 HotList
 Video
 eBook Library
 Webopedia
 Login
 Manage My Profile
 Register
 Why Join?

To register for an Internet.com membership to receive newsletters and white papers, use the Register button ABOVE.
To participate in the message forums BELOW, click here

Visual Basic Code from FreeVBCodeGet and Set Word file Attributes with VB.NET
Calculate Age in Visual Basic 2005, Counting Leap years
Spatial Matrix Memory Game
Introduction to Cryptoanalysis
Linked List implementation in Visual Basic
Submit your code to FreeVBCode

VBForums > Visual Basic > Visual Basic .NET
VS 2010 Syncronize Files
User NameRemember Me?
Password

RegisterFAQCalendarToday's PostsVB JobsSearch

Search Forums

Show Threads Show Posts
Tag Search
Advanced Search

Go to Page...

Thread Tools Display Modes

Yesterday, 02:28 PM #1
johnweidauer
Hyperactive Member

Join Date: Sep 02
Location: SLC, UT
Posts: 307
Syncronize Files

I have written a stupid backup utility that copies new or modified files. That's fantastic until I come across files that are 11 GB in size that have been modified. I have a feeling I can compare the file to be backed up and the file that will be over-written (old version of file) and "synchronize" the file instead of copying the entire 11 GB. I am not sure how to start researching this, could anyone point me in the right direction?

To the world you may just be one person, but to this one person, you just might be the world.

johnweidauer
View Public Profile
Find More Posts by johnweidauer

Yesterday, 06:17 PM #2
weirddemon
Wait... what?

Join Date: Jan 09
Location: USA
Posts: 3,232
Re: Syncronize Files

Hm... if the one file is 11GBs in size, I don't think you can copy bits of it like I think you're suggesting. The only thing you can do, as far as I know, is compare the modified dates and back it up like that.

It seems odd that any one file is 11GBs though. What kind of file is it? I'm intrigued.

CodeBank contributions: Process Manager, Temp File Cleaner
Virus problem? No problem! Try AVERT

AVERT Twitter
AVERT Facebook Page

etc...................

rEhSi_123 8 Junior Poster · Answer 2 · 2011-07-07T04:43:12+00:00


'codeorder' thanks again for your help matey :) .InnerHtml and .OuterHtml seem to extract the same HTML content, unless I was seeing it differently? And the same goes for .InnerText and .OuterText.

Also I wanted to ask you this:

1. Yes, I am able to extract the text without the HTML, which is what I am after, but I was wondering whether there is a way of extracting only the USER posts, with the date of the post and so on. Please advise :)

For example, in this forum a user has posted this:

Hi guys, I have tis select statement and I am struggling with it on what I want to display so if anyone could help.
("SELECT MovieDetails.MovieID,MovieDetails.ActorID,
Movies.MovieID, Movies.FilmName,Movies.CatID,Actors.ActorID,
Actors.ActorName, Movies.Poster, Categories.CatName,
Categories.CatID, Movies.FilmDate,
FROM MovieDetails INNER JOIN Actors ON
MovieDetails.ActorID = Actors.ActorID INNER JOIN Movies
ON MovieDetails.MovieID = Movies.MovieID INNER
JOIN Categories on Movies.CatID = Categories.CatID")
This select statement is populated in a dataset and into bindingsource. The actors names are filled in a Listbox so when an Actor is selected, the BindingSource is filtered and display all the movies that Actor is in and display with the details, so the bidingcontext can be move forward and backwards.. Now when the movie is display "only that particular Actor's name is displayed". So how can I display all the Actors names when a movie is selected in this filter. Thanks

But how can I extract this alone and store it in, say, a database?

2. Also what about a page containing xml?

codeorder 197 Nearly a Posting Virtuoso · Answer 3 · 2011-07-09T22:49:13+00:00

You could probably locate the "Posts:..." line and extract from there. As for storing it in a db, I'm not a db coder.

For pages containing XML: if it is a pure XML page with no HTML involved, post some of the XML content and specify what you want to extract. Otherwise, post the HTML & XML page content.

rEhSi_123 8 Junior Poster · Answer 4 · 2011-07-09T23:38:49+00:00


Well, I can extract the posts now, but only by knowing the id, as shown below:

TextBox1.Text = WebBrowser1.Document.GetElementById("post-1670732").InnerText

But is there any way, using regex or similar, to find ids of this sort and
output the result?
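
One possible way to find such ids with a regex is sketched below. It is only a sketch: it assumes every post id has the form `post-` followed by digits, like the `post-1670732` example above, and the names `PostIdFinder` and `FindPostIds` are invented for illustration. Regex matching on raw HTML is fragile, so treat it as a starting point.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module PostIdFinder

    ' Hypothetical helper: returns every id attribute value of the form
    ' post-<digits> found in the given HTML string, in document order.
    Function FindPostIds(ByVal html As String) As List(Of String)
        Dim ids As New List(Of String)
        ' Matches id="post-123" or id='post-123' preceded by whitespace.
        Dim pattern As String = "\sid\s*=\s*[""'](post-\d+)[""']"
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            ids.Add(m.Groups(1).Value)
        Next
        Return ids
    End Function

    Sub Main()
        ' Made-up HTML standing in for the real page source.
        Dim sample As String = "<div id=""post-1670732"">first</div>" &
                               "<div id=""sidebar""></div>" &
                               "<div id='post-1670740'>second</div>"
        For Each postId As String In FindPostIds(sample)
            Console.WriteLine(postId)
        Next
    End Sub

End Module
```

Each id found this way could then be passed to `WebBrowser1.Document.GetElementById(...)` as in the line above. Alternatively, inside the DocumentCompleted handler the same `^post-\d+$` style check could be applied to the `Id` property of each `HtmlElement` in `WebBrowser1.Document.All`, which avoids running a regex over the HTML at all.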

I have also noticed that forums use links like

<a href="/software-development/vbnet/58">

Is it possible to extract these during forum visits and output the data?
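
Extracting such links could look something like the sketch below. The names `LinkFinder` and `FindLinkTargets` are made up for illustration, and the pattern assumes the href value is wrapped in single or double quotes, so unquoted attributes would be missed.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkFinder

    ' Hypothetical helper: returns the href value of every <a ...> tag
    ' in the given HTML string, in document order.
    Function FindLinkTargets(ByVal html As String) As List(Of String)
        Dim targets As New List(Of String)
        ' <a, whitespace, any other attributes, then href="..." or href='...'.
        Dim pattern As String = "<a\s[^>]*?href\s*=\s*[""']([^""']*)[""']"
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            targets.Add(m.Groups(1).Value)
        Next
        Return targets
    End Function

    Sub Main()
        ' Made-up HTML using the link from the post above.
        Dim sample As String = "<a href=""/software-development/vbnet/58"">VB.NET</a> " &
                               "<a class=""nav"" href='/forums'>Forums</a>"
        For Each target As String In FindLinkTargets(sample)
            Console.WriteLine(target)
        Next
    End Sub

End Module
```

With the WebBrowser control, another option is to loop over `WebBrowser1.Document.Links` and read `GetAttribute("href")` from each element, which lets the browser do the HTML parsing.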
