Read Webpage HTML - Help with understanding code involving threading

Question

JOSheaIV 119 C# Addict

12 Years Ago

A while back I came here asking for help reading data from a webpage, specifically its HTML. I was running into an issue where the data wasn't actually being read (as if it hadn't loaded). I had to put the program aside for a while as life got busy, but I finally have time to come back to it.

The page I am trying to read looks like this (it's a roster, and there are many of them, one for each clan):
http://worldoftanks.com/community/clans/1000000954-SAC/

I initially tried using code like the following:

WebClient myWebClient = new WebClient(); // Create a new WebClient instance.

string dataDownloaded = "";

try
{
    // Download the web resource and save it into a data buffer.
    byte[] myDataBuffer = myWebClient.DownloadData(input[i]);

    // Decode the downloaded bytes into the string variable.
    dataDownloaded = Encoding.UTF8.GetString(myDataBuffer);
}
catch (Exception error)
{
    MessageBox.Show(error.ToString());
}

Unfortunately this did not work. Instead of reading in HTML like this (this is just one of many such rows; a clan has up to 100 players, so there can be up to 100 of these blocks)

<TD class="number t-number">1</TD>
<TD class="name t-name b-user js-rendered-template"><A href="/community/accounts/1000391751-AgentAlaskan/">AgentAlaskan</A></TD>
<TD class="role js-role js-rendered-template">Soldier</TD>
<TD class="member_since js-member-since js-rendered-template">21.06.2011</TD></TR>
<TR class="even clan-role-commander">

all I got was something like this (note there should be a space after "member_since ..." but it messed up the quote):

  <tbody id="member_table_container">
                  <tr class="js-template js-hidden">
                        <td class="number t-number"></td>
                        <td class="name t-name b-user"></td>
                        <td class="role js-role"></td>
                        <td class="member_since js-member-since"></td>
                  </tr>
            </tbody>

I searched around the web trying to find a way to read this data. I was pretty sure it had to do with JavaScript, or something like that. Then I finally stumbled on a piece of code that worked for my needs. The code looks like this (sorry for the format, it is copied straight from my program):

namespace WoTClanRead_3v3
{
//==========================================================================================
    class readInClanRoster
    {
//------------------------------------------------------------------------------------------
        private string GeneratedSource
        {
            get;
            set;
        }
//------------------------------------------------------------------------------------------
        private string URL
        {
            get;
            set;
        }
//------------------------------------------------------------------------------------------
        public string GetGeneratedHTML (string url)
        {
            URL = url;

            Thread t = new Thread(new ThreadStart(WebBrowserThread));
            t.SetApartmentState(ApartmentState.STA);
            t.Start();
            Cursor.Current = Cursors.WaitCursor;

            t.Join();
            Cursor.Current = Cursors.Default;


            return GeneratedSource;
        }
//------------------------------------------------------------------------------------------
        private void WebBrowserThread ()
        {
            WebBrowser wb = new WebBrowser();
            wb.Navigate(URL);

            wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(wb_DocumentCompleted);

            while (wb.ReadyState != WebBrowserReadyState.Complete)
                Application.DoEvents();

            //Added this line, because the final HTML takes a while to show up 
            GeneratedSource = wb.Document.Body.InnerHtml;

            wb.Dispose();
        }
//------------------------------------------------------------------------------------------
        private void wb_DocumentCompleted (object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            WebBrowser wb = (WebBrowser) sender;
            GeneratedSource = wb.Document.Body.InnerHtml;
        }
//------------------------------------------------------------------------------------------
    }
//==========================================================================================
}

Initially the code worked great. It finally read in the data; for the longest time before that, I had to open the developer tools in IE and save a .txt file that the program would then parse to get the data I required.

There are two problems with this code, however. First, I don't fully understand how it works, and this leads to my second problem: sometimes the page still does not load properly and I don't get the HTML I need (the result looks like the second quote I pasted above).

So I assumed, "oh, the page didn't load, let's run it again". Then I stumbled on a problem. If the page doesn't load (I determine this by parsing the string that was read in; if the parser can't find any player links, it didn't get the full page), I stop parsing and ask the user to try again.
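A minimal sketch of the kind of check described above, assuming player links always have the form /community/accounts/<id>-<name>/ as in the sample row quoted earlier. The class and method names (RosterCheck, ExtractPlayerNames, LooksFullyLoaded) are hypothetical and not taken from the actual program:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RosterCheck
{
    // Matches hrefs such as /community/accounts/1000391751-AgentAlaskan/
    // (pattern inferred from the sample row above; this is an assumption).
    // The opening quote is optional in case the browser emits unquoted attributes.
    private static readonly Regex PlayerLink = new Regex(
        @"href=""?/community/accounts/(\d+)-([^/""\s>]+)/",
        RegexOptions.IgnoreCase);

    // Returns the player names found in the HTML, in document order.
    public static List<string> ExtractPlayerNames(string html)
    {
        List<string> names = new List<string>();
        if (string.IsNullOrEmpty(html))
            return names;

        foreach (Match m in PlayerLink.Matches(html))
            names.Add(m.Groups[2].Value);

        return names;
    }

    // True when at least one player link is present, i.e. the roster
    // rows were filled in rather than left as the empty template row.
    public static bool LooksFullyLoaded(string html)
    {
        return ExtractPlayerNames(html).Count > 0;
    }
}
```

With the empty template markup from the second quote, LooksFullyLoaded would return false, which is the signal to stop parsing.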

Trying again means clicking the button that starts the process and going through the whole thing again. The problem I have noticed lately is that if it fails once, it will fail again, and on the third try the program pretty much locks up.

If you look at the code, it sets a wait cursor, which works kind of like a debugging tool. On that third try the cursor never goes back to default, and based on what I know about coding, it's as if it gets stuck in the while loop.
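One way to test that suspicion would be to give the wait loop an upper bound, so that a page that never reaches Complete produces a failure instead of a hang. This is only a sketch: BoundedWait and WaitForComplete are hypothetical names, the timeout value is up to the caller, and the method has to run on the same STA thread that created the WebBrowser:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Windows.Forms;

static class BoundedWait
{
    // Pumps window messages until the browser reports Complete
    // or the timeout elapses. Returns true on Complete, false on timeout.
    public static bool WaitForComplete(WebBrowser wb, TimeSpan timeout)
    {
        Stopwatch clock = Stopwatch.StartNew();

        while (wb.ReadyState != WebBrowserReadyState.Complete)
        {
            if (clock.Elapsed > timeout)
                return false;        // give up instead of looping forever

            Application.DoEvents();  // let the browser control process its messages
            Thread.Sleep(10);        // avoid spinning a CPU core at 100%
        }

        return true;
    }
}
```

In WebBrowserThread, the while loop could then become something like `if (!BoundedWait.WaitForComplete(wb, TimeSpan.FromSeconds(30))) { ... }`, leaving GeneratedSource empty on timeout so that the caller can report the failure.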

So my question now is: why is this happening? And can someone help explain this code I found on the web? Thanks in advance for the help.

(sorry this was a lot, but I feel good details help explain the situation better).

html-css multithreading

Edited 12 Years Ago by JOSheaIV


Teme64 215 Veteran Poster

12 Years Ago

I don't see any point in using threading in this code.

I wrote a simple example of how to get the content of the final page, i.e. the rendered text rather than the HTML code.
This code uses a button to start loading, a label to show what's happening, and a multiline textbox where I dump the result. You would use a parser at that point.

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Drawing;
using System.Text;
using System.Windows.Forms;

namespace BlogTest
{

    public partial class Form1 : Form
    {
        private WebBrowser wb;
        private string documentHTML = "";

        public Form1()
        {
            InitializeComponent();
            wb = new WebBrowser(); // A new browser instance
            // Add the handler for the document loading complete event once, here.
            // Adding it in button1_Click would attach another copy on every click.
            wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(wb_DocumentCompleted);
            label1.Text = "";
        }

        // Start loading a web page
        private void button1_Click(object sender, EventArgs e)
        {
            // URL as a hardcoded string
            string url = @"http://worldoftanks.com/community/clans/1000000954-SAC/";

            // Start loading
            label1.Text = "Loading, please wait...";
            button1.Enabled = false;
            textBox1.Text = "";
            wb.Navigate(url);
        }

        private void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            label1.Text = "Finished loading the page";
            button1.Enabled = true;
            // Get all the content from the page
            documentHTML = wb.Document.Body.OuterText;
            // Here goes the parser for data, I just dump the text to a textbox
            textBox1.Text = documentHTML;
        }
    }
}

The basic idea in the code is pretty much the same as in your code. The main difference is using wb.Document.Body.OuterText instead of wb.Document.Body.InnerHtml. Hope this helps a bit :)


JOSheaIV 119 C# Addict · Answer 1 · 2012-07-01T22:58:20+00:00

I thought this code looked somewhat familiar. It doesn't work; I get an old error I have seen before.

It doesn't give me any data at all. Something about the code I had allowed the page to truly load all of its data, not just this:

Log in or create account
Go to region
Russia

Europe

North America

Southeast Asia

Attention!
Your account is valid only within one region.Portal Forum Wiki Support /community/clans/1000000954-SAC/ HOME GAME Download Game Events Calendar Updates Full Guide Tankopedia Newcomer's Guide Account Security Clan Wars Global Map Clan Wars Guide Tournaments MEDIA Video Screenshots Art Renders COMMUNITY Players Clans Hall of Fame Useful Software Created: 19.05.2011 [SAC] Secret Agent Clan
There is no T95, there is only ZUUL!
Data as of 01.07.2012 16:36
Commander: Agent_Marshal Military Personnel (92)

User Name Position Joined

Clan Details

Battle Schedule

Owned Provinces

Go to List of Clans

© 2012 Wargaming.net All rights reserved.
Powered by BigWorld Technology™ © World of Tanks
Download Game
Clan Wars
Tournaments
Account
Create Account
Reset Password
Account Security
Community
Forum
Clans
Players
Hall of Fame
Support
Contact Us
EULA
Privacy Policy
Terms of Service

So no bingo yet.
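
The dump above stops at the "User Name Position Joined" header with no rows beneath it, which would be consistent with DocumentCompleted firing before the page's script has filled in the roster. A minimal sketch of one possible workaround: after DocumentCompleted, poll the document with a System.Windows.Forms.Timer and give up after a fixed number of attempts. It assumes the rows end up under the element with id member_table_container (taken from the markup quoted in the question) and that a filled row contains a link. RosterPoller, its constructor parameters and the Finished event are hypothetical names, not part of any library:

```csharp
using System;
using System.Windows.Forms;

class RosterPoller
{
    private readonly WebBrowser wb;
    private readonly Timer timer = new Timer(); // System.Windows.Forms.Timer, ticks on the UI thread
    private int attemptsLeft;

    // Raised once: with the roster HTML on success, or null if the rows never appeared.
    public event Action<string> Finished;

    public RosterPoller(WebBrowser browser, int intervalMs, int maxAttempts)
    {
        wb = browser;
        attemptsLeft = maxAttempts;
        timer.Interval = intervalMs;
        timer.Tick += OnTick;
    }

    public void Start()
    {
        timer.Start();
    }

    private void OnTick(object sender, EventArgs e)
    {
        HtmlElement table = null;
        if (wb.Document != null)
            table = wb.Document.GetElementById("member_table_container");

        // The empty template row has no links; a filled row links to the player.
        bool hasRows = table != null && table.GetElementsByTagName("a").Count > 0;

        attemptsLeft--;
        if (!hasRows && attemptsLeft > 0)
            return; // not there yet, wait for the next tick

        timer.Stop();
        if (Finished != null)
            Finished(hasRows ? table.InnerHtml : null);
    }
}
```

From wb_DocumentCompleted one might create `new RosterPoller(wb, 500, 20)`, subscribe to Finished, and call Start(); a null result would then mean the roster did not appear within roughly ten seconds. Because the timer ticks on the UI thread, no extra thread or Application.DoEvents loop is involved.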