Hi all. I need to code an offline crawler that acts like a web crawler on a given set of HTML pages. I know how to read the href attribute of an <a> tag and output those links to the screen. My plan is to modify my code to add more functionality. Instead of outputting the links to the screen, I want to put them in a data store and count the links to each page. After that, I want to take the link to the next HTML page from the data store and repeat the process. Finally, I would like to output the top five pages with the most links pointing to them. I know that is an earful, but any help would be appreciated. Below is my code so far. I have commented it as best I could.

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main()
{
    //Declaring a variable of type fstream to store the data
    fstream data_store;
    //Declaring two variables of type string to hold the file name and a line of text
    string filename, line;
    //Declare a variable of type integer to be used as a counter.
    int counter1;

    //Output a message to the screen
    cout << "Enter a file name" << endl;
    //Input the filename
    cin >> filename;
    //Open the file that was passed into the data store and read data from it
    data_store.open(filename.c_str(), ios::in);
    //while there is a line to pass into the data store or if the data_store does not reach the end of the file
    while(data_store >> line || !data_store.eof())
    {
        //Declare a temporary string variable and initialize it to blank
        string temp = "";
        //Initializes counter to zero every loop iteration
        counter1 = 0;
        //find returns 0 only when the text starts with "href=", so this is true just for such text
        if(!line.find("href="))
        {
            //The line of text gets the text starting at the 6th subscript to the size of the text minus six
            line = line.substr(6, counter1 - 6);
            //While the line at a subscript does not have single or double quotes,
            //and if the counter is less than the line's length
            while(line[counter1] != '\"' && line[counter1] != '\'' && counter1 < line.length())
            {
                    //Put the contents of line into temp
                    temp += line[counter1++];
            }
            
            //Output temp on a new line
            cout << temp << endl;
        }
    }
    //Close the file to release the file handle
    data_store.close();
}

OK, I think I see roughly what you're trying to do. I have some general points though:
Line 22 (the while loop): Be careful when reading text in with data_store >> line; the input is delimited by any whitespace. So for a line like

<a href="http://www.some.url.com" target="_blank">

your variable line will take the value <a, then href="http://www.some.url.com", then target="_blank">. In your case it doesn't matter, but since your variable is called line and not something like token, it implies that you might think you're reading in the whole line.
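To see the difference for yourself, here is a minimal sketch that splits the same text both ways. The helper names read_tokens and read_lines are made up for illustration, and it reads from a string instead of a file to keep the example short.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits text the way `stream >> token` does,
// i.e. on any run of whitespace.
std::vector<std::string> read_tokens(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string token;
    // Looping on the read itself stops as soon as a read fails.
    while (in >> token)
        tokens.push_back(token);
    return tokens;
}

// Hypothetical helper: splits text the way std::getline does,
// i.e. one entry per line, whitespace inside the line preserved.
std::vector<std::string> read_lines(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}
```

Passing the <a href=...> example above to read_tokens gives three separate pieces, while read_lines gives it back as one line. Using the read itself as the loop condition is also a bit safer than combining it with eof(), since a file that fails to open never reaches end-of-file.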

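Line 32 (the substr call): counter1 is guaranteed to be zero at this point, so counter1 - 6 asks for a substring of length -6! Because substr takes an unsigned count, -6 wraps around to a huge number and substr just returns the rest of the string, so it only works by accident. I think you mean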

line = line.substr(6, line.length() - 6);

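Also, before you do this, you should check that line is at least six characters long, because substr throws std::out_of_range if the starting position is past the end of the string.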
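One way to put those checks together is sketched below. The function name extract_href is hypothetical, and it assumes (as your code does) that the token starts with href= followed directly by a quote.

```cpp
#include <string>

// Hypothetical helper: returns the URL inside a token such as
// href="http://www.some.url.com", or an empty string if the token
// does not start with href= followed by a single or double quote.
std::string extract_href(const std::string& token)
{
    const std::string prefix = "href=";

    // The token must hold the prefix plus at least the opening quote.
    if (token.size() <= prefix.size() ||
        token.compare(0, prefix.size(), prefix) != 0)
        return "";

    const char quote = token[prefix.size()];
    if (quote != '"' && quote != '\'')
        return "";

    // The URL starts right after the opening quote.
    const std::string::size_type start = prefix.size() + 1;
    const std::string::size_type end = token.find(quote, start);

    // No closing quote in this token: take everything that is left.
    if (end == std::string::npos)
        return token.substr(start);

    return token.substr(start, end - start);
}
```

Because the length is checked first, substr is never called with a starting position past the end of the string, so a short token like href= just gives an empty result instead of throwing.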

I hope that helps a bit :o)
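For the counting part of the question, a std::map from link to count is probably the simplest data store. This is only a rough sketch: count_links, more_links and top_pages are names I made up, and it assumes you have already collected the links into a vector of strings.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: counts how many times each link target appears.
std::map<std::string, int> count_links(const std::vector<std::string>& links)
{
    std::map<std::string, int> counts;
    for (const std::string& link : links)
        ++counts[link];  // operator[] creates the entry at 0 on first use
    return counts;
}

// Orders pages by count, highest first; ties are broken alphabetically.
bool more_links(const std::pair<std::string, int>& a,
                const std::pair<std::string, int>& b)
{
    if (a.second != b.second)
        return a.second > b.second;
    return a.first < b.first;
}

// Hypothetical helper: returns at most n pages with the most links to them.
std::vector<std::pair<std::string, int>>
top_pages(const std::map<std::string, int>& counts, std::size_t n)
{
    // Copy the map into a vector so it can be sorted by count.
    std::vector<std::pair<std::string, int>> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(), more_links);
    if (ranked.size() > n)
        ranked.resize(n);
    return ranked;
}
```

Calling top_pages(counts, 5) and printing each pair would give the top five. To follow links from page to page, one option is to keep a list of filenames still to be opened and a record of the ones already read, so that no page is counted twice.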

Edited 5 Years Ago by ravenous: Corrected text
