My code is supposed to crawl web pages, index the links, then crawl those pages, and on and on again!
But it won't work.
I get no errors, so what is wrong?
I think it gets into the foreach but never makes it to the $DCheck if statement!

<?php

if (empty($_SESSION['page']))
{
    $original_file = file_get_contents("http://www.yahoo.com/");
}
else
{
    $original_file = file_get_contents($_SESSION['page']);
}

$stripped_file = strip_tags($original_file, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

// DEBUGGING
// $matches[0] now contains the complete A tags; ex: <a href="link">text</a>
// $matches[1] now contains only the HREFs in the A tags; ex: link

foreach ($matches[1] as $key => $value)
{
    echo "1";
    echo "2";

    $Check  = mysql_query("SELECT * FROM pages WHERE URL='$value'");
    $DCheck = mysql_num_rows($Check);

    if ($DCheck != 0)
    {
        mysql_query("INSERT INTO pages (url) VALUES ('$value')");

        $_SESSION['page'] = $matches[1];
        die($DCheck);
    }
}

?>


Are you sure that file_get_contents is allowed to fetch a remote page? Not every host enables allow_url_fopen, which remote fetching requires.
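
Here is a minimal sketch of that check with a cURL fallback (assuming the cURL extension is installed; fetch_page is just a made-up helper name):

<?php

// Minimal sketch: use file_get_contents() when allow_url_fopen is enabled,
// otherwise fall back to the cURL extension (assumed to be installed).
function fetch_page($url)
{
    if (ini_get('allow_url_fopen'))
    {
        return file_get_contents($url);
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $html = curl_exec($ch);
    curl_close($ch);

    return $html;
}

?>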

Then what else can I do to build a PHP crawler?

Hi,

I built a web crawler in PHP using the function exec to run the Unix command wget. It worked rather well.
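
Roughly like this (a rough sketch; wget being on the PATH and the URL are assumptions):

<?php

// Rough sketch of the exec()/wget approach (assumes wget is on the PATH):
// download the page to a temp file via the shell, then read it back in.
$url = 'http://www.example.com/';            // placeholder URL
$tmp = tempnam(sys_get_temp_dir(), 'crawl'); // temp file for the download

exec('wget -q -O ' . escapeshellarg($tmp) . ' ' . escapeshellarg($url));

$original_file = file_get_contents($tmp);
unlink($tmp);

?>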

One thing to bear in mind with crawlers is to add the URLs to a list rather than recursively crawling them, since PHP can hit recursion-depth limits (around 100 in some setups), although from your code I think you're using the list approach anyway.
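
A minimal sketch of that worklist idea (the seed URL is a placeholder):

<?php

// Minimal worklist sketch: keep a queue of URLs instead of recursing,
// so long link chains cannot exhaust the call stack.
$queue   = array('http://www.example.com/'); // placeholder seed
$visited = array();

while (!empty($queue))
{
    $url = array_shift($queue);

    if (isset($visited[$url]))
    {
        continue; // already crawled this one
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false)
    {
        continue; // fetch failed, move on
    }

    preg_match_all('/<a[^>]*href="([^"]*)"/i', $html, $m);
    foreach ($m[1] as $link)
    {
        $queue[] = $link; // enqueue instead of recursing
    }
}

?>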

R.

Hey everyone, I finally fixed this! But I still need your help. I now want to crawl for the h1 tag and also pull out the most-used keywords. How can I get the meta information? Here is what I have, with a rough sketch of the h1/meta part after it. Also, what is an alternative that works reliably instead of file_get_contents()? file_get_contents() won't work on some sites' servers, like Google's.

<?php

if (empty($_SESSION['page']))
{
    $original_file = file_get_contents("http://www.collegefansite.com/");

    $connect = mysql_connect("127.0.0.1", "root", "");
    if (!$connect)
    {
        die("MySQL could not connect!");
    }

    $DB = mysql_select_db('');
    if (!$DB)
    {
        die("MySQL could not select Database!");
    }
}
else
{
    $original_file = file_get_contents($_SESSION['page']);
}

$stripped_file = strip_tags($original_file, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

// DEBUGGING
// $matches[0] now contains the complete A tags; ex: <a href="link">text</a>
// $matches[1] now contains only the HREFs in the A tags; ex: link

foreach ($matches[1] as $key => $value)
{
    echo "1";
    echo "2";
    echo "3";

    mysql_query("INSERT INTO pages (url) VALUES ('$value')");

    $_SESSION['page'] = $matches[1];
}

?>
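
And here is the rough sketch of the h1/meta part I'm after (untested; get_meta_tags() is PHP's built-in meta-tag reader, and this assumes allow_url_fopen works for the site):

<?php

// Rough sketch: pull the first <h1> with a regex and read the meta tags
// with PHP's built-in get_meta_tags() (returns name => content pairs).
$url  = "http://www.collegefansite.com/";
$html = file_get_contents($url);

if (preg_match('/<h1[^>]*>(.*?)<\/h1>/is', $html, $m))
{
    echo "h1: " . trim(strip_tags($m[1])) . "\n";
}

$meta = get_meta_tags($url); // e.g. $meta['keywords'], $meta['description']
if (isset($meta['keywords']))
{
    echo "keywords: " . $meta['keywords'] . "\n";
}

?>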

IMHO, you should check whether the link already exists in the DB before inserting it; something like the sketch below.
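
A minimal sketch, sticking with the mysql_* API used in this thread (and escaping the value first):

<?php

// Minimal sketch of the duplicate check: only INSERT when no row
// with that URL exists yet (value is escaped against SQL injection).
$url   = mysql_real_escape_string($value);
$check = mysql_query("SELECT 1 FROM pages WHERE url='$url' LIMIT 1");

if (mysql_num_rows($check) == 0)
{
    mysql_query("INSERT INTO pages (url) VALUES ('$url')");
}

?>

Alternatively, a UNIQUE index on the url column plus INSERT IGNORE pushes the duplicate check into MySQL itself.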
