My code is supposed to crawl web pages, index the links, then crawl those pages, and so on.
But it doesn't work.
I get no errors. What is wrong?
I think it gets into the foreach but never makes it into the $DCheck if statement!

<?php

if(empty($_SESSION['page']))
{
    $original_file = file_get_contents("http://www.yahoo.com/");
}
else
{
    $original_file = file_get_contents($_SESSION['page']);
}

  $stripped_file = strip_tags($original_file, "<a>");
  preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

  //DEBUGGING

  //$matches[0] now contains the complete A tags; ex: <a href="link">text</a>
  //$matches[1] now contains only the HREFs in the A tags; ex: link

foreach( $matches[1] as $key => $value)
{
    echo "1";
    echo "2";
    $Check = mysql_query("SELECT * FROM pages WHERE URL='$value'");
    $DCheck = mysql_num_rows($Check);
    if($DCheck != 0)
    {
        mysql_query("INSERT INTO pages (url)
                    VALUES ('$value')");

        $_SESSION['page'] = $matches[1];
        die($DCheck);
    }

}

?>
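
A likely reason the $DCheck branch never runs: mysql_num_rows() returns 0 for a URL that is not in the table yet, so if($DCheck != 0) only inserts URLs that already exist, and nothing ever gets into an empty table. Below is a minimal sketch of the inverted check. It assumes a PDO connection and the pages table with a url column from the post; the name storeIfNew is hypothetical, and PDO is swapped in for the old mysql_* functions, which were removed in PHP 7.

```php
<?php
// Hypothetical helper: insert $url into pages only if it is not there yet.
// Returns true when a new row was inserted, false when the URL was already known.
function storeIfNew(PDO $db, $url)
{
    // Prepared statements keep quotes in the URL from breaking the query.
    $check = $db->prepare('SELECT COUNT(*) FROM pages WHERE url = ?');
    $check->execute(array($url));

    // Insert only when the URL is NOT in the table (the post tested != 0).
    if ((int) $check->fetchColumn() !== 0) {
        return false; // already indexed
    }

    $insert = $db->prepare('INSERT INTO pages (url) VALUES (?)');
    $insert->execute(array($url));
    return true;
}
```

Inside the foreach this would be called as storeIfNew($db, $value). Note also that $_SESSION['page'] should probably get a single URL such as $value rather than the whole $matches[1] array.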

Then what else can I do to build a PHP crawler?

Hi,

I built a web crawler in PHP using the exec function to run the Unix shell command wget. It worked rather well.
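
For illustration, the exec-plus-wget idea could look roughly like the sketch below. It assumes a Unix-like system with wget on the PATH; the function names buildWgetCommand and fetchWithWget and the timeout/retry values are made up for this example, not taken from my original crawler.

```php
<?php
// Build the shell command for fetching one URL with wget.
// escapeshellarg() quotes both arguments so a hostile URL cannot inject shell commands.
function buildWgetCommand($url, $outFile)
{
    return 'wget -q --timeout=10 --tries=2 -O '
        . escapeshellarg($outFile) . ' ' . escapeshellarg($url);
}

// Run wget and return the page body, or false if wget reported an error.
function fetchWithWget($url)
{
    $tmp = tempnam(sys_get_temp_dir(), 'crawl');
    exec(buildWgetCommand($url, $tmp), $output, $status);
    $body = ($status === 0) ? file_get_contents($tmp) : false;
    unlink($tmp); // remove the temporary download
    return $body;
}
```

The result of fetchWithWget($url) can then be fed into the same link-extraction code that file_get_contents() was feeding before.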

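One thing to bear in mind with crawlers is to add the URLs to a list rather than crawling them recursively. PHP itself has no fixed recursion limit, but deep recursion can exhaust memory or the stack, and with Xdebug installed the nesting depth is capped by xdebug.max_nesting_level (100 by default in older versions). I think from your code you're using the list approach anyway.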
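
The list approach can be sketched as a simple queue plus a set of URLs already seen. This is only an outline under some assumptions: the function name crawl, the $fetch callback and the $maxPages cap are hypothetical, and relative links are not resolved against the page URL.

```php
<?php
// Breadth-first crawl using a queue instead of recursion.
// $fetch is any callable that takes a URL and returns the HTML, or false on failure.
// Returns the list of URLs visited, in the order they were fetched.
function crawl($startUrl, $fetch, $maxPages = 50)
{
    $queue   = array($startUrl);          // URLs waiting to be fetched
    $seen    = array($startUrl => true);  // URLs already queued, to avoid loops
    $visited = array();

    while (count($queue) > 0 && count($visited) < $maxPages) {
        $url  = array_shift($queue);
        $html = call_user_func($fetch, $url);
        $visited[] = $url;
        if ($html === false) {
            continue; // fetch failed, nothing to extract
        }

        // Pull out the href of every <a> tag (absolute URLs assumed).
        preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);
        foreach ($matches[1] as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}
```

For a real run, $fetch could simply be 'file_get_contents'; for a long crawl the queue and the seen set would live in the database rather than in memory.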

R.

Hey everyone, I finally fixed this! But I still need your help. I now want to crawl for the h1 tag and also get the most-used keywords. How can I get the meta information? Here is what I have. Also, what is a working alternative to file_get_contents()? file_get_contents() won't work on some sites' servers, like Google's.

<?php

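// $_SESSION is only available once the session has been started.
if (session_id() === '')
{
    session_start();
}

// Connect on every request, not only when no page is stored in the session yet.
$connect = mysql_connect("127.0.0.1","root","");
if (!$connect)
{
    die("MySQL could not connect!");
}

$DB = mysql_select_db('');
if (!$DB)
{
    die("MySQL could not select Database!");
}

if(empty($_SESSION['page']))
{
    $original_file = file_get_contents("http://www.collegefansite.com/");
}
else
{
    $original_file = file_get_contents($_SESSION['page']);
}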

  $stripped_file = strip_tags($original_file, "<a>");
  preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

  //DEBUGGING

  //$matches[0] now contains the complete A tags; ex: <a href="link">text</a>
  //$matches[1] now contains only the HREFs in the A tags; ex: link

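foreach( $matches[1] as $key => $value)
{
    // Debugging output
    echo "1";
    echo "2";
    echo 3;

    // Escape the URL so quotes in it cannot break the query.
    $url = mysql_real_escape_string($value);
    mysql_query("INSERT INTO pages (url)
                    VALUES ('$url')");

    // Store a single URL (not the whole $matches[1] array) as the next page to crawl.
    $_SESSION['page'] = $value;

}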

?>
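
On the open questions (h1, meta information, keywords, and a replacement for file_get_contents()), here is one possible sketch. It assumes the cURL and DOM extensions are installed; the function names fetchPage, extractPageInfo and topKeywords and the user agent string are hypothetical. Some servers reject requests that carry no User-Agent header, which may be part of why file_get_contents() fails on them.

```php
<?php
// Fetch a page with cURL instead of file_get_contents(). Returns the body or false.
function fetchPage($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_setopt($ch, CURLOPT_USERAGENT, 'ExampleCrawler/0.1'); // made-up name
    return curl_exec($ch);
}

// Collect the text of every <h1> and the name/content pairs of <meta> tags.
function extractPageInfo($html)
{
    $info = array('h1' => array(), 'meta' => array());
    if ($html === '' || $html === false) {
        return $info;
    }
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('h1') as $h1) {
        $info['h1'][] = trim($h1->textContent);
    }
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name !== '') {
            $info['meta'][$name] = $meta->getAttribute('content');
        }
    }
    return $info;
}

// Count words of four or more letters in the visible text, most frequent first.
// Crude: strip_tags() keeps the text inside <script> and <style> blocks.
function topKeywords($html, $limit = 10)
{
    $words  = str_word_count(strtolower(strip_tags($html)), 1);
    $counts = array_count_values($words);
    foreach (array_keys($counts) as $word) {
        if (strlen($word) < 4) {
            unset($counts[$word]); // drop short filler words
        }
    }
    arsort($counts);
    return array_slice($counts, 0, $limit, true);
}
```

In the crawler above, $original_file = fetchPage($url) would replace the file_get_contents() calls, and extractPageInfo($original_file) and topKeywords($original_file) would run before the links are stored.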