My code is supposed to crawl web pages, index the links, then crawl those pages, and on and on again!
But it won't work.
I get no errors, so what is wrong?
I think it gets into the foreach but never makes it to the $DCheck if statement!

<?php

if (empty($_SESSION['page']))
{
    $original_file = file_get_contents("http://www.yahoo.com/");
}
else
{
    $original_file = file_get_contents($_SESSION['page']);
}

$stripped_file = strip_tags($original_file, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

// DEBUGGING
// $matches[0] now contains the complete A tags; ex: <a href="link">text</a>
// $matches[1] now contains only the HREFs in the A tags; ex: link

foreach ($matches[1] as $key => $value)
{
    echo "1";
    echo "2";

    $Check  = mysql_query("SELECT * FROM pages WHERE URL='$value'");
    $DCheck = mysql_num_rows($Check);

    if ($DCheck != 0)
    {
        mysql_query("INSERT INTO pages (url) VALUES ('$value')");

        $_SESSION['page'] = $matches[1];
        die($DCheck);
    }
}

?>


Are you sure that file_get_contents is allowed to fetch a remote page? Not every host enables allow_url_fopen, which remote fetching requires.
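
Here is a minimal sketch of that check with a cURL fallback (assuming the cURL extension is installed; fetch_page is just a made-up helper name):

<?php

// Minimal sketch: use file_get_contents() when allow_url_fopen is enabled,
// otherwise fall back to the cURL extension (assumed to be installed).
function fetch_page($url)
{
    if (ini_get('allow_url_fopen'))
    {
        return file_get_contents($url);
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $html = curl_exec($ch);
    curl_close($ch);

    return $html;
}

?>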

Then what else can I do to build a PHP crawler?

Hi,

I built a web crawler in PHP using the function exec to run the Unix command wget. It worked rather well.
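
Roughly like this (a rough sketch; wget being on the PATH and the URL are assumptions):

<?php

// Rough sketch of the exec()/wget approach (assumes wget is on the PATH):
// download the page to a temp file via the shell, then read it back in.
$url = 'http://www.example.com/';            // placeholder URL
$tmp = tempnam(sys_get_temp_dir(), 'crawl'); // temp file for the download

exec('wget -q -O ' . escapeshellarg($tmp) . ' ' . escapeshellarg($url));

$original_file = file_get_contents($tmp);
unlink($tmp);

?>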

One thing to bear in mind with crawlers is to add the URLs to a list rather than recursively crawling them, since PHP can hit recursion-depth limits (around 100 in some setups), although from your code I think you're using the list approach anyway.
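
A minimal sketch of that worklist idea (the seed URL is a placeholder):

<?php

// Minimal worklist sketch: keep a queue of URLs instead of recursing,
// so long link chains cannot exhaust the call stack.
$queue   = array('http://www.example.com/'); // placeholder seed
$visited = array();

while (!empty($queue))
{
    $url = array_shift($queue);

    if (isset($visited[$url]))
    {
        continue; // already crawled this one
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false)
    {
        continue; // fetch failed, move on
    }

    preg_match_all('/<a[^>]*href="([^"]*)"/i', $html, $m);
    foreach ($m[1] as $link)
    {
        $queue[] = $link; // enqueue instead of recursing
    }
}

?>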

R.

Hey everyone, I finally fixed this! But I still need your help. I now want to crawl for the h1 tag and also pull out the most-used keywords. How can I get the meta information? Here is what I have, with a rough sketch of the h1/meta part after it. Also, what is an alternative that works reliably instead of file_get_contents()? file_get_contents() won't work on some sites' servers, like Google's.

<?php

if (empty($_SESSION['page']))
{
    $original_file = file_get_contents("http://www.collegefansite.com/");

    $connect = mysql_connect("127.0.0.1", "root", "");
    if (!$connect)
    {
        die("MySQL could not connect!");
    }

    $DB = mysql_select_db('');
    if (!$DB)
    {
        die("MySQL could not select Database!");
    }
}
else
{
    $original_file = file_get_contents($_SESSION['page']);
}

$stripped_file = strip_tags($original_file, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

// DEBUGGING
// $matches[0] now contains the complete A tags; ex: <a href="link">text</a>
// $matches[1] now contains only the HREFs in the A tags; ex: link

foreach ($matches[1] as $key => $value)
{
    echo "1";
    echo "2";
    echo "3";

    mysql_query("INSERT INTO pages (url) VALUES ('$value')");

    $_SESSION['page'] = $matches[1];
}

?>
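
And here is the rough sketch of the h1/meta part I'm after (untested; get_meta_tags() is PHP's built-in meta-tag reader, and this assumes allow_url_fopen works for the site):

<?php

// Rough sketch: pull the first <h1> with a regex and read the meta tags
// with PHP's built-in get_meta_tags() (returns name => content pairs).
$url  = "http://www.collegefansite.com/";
$html = file_get_contents($url);

if (preg_match('/<h1[^>]*>(.*?)<\/h1>/is', $html, $m))
{
    echo "h1: " . trim(strip_tags($m[1])) . "\n";
}

$meta = get_meta_tags($url); // e.g. $meta['keywords'], $meta['description']
if (isset($meta['keywords']))
{
    echo "keywords: " . $meta['keywords'] . "\n";
}

?>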

IMHO, you should check whether the link already exists in the DB before inserting it; something like the sketch below.
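
A minimal sketch, sticking with the mysql_* API used in this thread (and escaping the value first):

<?php

// Minimal sketch of the duplicate check: only INSERT when no row
// with that URL exists yet (value is escaped against SQL injection).
$url   = mysql_real_escape_string($value);
$check = mysql_query("SELECT 1 FROM pages WHERE url='$url' LIMIT 1");

if (mysql_num_rows($check) == 0)
{
    mysql_query("INSERT INTO pages (url) VALUES ('$url')");
}

?>

Alternatively, a UNIQUE index on the url column plus INSERT IGNORE pushes the duplicate check into MySQL itself.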
