I just launched the first website I wrote entirely myself in PHP/MySQL - www.TomPaineToday.com. After testing it on my own computer for a while, I decided I might as well launch it and see what problems I run into.

The first issue I came across was that my link redirect counter loves counting the Google Bot. I am sure many people have had this idea before and will tell me why I should not do it, but what problems will I cause if I treat Google Bot "clicks" differently than human clicks by filtering IPs?

After I find out that this will blacklist me from Google or whatever other devastating problems it will cause, is there an open list of IP addresses that I should disregard for future click counting?

Thanks

tp


Googlebot has an insurmountable number of IPs, so you can't do that. You might be able to filter by user agent though. Googlebot always identifies itself with a 'Googlebot' user agent, and other bots do the same.

You could do it like this in PHP (completely untested code):

$robots = array(
    'googlebot'         => 'Googlebot',
    'msnbot'            => 'MSNBot',
    'slurp'             => 'Inktomi Slurp',
    'yahoo'             => 'Yahoo',
    'askjeeves'         => 'AskJeeves',
    'fastcrawler'       => 'FastCrawler',
    'infoseek'          => 'InfoSeek Robot 1.0',
    'lycos'             => 'Lycos'
);

$is_bot = false;

// The user agent header may be missing entirely, so default to an empty string
$user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Loop through an array of bot keys (googlebot, slurp, etc)
foreach (array_keys($robots) as $robot)
{
    // If the user agent contains the name of the bot somewhere in it ...
    if (stripos($user_agent, $robot) !== false)
    {
        // We know we have a bot and no longer need to continue our search
        $is_bot = true;
        break;
    }
}

// If not a bot ...
if (!$is_bot)
{
    // Count visitor
}

Be sure you are 301 redirecting the links. Also, the only problem you can run into with being blacklisted by Google is if you're cloaking, serving different content to googlebot than to a human. So don't just redirect googlebot to one place and everyone else somewhere else :)
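To make that concrete, here is a minimal sketch of a counting redirect (untested, and the destination URL is just a placeholder - in practice $target would come from your database lookup for the link, and $is_bot from a user-agent check like the one above):

```php
<?php
// Hypothetical values: $target would be fetched from the database
// for this link ID, and $is_bot set by a user-agent check.
$target = 'http://www.example.com/';
$is_bot = false;

if (!$is_bot) {
    // Count the click here, e.g. UPDATE links SET clicks = clicks + 1
}

// Send a permanent redirect; the third argument sets the 301 status
// so search engines follow the link to its real destination.
header('Location: ' . $target, true, 301);
```

Note this has to run before any output is sent, or header() will fail.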

Thanks so much Dani!
I perhaps should have linked the other discussion where this was first posted - http://www.daniweb.com/web-development/php/threads/433186/redirecting-google-bot
I attempted to remove my first posting on the advice of a commenter, but then it ended up being appropriate... This is my first day posting here :)

Your solution is basically what I went with: a derivation of the advice from cereal and other input I found once I realized Google wasn't the IP address confusing my stats.

$browser = $_SERVER['HTTP_USER_AGENT'];
if(!preg_match('/bot|google|spider|crawl|curl|^$/i',$browser)) {
// probably not a bot, so go on counting...
}

Still waiting to see if it works on the next crawl. I am not clear on exactly what the first argument, "/bot|google|spider|crawl|curl|^$/i", does, but I take it that it looks for each of those words to see if any of them match up...

And yes, I decided that I will not try to outsmart Google. Even the notion to think like that will be removed from my mind for the foreseeable future. I welcome our Google overlords.

What your version does is use preg_match(), which includes a bit of regex. It's probably a bit more efficient than my version, but a bit more complicated too :) I hate regex!

I'm actually not much of a regex expert, but I think that the ^$ are incorrectly placed in your regex string.

You can try testing it out by hard-coding $browser = 'googlebot' and see if it catches it.

Good suggestion to test it out by hard-coding it.
It failed at first. I started looking into regex and quickly decided I hate regex, but it seemed like what I had should work. So I tested it again and realized the problem was the !. I mixed up my true/false. I believe the regex code works.

I'm almost positive you didn't mix up your true/false. I would hard-code something that it should hit on (like googlebot) and then hard-code something that it should not hit on (like firefox) and make sure that it works for both. I'm willing to bet that the way it stands, it will always hit on true or always hit on false.

Seems to be working.

<?php
    $browser = 'googlebot';
    if (!preg_match('/bot|google|spider|crawl|curl|^$/i', $browser)) {
        echo "humanish";
    } else {
        echo "bot Match<br />";
    }

    $browser = 'firefox';
    if (!preg_match('/bot|google|spider|crawl|curl|^$/i', $browser)) {
        echo "humanish<br />";
    } else {
        echo "bot Match<br />";
    }

    $browser = 'msnbot';
    if (!preg_match('/bot|google|spider|crawl|curl|^$/i', $browser)) {
        echo "humanish";
    } else {
        echo "bot Match<br />";
    }
?>

and I get this returned:

bot Match
humanish
bot Match

I wish I knew why. Far too late to figure out regex. A project for another day.

Thanks again!

I think I understand what it's doing... I thought the ^$ were incorrectly placed because I've always seen them used to indicate the beginning or end of a string, but placed like that, I think it's matching the empty string, meaning it will count a hit as a bot if no user agent is provided at all.

ahh. Yes, I just tested it and it does return a positive hit for an empty string.
I guess it is probably a wise decision to assume a blank user agent is a bot, for all practical purposes.
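For anyone else following along, the whole pattern can be sanity-checked in a standalone script (the user-agent strings here are just examples): each alternative (bot, google, spider, crawl, curl) matches as a substring, the /i makes it case-insensitive, and ^$ matches an empty user agent.

```php
<?php
$pattern = '/bot|google|spider|crawl|curl|^$/i';

// 'Googlebot/2.1' contains 'google' (and 'bot'), so it matches
var_dump(preg_match($pattern, 'Googlebot/2.1')); // int(1)

// 'Firefox' contains none of the alternatives, so no match
var_dump(preg_match($pattern, 'Firefox'));       // int(0)

// An empty user agent is matched by the ^$ branch
var_dump(preg_match($pattern, ''));              // int(1)
```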

Thanks to all for sharing useful information.
