Last Post by BrotherBill

This is a "robots.txt" file that anyone can copy and place in their root directory:


User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /private/

User-agent: Mediapartners-Google*
Disallow:

User-agent: Fasterfox
Disallow: /

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /

User-agent: Xenu Link Sleuth 1.2g
Disallow: /

User-agent: Xenu
Disallow: /

Sitemap: http://yoursite.com/sitemap.xml

------------------------------------------------------------------------

Copy the above code and save as robots.txt, placing it in the root directory of your website.
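
If you want to sanity-check the file before uploading it, one option is to feed it to Python's standard-library urllib.robotparser. This is only a rough sketch: "ExampleBot" is a made-up agent name, the yoursite.com URLs are placeholders, and Python's matcher is simpler than what real crawlers use, so treat the answers as a quick check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the example file from this post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /private/

User-agent: Fasterfox
Disallow: /

Sitemap: http://yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical agent that falls under the * block.
generic_home = parser.can_fetch("ExampleBot", "http://yoursite.com/index.html")
generic_admin = parser.can_fetch("ExampleBot", "http://yoursite.com/admin/login.php")
fasterfox_home = parser.can_fetch("Fasterfox", "http://yoursite.com/index.html")
sitemaps = parser.site_maps()  # requires Python 3.8 or newer

print(generic_home, generic_admin, fasterfox_home, sitemaps)
```

The generic agent should be let into the home page but kept out of /admin/, while Fasterfox should be refused everywhere.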

Adding the robots meta tag to your main page:

The "NAME" attribute must be "ROBOTS".

Valid values for the "CONTENT" attribute are: "INDEX", "NOINDEX", "FOLLOW", "NOFOLLOW". Multiple comma-separated values are allowed, but obviously only some combinations make sense. If there is no robots <META> tag, the default is "INDEX,FOLLOW", so there's no need to spell that out.

This leaves us with only three variations:

<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

Like any <META> tag, it should be placed within the HEAD section of the page.
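
To see which directives a page actually carries, a small script along these lines could help. It is a sketch built on Python's standard html.parser module; the class and function names are my own, and falling back to INDEX, FOLLOW simply mirrors the default described above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            content = attr_map.get("content", "")
            self.directives += [
                part.strip().upper() for part in content.split(",") if part.strip()
            ]


def robots_directives(html):
    """Return the robots directives for a page, or the default if none are set."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives or ["INDEX", "FOLLOW"]


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head><body></body></html>'
print(robots_directives(page))
```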


Let's take a look at the different parts of the robots.txt file above:

User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /private/

The global rules: each User-agent line names a spider or uses the * wildcard. The * wildcard tells "ALL" spiders that have no block of their own to stay out of these directory paths. You can protect an entire directory and all of its subdirectories, or just a portion of a directory. Keep it as simple as possible.

User-agent: Fasterfox
Disallow: /

This statement tells the Fasterfox spider that it must ignore the root / directory and all of its subdirectories. The forward slash / simply means ignore "EVERYTHING". Just as in the example above, you can instead keep a spider out of only certain directories, but this is rarely used since we are focusing on keeping the dirty spiders from crawling any portion of our websites.

User-agent: Mediapartners-Google*
Disallow:

By leaving the Disallow path empty we are ALLOWING this particular user-agent to crawl the entire site. Note that a spider obeys only the block that matches it most specifically, so once Mediapartners-Google has its own block it no longer reads the global * block. In this example the Google partner can therefore access everything, including the cgi-bin, admin, and private directories. If you want those to stay off limits, repeat the Disallow lines inside its block.
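
How a named block interacts with the * block is easy to get wrong, so it is worth testing. The sketch below uses Python's standard urllib.robotparser, which (like the published protocol) applies only the block that matches the agent and falls back to * when nothing else matches. Two assumptions on my part: "ExampleBot" is a made-up agent, and I dropped the trailing asterisk from the agent name because this particular parser does plain substring matching and would not treat it as a wildcard.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: Mediapartners-Google
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named agent has its own block with an empty Disallow.
adsense_admin = parser.can_fetch("Mediapartners-Google", "/admin/index.html")
# A hypothetical agent with no block of its own falls back to *.
other_admin = parser.can_fetch("ExampleBot", "/admin/index.html")

print(adsense_admin, other_admin)
```

With this parser the named agent is allowed into /admin/ while the generic one is not, so any path that must stay blocked needs to be repeated inside the named block.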

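Be careful with "Allow:". It was not part of the original robots.txt convention, so older spiders may ignore it, although the major search engines do support it.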
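
Python's standard-library parser recognizes Allow lines, so an exception inside a blocked directory can be tried out locally. In this sketch the file name press-kit.html and the agent "ExampleBot" are invented for illustration. Python applies rules in file order, whereas Google documents a most-specific-match rule; listing the Allow line first keeps both readings in agreement.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /private/press-kit.html
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The single allowed file inside the otherwise blocked directory.
press_kit = parser.can_fetch("ExampleBot", "/private/press-kit.html")
# Any other file in the directory stays blocked.
other_private = parser.can_fetch("ExampleBot", "/private/notes.html")

print(press_kit, other_private)
```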


Sitemap: http://yoursite.com/sitemap.xml

This is one of the most useful parts of the robots.txt file. Place your xml, rss or txt sitemap file in the root directory and edit this line to the full URL of that file. A spider that supports the Sitemap line can fetch the file and discover pages it might not reach by following links alone. A sitemap does not guarantee indexing, but it usually leads to a deeper crawl and more complete indexing, which can improve your organic results and bring more traffic.

sitemap.txt: Text-formatted sitemaps are limited to 50,000 links per file, plus a size cap (10 MB when this was written, since raised to 50 MB uncompressed). You must break your sitemap up into smaller files if you have a large site with more than 50,000 links.
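
Splitting a long URL list is simple to script. Here is a minimal sketch that assumes you already have the URLs in a Python list; the sitemap1.txt, sitemap2.txt naming is just a convention I picked, and the function only enforces the link count, not the file-size cap.

```python
MAX_URLS_PER_SITEMAP = 50_000  # per-file link limit mentioned above


def split_sitemap(urls, chunk_size=MAX_URLS_PER_SITEMAP):
    """Map hypothetical file names to text sitemap bodies (one URL per line)."""
    files = {}
    for start in range(0, len(urls), chunk_size):
        name = f"sitemap{start // chunk_size + 1}.txt"
        files[name] = "\n".join(urls[start:start + chunk_size]) + "\n"
    return files


# Placeholder URLs standing in for a large site.
urls = [f"http://yoursite.com/page{i}.html" for i in range(120_000)]
files = split_sitemap(urls)
link_counts = {name: body.count("\n") for name, body in files.items()}
print(link_counts)
```

Each resulting file can then get its own Sitemap: line in robots.txt.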


The bottom line:
The main rule is to keep the robots.txt file simple and easy for the search engines to follow. The above example file includes KNOWN dirty spiders/crawlers that you should avoid. One of the biggest problems with dirty spiders is their lack of accountability. If you've ever loaded your website just to have it sit for a minute or twenty, chances are good that a dirty spider is on your site doing a massive crawl. These crawls can hammer the server's resources until your front end stops responding. Keep in mind that robots.txt is only a request: well-behaved spiders honor it, but a spider that ignores it has to be blocked at the server level.

Hope this helps you a bit
Cheers

0

Thank you, sir. That was a very thorough explanation and will be useful for many of us. I will use your robots.txt code.

One last question: which of the three meta robots tags do you suggest I use along with the robots.txt file you suggested?

Also, how do I perform this step (adding the robots meta tag to your main page) for a WordPress-based site? Should I place it in header.php or index.php? I'm not sure for a WordPress-based site.

Please help.

0

Hi there, unfortunately I'm a Joomla geek. I believe that in WordPress the <head> section usually lives in your theme's header.php, but don't quote me on that. I'll have to install WordPress and play with it soon. You'd place this between the <head> and </head> tags, just like your other meta tags. Unless there's a specific reason why you'd want to use one of the other tags, the default - <META NAME="ROBOTS" CONTENT="INDEX, FOLLOW"> - should be just fine.

Cheers

0

stevenh,

I would like to thank you as well for taking the time to submit such a detailed response. Your post has helped considerably and prompted a bit of research on my part into the subject of bots and the use of a robots.txt file.

I have located a couple of extensive listings of bots, one in particular residing at http://www.botsvsbrowsers.com/, but can someone tell me how you tell a beneficial bot from an undesirable one?

0

I think I may have found my answer. The database at User-Agents.org was very helpful.

Again, thank you for your response in the thread. It provided the motivation I needed to find the answers on my own.

This is all still fairly new to me. My primary concerns at this point are with spam and email harvesting. This is what I've come up with.


User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

# Other bots not allowed

User-agent: 8484 Boston Project
Disallow: /

User-agent: Atomic_Email_Hunter
Disallow: /

User-agent: autoemailspider
Disallow: /

User-agent: bwh3_user_agent
Disallow: /

User-agent: China Local Browse 2.6
Disallow: /

User-agent: ContactBot/0.2
Disallow: /

User-agent: ContentSmartz
Disallow: /

User-agent: DataCha0s/2.0
Disallow: /

User-agent: DBrowse 1.4b
Disallow: /

User-agent: Demo Bot DOT 16b
Disallow: /

User-agent: Demo Bot Z 16b
Disallow: /

User-agent: DSurf15a*
Disallow: /

User-agent: EBrowse 1.4b
Disallow: /

User-agent: Educate Search VxB
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: EmailSpider
Disallow: /

User-agent: EmailWolf 1.00
Disallow: /

User-agent: ESurf15a 15
Disallow: /

User-agent: ExtractorPro
Disallow: /

User-agent: Franklin Locator 1.8
Disallow: /

User-agent: FSurf15a 01
Disallow: /

User-agent: Full Web Bot*
Disallow: /

User-agent: NameOfAgent (CMS Spider)
Disallow: /

User-agent: NASA Search 1.0
Disallow: /

User-agent: Nsauditor/1.x
Disallow: /

User-agent: PBrowse 1.4b
Disallow: /

User-agent: PEval 1.4b
Disallow: /

User-agent: Poirot
Disallow: /

User-agent: Port Huron Labs
Disallow: /

User-agent: Production Bot*
Disallow: /

User-agent: Program Shareware 1.0.2
Disallow: /

User-agent: Progressive Download*
Disallow: /

User-agent: PSurf15a*
Disallow: /

User-agent: psycheclone
Disallow: /

User-agent: RSurf15a*
Disallow: /

User-agent: searchbot admin@google.com
Disallow: /

User-agent: ShablastBot 1.0
Disallow: /

User-agent: snap.com beta crawler v0
Disallow: /

User-agent: Snapbot*
Disallow: /

User-agent: SnoopRob*
Disallow: /

User-agent: Snoopy*
Disallow: /

User-agent: sogou*
Disallow: /

User-agent: Sogou*
Disallow: /

User-agent: sohu*
Disallow: /

User-agent: SSurf15a*
Disallow: /

User-agent: TSurf15a*
Disallow: /

User-agent: Under the Rainbow 2.2
Disallow: /

User-agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Disallow: /

User-agent: VadixBot
Disallow: /

User-agent: VayalaCreep-v0.0.1
Disallow: /

User-agent: Vayala|Creep-v0.0.1
Disallow: /

User-agent: Webdup/0.9
Disallow: /

User-agent: webhack
Disallow: /

User-agent: WebVulnCrawl*
Disallow: /

User-agent: Wells Search II
Disallow: /

User-agent: WEP Search 00
Disallow: /

User-agent: WinampMPEG/2.00*
Disallow: /

User-agent: www4mail*
Disallow: /
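
A list this long is easier to maintain if the file is generated from a plain list of agent names. The following is only a sketch: build_robots_txt is a name I made up, and it drops names that differ only in letter case (such as sogou* and Sogou*) on the assumption that spiders compare agent names case-insensitively.

```python
def build_robots_txt(blocked_agents, disallowed_paths):
    """Build robots.txt text: shared rules for *, then one full block per agent."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    seen = set()
    for agent in blocked_agents:
        key = agent.strip().lower()
        if not key or key in seen:
            continue  # skip blanks and case-insensitive duplicates
        seen.add(key)
        lines += ["", f"User-agent: {agent.strip()}", "Disallow: /"]
    return "\n".join(lines) + "\n"


text = build_robots_txt(
    ["EmailSiphon", "sogou*", "Sogou*", "psycheclone"],
    ["/cgi-bin/", "/images/"],
)
print(text)
```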
