I'm noticing Googlebot is not respecting my robots.txt. I'm seeing Googlebot's user agent crawling pages that have been disallowed in my robots.txt file for many months. Some of them are showing up in GSC as "Indexed, though blocked by robots.txt" with Last crawled dates as recent as yesterday.

Additionally, I'm seeing Googlebot crawl my robots.txt file a few times a day, and the URLs are definitely blocked per the Google robots.txt tester.

My robots.txt is in the following format:

Sitemap: ...

User-agent: *

# ...

Disallow: ...
Disallow: ...
etc. ~ 40 lines

# ...

Disallow: ...
Disallow: ...
etc. ~ 60 lines

# ...

Disallow: ...
Disallow: ...
etc. ~ 20 lines
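
For what it's worth, a script like the following can double-check locally that a given path is disallowed for Googlebot. It's a minimal sketch with placeholder URLs, and Python's built-in parser follows the original robots exclusion rules rather than Google's wildcard handling, so treat it as a rough sanity check rather than a substitute for the GSC tester.

# Minimal sketch with hypothetical URLs. Note that urllib.robotparser
# implements the original robots exclusion protocol (prefix matching only),
# so results can differ from Google's parser for wildcard rules.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder robots.txt URL
rp.read()

for url in ("https://www.example.com/private/page1",   # placeholder paths
            "https://www.example.com/public/page2"):
    print(url, "-> allowed for Googlebot:", rp.can_fetch("Googlebot", url))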


All 12 Replies

I read that the GSC Live Test tool does check for blocking by robots.txt. Maybe that's the next step.

commented: Have to chuckle here that you used GSC without clarification, but a Google search for it comes up with "girl scout cookies" +34

Per the OP's post:

Additionally, I'm seeing Googlebot crawl my robots.txt file a few times a day, and the URLs are definitely blocked per the Google robots.txt tester.

Time for Google's own people to offer up ideas. That Google Search Console (GSC) gives the robots.txt a passing grade tells me the likelihood of fake Googlebot requests just went up, even if the user agent looks legitimate.
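
One way to rule that in or out is Google's documented verification: do a reverse DNS lookup on the requesting IP, check that the hostname ends in googlebot.com or google.com, then forward-resolve the hostname and confirm it maps back to the same IP. A rough sketch, assuming you can pull client IPs out of your access logs:

# Rough sketch of the reverse-then-forward DNS check for genuine Googlebot.
# The IP below is just an example; feed it IPs pulled from your access logs.
import socket

def is_real_googlebot(ip: str) -> bool:
    try:
        host = socket.gethostbyaddr(ip)[0]                # reverse DNS
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]     # forward-confirm
    except socket.gaierror:
        return False

print(is_real_googlebot("66.249.66.1"))

If the hits fail that check, the user agent string is spoofed and robots.txt was never going to stop them.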

Sorry for not defining GSC first. I'll work harder on that.

Sorry for not defining GSC first. I'll work harder on that.

No no, no worries at all. I actually think it would be annoying if well-understood acronyms were constantly explained and defined. If every question redefined what GSC or GWT or GMB is, it would come off as spammy, the same way all those low-quality spammers saying "SEO is search engine optimization" come off as spammy.

I was just giggling because you just accused someone else of doing the exact same thing.

Time for Google's own people to offer up ideas.

This isn't the type of thing that Google representatives will officially comment on. This is the type of thing that's often asked and discussed ad nauseam within SEO forums such as this one, WebmasterWorld, BlackHatWorld, DigitalPoint, etc. :)

commented: Then more clues are needed. Detective work. +15

The hope is that a seasoned SEO who has experienced something similar with one of their clients comes across this thread.

Then more clues are needed. Detective work.

That pretty much sums up the entire SEO industry in a nutshell.

Just for clarity: I read yet another of the way-too-many discussions on this, and the original poster was upset that no one understood their question. Their concern was not that Google ignored the robots.txt, but that their pages were not being indexed. It was quite confusing, since they led with "Google ignores my robots.txt" and so everyone was off track.

What they really wanted was for others to find out what lines in the robots.txt caused the blocking message. But by the time they cleared that up (this was on Reddit), no one was going to help because the original poster had flamed everyone for not understanding what they were asking.

Me? I misunderstand the questions at times as well.

Comment. I can't wait to hear about your child when they enter the "Why?" phase.

What they really wanted was for others to find out what lines in the robots.txt caused the blocking message.

So basically they asked why G is ignoring their robots.txt when what they really meant was why G is adhering to their robots.txt. ;)

Comment. I can't wait to hear about your child when they enter the "Why?" phase.

I was in that phase all through college. If my offspring takes after me, you're going to be waiting awhile.

While researching ideas to improve crawl budget, I stumbled upon this article over at Search Engine Land.

It alludes to the idea that:

Google bot may assume you've made a mistake if you disallow lots of content or if a restricted page receives a lot of incoming links and it may still crawl these pages.

In other words, because our robots.txt is rather large, and blocks a significant chunk of our pages, Googlebot may think that we made a mistake and didn't mean to block so much content, so they're ignoring our robots.txt in an effort to do what they think is in our best interests.

However, it's possible the author of the article misspoke, and is confusing crawling with indexing. It's well known that if a restricted page receives a lot of incoming links, it may still be indexed albeit not crawled. In such cases, the naked, description-less links will show up in the Google search results for highly relevant searches.
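
Related to that: robots.txt only controls crawling, not indexing, so the usual way to keep a blocked-but-linked URL out of the index is to let Googlebot crawl it and serve a noindex (meta robots tag or X-Robots-Tag header) instead; a noindex on a page Googlebot can't fetch is never seen. Here's a quick sketch to see which signals a page currently sends; it assumes the requests library, uses a placeholder URL, and the meta check is a crude string match rather than real HTML parsing.

# Quick sketch: report whether a page serves noindex via header or meta tag.
# Placeholder URL; the meta check is a crude string match, not a real parser.
import requests

url = "https://www.example.com/private/page1"
resp = requests.get(url, timeout=10)

body = resp.text.lower()
header_noindex = "noindex" in resp.headers.get("X-Robots-Tag", "").lower()
meta_noindex = 'name="robots"' in body and "noindex" in body

print("X-Robots-Tag noindex:", header_noindex)
print("meta robots noindex (rough check):", meta_noindex)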

commented: Nice find. That may make this testable. I feel for those that do such testing. +15

I feel for those that do such testing.

That's the big advantage to outsourcing your SEO efforts to an SEO agency. I might know just as much as, or even more than, the best agencies do about SEO. But where they will always have me beat, hands down, is access to tons of data across all of their many different clients' sites. They know what worked here, and what didn't work there, and can apply that constantly changing knowledge to new clients. (Because, of course, what worked two years ago is not going to be the same thing that works today.)

That's why most SEO agencies stick to a specific niche. For example, some SEO agencies focus entirely on local SEO, or on sites within a specific industry. The more similar all your clients are, the more you can apply lessons learned from one client to the benefit of all the others.
