I'm noticing Googlebot is not respecting my robots.txt. I'm seeing Googlebot's user agent crawling pages that have been disallowed in my robots.txt file for many months. Some of them are showing up in GSC as "Indexed, though blocked by robots.txt" with Last crawled dates as recent as yesterday.

Additionally, I'm seeing Googlebot crawl my robots.txt file a few times a day, and the URLs are definitely blocked per the Google robots.txt tester.

My robots.txt is in the following format:

Sitemap: ...

User-agent: *

# ...

Disallow: ...
Disallow: ...
etc. ~ 40 lines

# ...

Disallow: ...
Disallow: ...
etc. ~ 60 lines

# ...

Disallow: ...
Disallow: ...
etc. ~ 20 lines
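
For what it's worth, a script like the following can double-check locally that a given path is disallowed for Googlebot. It's a minimal sketch with placeholder URLs, and Python's built-in parser follows the original robots exclusion rules rather than Google's wildcard handling, so treat it as a rough sanity check rather than a substitute for the GSC tester.

# Minimal sketch with hypothetical URLs. Note that urllib.robotparser
# implements the original robots exclusion protocol (prefix matching only),
# so results can differ from Google's parser for wildcard rules.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder robots.txt URL
rp.read()

for url in ("https://www.example.com/private/page1",   # placeholder paths
            "https://www.example.com/public/page2"):
    print(url, "-> allowed for Googlebot:", rp.can_fetch("Googlebot", url))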


All 12 Replies

I read that the GSC Live Test tool does check for blocking by robots.txt. Maybe that's the next step.

commented: Have to chuckle here that you used GSC without clarification, but a Google search for it comes up with "girl scout cookies" +34

Per the OP's post:

Additionally, I'm seeing Googlebot crawl my robots.txt file a few times a day, and the URLs are definitely blocked per the Google robots.txt tester.

Time for Google's own people to offer up ideas. That Google Search Console (GSC) gives the robots.txt a passing grade tells me the likelihood of fake Googlebot requests just went up, even if the user agent looks legitimate.
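
One way to rule that in or out is Google's documented verification: do a reverse DNS lookup on the requesting IP, check that the hostname ends in googlebot.com or google.com, then forward-resolve the hostname and confirm it maps back to the same IP. A rough sketch, assuming you can pull client IPs out of your access logs:

# Rough sketch of the reverse-then-forward DNS check for genuine Googlebot.
# The IP below is just an example; feed it IPs pulled from your access logs.
import socket

def is_real_googlebot(ip: str) -> bool:
    try:
        host = socket.gethostbyaddr(ip)[0]                # reverse DNS
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]     # forward-confirm
    except socket.gaierror:
        return False

print(is_real_googlebot("66.249.66.1"))

If the hits fail that check, the user agent string is spoofed and robots.txt was never going to stop them.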

Sorry for not defining GSC first. I'll work harder on that.

Sorry for not defining GSC first. I'll work harder on that.

No no, no worries at all. I actually think it would be annoying if well-understood acronyms were constantly explained and defined. If every question redefined what GSC or GWT or GMB is, it would come off as spammy, the same way all those low-quality spammers saying "SEO is search engine optimization" come off as spammy.

I was just giggling because you just accused someone else of doing the exact same thing.

Time for Google's own people to offer up ideas.

This isn't the type of thing that Google representatives will officially comment on. This is the type of thing that's often asked and discussed ad nauseam within SEO forums such as this one, WebmasterWorld, BlackHatWorld, DigitalPoint, etc. :)

commented: Then more clues are needed. Detective work. +15

The hope is that a seasoned SEO who has experienced something similar with one of their clients comes across this thread.

Then more clues are needed. Detective work.

That pretty much sums up the entire SEO industry in a nutshell.

Just for clarity: I read yet another of the way-too-many discussions on this, and the original poster was upset that no one understood their question. Their concern was not that Google ignored the robots.txt, but that their pages were not being indexed. It was quite confusing, since they led with "Google ignores my robots.txt" and so everyone was off track.

What they really wanted was for others to find out what lines in the robots.txt caused the blocking message. But by the time they cleared that up (this was on Reddit), no one was going to help because the original poster had flamed everyone for not understanding what they were asking.

Me? I misunderstand the questions at times as well.

Comment. I can't wait to hear about your child when they enter the "Why?" phase.

What they really wanted was for others to find out what lines in the robots.txt caused the blocking message.

So basically they asked why G is ignoring their robots.txt when what they really meant was why G is adhering to their robots.txt. ;)

Comment. I can't wait to hear about your child when they enter the "Why?" phase.

I was in that phase all through college. If my offspring takes after me, you're going to be waiting awhile.

While researching ideas to improve crawl budget, I stumbled upon this article over at Search Engine Land.

It alludes to the idea that:

Google bot may assume you've made a mistake if you disallow lots of content or if a restricted page receives a lot of incoming links and it may still crawl these pages.

In other words, because our robots.txt is rather large, and blocks a significant chunk of our pages, Googlebot may think that we made a mistake and didn't mean to block so much content, so they're ignoring our robots.txt in an effort to do what they think is in our best interests.

However, it's possible the author of the article misspoke, and is confusing crawling with indexing. It's well known that if a restricted page receives a lot of incoming links, it may still be indexed albeit not crawled. In such cases, the naked, description-less links will show up in the Google search results for highly relevant searches.
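
Related to that: robots.txt only controls crawling, not indexing, so the usual way to keep a blocked-but-linked URL out of the index is to let Googlebot crawl it and serve a noindex (meta robots tag or X-Robots-Tag header) instead; a noindex on a page Googlebot can't fetch is never seen. Here's a quick sketch to see which signals a page currently sends; it assumes the requests library, uses a placeholder URL, and the meta check is a crude string match rather than real HTML parsing.

# Quick sketch: report whether a page serves noindex via header or meta tag.
# Placeholder URL; the meta check is a crude string match, not a real parser.
import requests

url = "https://www.example.com/private/page1"
resp = requests.get(url, timeout=10)

body = resp.text.lower()
header_noindex = "noindex" in resp.headers.get("X-Robots-Tag", "").lower()
meta_noindex = 'name="robots"' in body and "noindex" in body

print("X-Robots-Tag noindex:", header_noindex)
print("meta robots noindex (rough check):", meta_noindex)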

commented: Nice find. That may make this testable. I feel for those that do such testing. +15

I feel for those that do such testing.

That's the big advantage to outsourcing your SEO efforts to an SEO agency. I might know just as much as, or even more than, the best agencies do about SEO. But where they will always have me beat, hands down, is access to tons of data across all of their many different clients' sites. They know what worked here, and what didn't work there, and can apply that constantly changing knowledge to new clients. (Because, of course, what worked two years ago is not going to be the same thing that works today.)

That's why most SEO agencies stick to a specific niche. For example, some SEO agencies focus entirely on local SEO, or on sites within a specific industry. The more similar all your clients are, the more you can apply lessons learned from one client to the benefit of all the others.
