Block malicious links via robots.txt

Question

useruno1 39 Newbie Poster

6 Years Ago

Hello guys,
Recently my site was infected with malware, which caused me a lot of problems. In particular, many spam links were created and indexed. I managed to get a lot of them removed with Google Search Console, but some still appear in key searches. Is there any chance of blocking the link prefix in robots.txt so that the links get removed from Google?

<snip>

I want to somehow block indexing of all these links.

I know I can block like this:
User-agent: *
Disallow: /product/categories

But this one is different, it's not like a parent page/category. I would appreciate it very much if you can help me, cheers!

Edited 6 Years Ago by Reverend Jim because: Removed spam links


Dani 4,675 The Queen of DaniWeb

6 Years Ago

I think he’s trying to show the format of the links to see if there is regex or something that can be used to mass deindex them. I believe wildcard characters are supported in robots.txt. Unfortunately, I can’t provide more advice without seeing the format of the original links that were snipped.

Dani 4,675 The Queen of DaniWeb

6 Years Ago

rproffitt, I still believe that robots.txt is the best solution here.

It seems as if malware has created many spammy pages which have subsequently been indexed by Google. The article you linked to suggests the best way to deindex pages that you want Google to still be able to crawl and want visitors to access. In such a case, I would agree with John Mueller, who is Google's ambassador to the SEO community.

However, I would not recommend that strategy in this case. Basically the strategy involves manually adding a <meta name="robots" content="noindex"> tag to every page, and then updating the sitemap to tell Google to recrawl the page soon to notice the change.

The problem with doing that, in this case, is firstly, I would hope that the spammy pages have already been permanently removed. Secondly, if for some reason they haven't been, manually modifying every spammy page doesn't seem like a reasonable solution ... if it were that easy, one would just as easily be able to remove the pages altogether or change their content to be non-spammy.

Instead, a robots.txt file is the best way to quickly tell Google to not crawl that section of the site. This is imperative when the pages are spammy, and you don't want Googlebot to ding you for having spammy pages on your domain. If the pages no longer exist, they'll eventually fall out of the index over time, so don't worry about that.

Dani 4,675 The Queen of DaniWeb

6 Years Ago

And unfortunately that brings us back to how the original post was snipped to remove a crucial part of the question. It had some foul language as well as links to spammy pages, so I'll try to use example URLs instead:

Ex:
https://www.example.com?foo=bar+html
https://www.example.com/?foo=bar.html + many more

I want somehow to block indexing all links from:
"foo="

Check out https://geoffkenyon.com/how-to-use-wildcards-robots-txt/ and you can see there that you can do something like:

User-agent: *
Disallow: *?foo=

(At least, I think so. Please correct me if I'm wrong.)
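For anyone who wants to sanity-check a wildcard rule before deploying it, here is a rough Python sketch of the matching described above. It assumes the commonly documented semantics (`*` matches any run of characters, a trailing `$` anchors the end, and rules match from the start of the path plus query string). The function names `rule_to_regex` and `is_disallowed` are made up for this example, and real crawlers may differ in details such as percent-encoding.

```python
import re
from urllib.parse import urlsplit


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate one robots.txt path rule into a compiled regex.

    Assumed semantics: '*' matches any run of characters and a
    trailing '$' anchors the end; everything else is literal.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(url: str, rule: str) -> bool:
    """Return True if the URL's path and query string match the rule."""
    parts = urlsplit(url)
    target = parts.path or "/"  # an empty path is treated as the root
    if parts.query:
        target += "?" + parts.query
    # re.match anchors at the start, which mirrors prefix matching.
    return rule_to_regex(rule).match(target) is not None


if __name__ == "__main__":
    for url in (
        "https://www.example.com?foo=bar+html",
        "https://www.example.com/?foo=bar.html",
        "https://www.example.com/product/categories",
    ):
        print(url, is_disallowed(url, "*?foo="))
```

One thing this sketch makes visible: the rule only matches when `foo` is the first query parameter, so a URL like `/page?bar=1&foo=2` would slip past `*?foo=`.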

rproffitt commented: That robots.txt file looks correct to me. But will not deindex from what I've read. +15

Dani 4,675 The Queen of DaniWeb

6 Years Ago

That robots.txt file looks correct to me. But will not deindex from what I've read.

I think you've misunderstood what I was saying. A robots.txt file, alone, will not deindex. It was imperative that useruno1 was able to get all the pages to return a 410 Gone HTTP status. The advantage of robots.txt is that Googlebot won't be able to crawl spammy URLs on a domain, consider them spammy, and negatively affect the quality score of the domain as a whole. Therefore, it helps preserve the integrity of the domain (damage to which can take months or years to recover from) while figuring out how to 410.


rproffitt 2,706 https://5calls.org Moderator · Answer 1 · 2019-05-27T20:44:52+00:00

I'm going with no since robots do not have to honor this file.
Noted at http://www.robotstxt.org/faq/blockjustbad.html

On top of that your links don't show me the issue. In fact they seem more like forum spam to me. Tell me what I should be seeing at those links.

rproffitt 2,706 https://5calls.org Moderator · Answer 2 · 2019-05-27T23:12:58+00:00

My point here is robots.txt doesn't appear to be the answer. Looking around for more clarity I found this discussion in which Google's own webmaster John Mueller notes how to mass deindex.

"using a robots.txt won’t remove pages from Google’s index." was his point and again why I wrote no.

The full discussion is at https://www.seoinc.com/seo-blog/fastest-way-to-deindex-pages/ and IMO should be a fine answer on how to deindex where needed.

Dani 4,675 The Queen of DaniWeb Administrator Featured Poster Premium Member · Answer 3 · 2019-05-28T04:06:31+00:00

"using a robots.txt won’t remove pages from Google’s index." was his point and again why I wrote no.

rproffitt, your quote is taken out of context. robots.txt will not remove valid pages from Google's index. If you have a webpage that you want visitors to be able to access, but you simply don't want it to appear in Google's index, then adding it to your robots.txt file won't remove it from the index.

However, when dealing with the issue the OP is suffering from, this is not the solution to the problem. He needs to make sure all the spammy pages no longer exist on his domain, and then use robots.txt immediately in order to do damage control so as to not get hit by a Google algorithm update that doesn't like domains with lots of spammy pages.

rproffitt 2,706 https://5calls.org Moderator · Answer 4 · 2019-05-28T04:32:37+00:00

Thanks Dani.

Sites with spammy content will suffer so I think we're on the same track there.

If useruno1 wants to keep spammy pages that's up to them and let's hope that what's been covered here can clean it up.

useruno1 39 Newbie Poster · Answer 5 · 2019-05-28T07:28:46+00:00

Thank you guys for discussing my problem. I read everything you wrote above, but I still do not know which solution is best. I've removed over 1000 links manually through Google Search Console, but I still can't find a good option to remove them in bulk. I've attached some of the infected files I found in WordPress.

Is there a way to write some code so that all links from www.website.com/?dic return a 410?

https://ufile.io/r8uayfsz >> here are some of the infected files.

This link doesn't work :( https://www.seoinc.com/seo-blog/fastest-way-to-deindex-pages/

useruno1 39 Newbie Poster · Answer 6 · 2019-05-28T08:13:05+00:00

I managed to return HTTP 410 for all /?foo= links. Let's hope this is gonna solve everything.

rproffitt 2,706 https://5calls.org Moderator · Answer 7 · 2019-05-28T17:13:56+00:00

https://www.seoinc.com/seo-blog/fastest-way-to-deindex-pages/ is working here but I'm in the USA using a Google DNS 8.8.8.8 so folk in China, North Korea and who knows where else may not be able to get there.

useruno1 39 Newbie Poster · Answer 8 · 2019-05-31T07:01:47+00:00

I managed to clean up a lot of links with a 410 response. There are a few more, but I will check over the next few days and see if they all disappear. I appreciate your help! I find this to be the only solution, and it kinda works, I guess.. :)

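if (isset($_GET['foo'])) {
    // Answer 410 Gone for any request carrying the spam parameter,
    // then stop before any page content is sent.
    header("HTTP/1.0 410 Gone");
    exit();
}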

useruno1 39 Newbie Poster · Answer 9 · 2019-05-31T09:30:49+00:00

Sorry for the extra reply, guys.. any idea how I can ping the search engines about all these links, with a crawl bot or something? I'd like to let them (the bots) know that the links now return 410 and should be removed.

Dani 4,675 The Queen of DaniWeb Administrator Featured Poster Premium Member · Answer 10 · 2019-05-31T15:57:03+00:00

Sorry, I don’t know how to do that off the top of my head. If the page contents have changed, you can use a sitemap file. But I don’t think Googlebot wants your sitemap to contain dead pages. I think you just have to wait for them to come around naturally and recrawl you.
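While waiting for the recrawl, one thing that can be checked locally is whether every spam URL really does answer with 410. Below is a minimal Python sketch using only the standard library. The function names `fetch_status` and `not_yet_gone` and the example URLs are placeholders invented for this illustration, and this only confirms what your own server returns, not what Google has recorded.

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if no response arrived.

    urlopen raises HTTPError for 4xx/5xx answers, so a 410 shows up
    as an exception carrying the status code.
    """
    request = urllib.request.Request(url)
    try:
        with opener(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except urllib.error.URLError:
        # DNS failure, refused connection, timeout and similar.
        return None


def not_yet_gone(urls, opener=urllib.request.urlopen):
    """Return the URLs that do not (yet) answer with 410 Gone."""
    return [url for url in urls if fetch_status(url, opener) != 410]


if __name__ == "__main__":
    # Placeholder list; in practice this would be the spam URLs
    # collected from the search results or Search Console.
    spam_urls = [
        "https://www.example.com/?foo=bar.html",
        "https://www.example.com/?foo=bar+html",
    ]
    for url in not_yet_gone(spam_urls):
        print("still reachable:", url)
```

The `opener` parameter is there only so the check can be exercised without touching the network; the default is the ordinary `urllib.request.urlopen`.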

useruno1 39 Newbie Poster · Answer 11 · 2019-06-03T07:19:24+00:00

I appreciate your help very much, A++ tech community.
