Digital Publishers:

Do you block AI bots via robots.txt, with HTTP 403s, or both? Why or why not?

Currently on the fence about it and looking to get opinions from other publishers as to their reasoning and how it's influenced their SEO, etc.

As I have mentioned before here, my company has its own (interconnected) firewall system on our servers. Currently, we don't have a strict policy for AI bots, but we do have a strict policy against request flooding, especially when we see that the ISP behind the request's IP is a VPN, a transit provider, or a data-collecting company.
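To make that concrete, here's a rough Python sketch of what our flood check boils down to: a sliding-window rate limit per IP plus a crude keyword match on the organisation name behind the IP. The thresholds and keyword list here are illustrative placeholders, not our production values.

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds -- real values are tuned per traffic profile.
WINDOW_SECONDS = 10
MAX_REQUESTS_PER_WINDOW = 50

# Illustrative keywords matched against the ISP/org name from a WHOIS/ASN lookup.
SUSPICIOUS_ISP_KEYWORDS = ("vpn", "transit", "hosting", "data")

_recent = defaultdict(deque)  # ip -> deque of recent request timestamps

def is_flooding(ip):
    """Sliding-window check: True once an IP exceeds the request limit."""
    now = time.time()
    window = _recent[ip]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # drop timestamps that have aged out of the window
    return len(window) > MAX_REQUESTS_PER_WINDOW

def is_suspicious_isp(org_name):
    """Crude keyword match on the organisation name behind the IP."""
    return any(k in org_name.lower() for k in SUSPICIOUS_ISP_KEYWORDS)
```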

I've realized that some of these data-collecting/scraping companies might be linked to AI bots, but I can't say it with 100% certainty. However, I have also seen the opposite: requests from IPs that "smell" like AI bots can be perfectly valid and have reasonable timing intervals (of a few milliseconds). In such cases, we let them get the information they want.

Most of our clients are e-shops or professional presentation sites. This means that when AI bots/data-collecting companies behave reasonably, we let them scrape whatever they want, because real referral visits from AI chatbots are increasing (although, in the data I have, not as much as has been reported).

We don't have a "forum" client, so I can't really speak about that, but we do have some news portals. We have talked with them about this, and so far, they don't face a significant problem from AI bots stealing content. However, we've already grouped them onto different servers so that we can apply different firewall rules in the future to aggressively block AI bots or related companies from accessing their content.

Dani, are you talking about bots that are collecting data to feed an LLM? This is what I was referring to. If so, how do you really know? In my limited experience, I have to research each IP and the data collection/scraping company to see if it has a connection to a known LLM. Is there an easier way?
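The closest I've come to automating that research is a forward-confirmed reverse DNS check, the same trick people use to verify Googlebot: resolve the IP to a hostname, make sure the hostname is under a domain the crawler's operator actually controls, then resolve it forward again and confirm it round-trips to the same IP. A rough Python sketch (the domain suffixes below are examples only; each vendor documents its own, and some, like OpenAI, publish IP ranges to match against instead):

```python
import socket

# Example suffixes for known crawler operators -- check each vendor's docs.
KNOWN_CRAWLER_DOMAINS = (".googlebot.com", ".search.msn.com")

def verify_crawler_ip(ip):
    """Forward-confirmed reverse DNS: the rDNS hostname must belong to the
    operator's domain AND resolve back to the same IP address."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    except socket.herror:
        return False
    if not hostname.endswith(KNOWN_CRAWLER_DOMAINS):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips  # must round-trip to the same address
```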

Or are you talking about bots that are AI-driven?

We use Cloudflare, which I'm not sure if you use or are familiar with, but they have a feature called AI Audit that analyzes AI crawlers. For example, I can see how many requests were made by Amazon Alexa's bot, ChatGPT's bot, Apple Siri's bot, etc.

You can block any or all of these crawlers by specifying their names in your robots.txt file. For good measure, Cloudflare also gives you the option of blocking them with an HTTP 403 error if they misbehave and don't respect your robots.txt (though, in my experience, they do adhere to it).
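For anyone who wants a concrete starting point, a robots.txt along these lines covers the big self-identifying crawlers. The user agent tokens below are the ones these vendors publish, but check each vendor's documentation, since the list keeps changing:

```
# Opt out of AI training crawlers (tokens as published by each vendor)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Note that Google-Extended is a robots.txt control token rather than a separate crawler; Google still fetches pages as Googlebot and just honors the token when deciding what it can use for AI training.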

I would never block any specific IP or IP range for any long duration of time, but I think that blocking specific user agents makes sense server-side, and blocking specific bots via robots.txt is the easiest way since the AI bots are self-identifying.
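If you do want the server-side 403 as a backstop, it doesn't take much. Here's a minimal WSGI sketch in Python that matches a few of the self-identifying user agents (Google-Extended is deliberately absent, since it never appears as a request user agent):

```python
import re

# Self-identifying AI crawler names; extend the pattern as new ones appear.
BLOCKED_UA_PATTERN = re.compile(r"GPTBot|CCBot|ClaudeBot", re.IGNORECASE)

def block_ai_bots(app):
    """WSGI middleware: answer 403 Forbidden when the user agent matches."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if BLOCKED_UA_PATTERN.search(user_agent):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"403 Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

The same idea translates directly to an nginx or Apache rule if you'd rather reject the request before it ever reaches your application.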

As of now, I don't block any AI bots, but I do use flood protection, just as you do. I've been on the fence about blocking AI bots because, as you say, on the one hand they might be the future of getting web traffic, but on the other hand, they steal your content.

See attached for Cloudflare's report.
