Hiya,

I need to know something.
At first I thought sitemap XML files list all the .html, .htm, .shtm and .shtml files, i.e. all the pages of the website.
But now I see that sitemap XML files can also list other XML files. Check this one out for what I mean:
https://www.rocktherankings.com/sitemap_index.xml

So that means I have to program my web crawler to go one level deep to find the site links (HTML files).
Question is: does this happen more than one level deep?
I mean, does it go like this:
I go to a sitemap XML file.
I see further XML files listed. I click through to one of them. That is one level deep.
I see more XML files listed there. I click through to one of them. That is two levels deep.

How many levels deep can a site go like this before it lists its HTML files?
I need to know this so I can program my crawler with how many levels it should check before giving up. I do not want it going into an endless loop and getting caught in a trap.

All 4 Replies

There can be an infinite number of levels of sitemap index files pointing to sitemap index files pointing to sitemap files.

Luckily, the BJ_Crawler class you described over in your other thread already uses recursion to handle an unlimited depth, so you don't need to handle this on your end. It's already being taken care of for you.
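
For anyone writing their own crawler rather than using that class, here is a rough Python sketch of the same recursive idea, with a depth cap and a visited set added so a badly configured or self-referencing sitemap cannot trap the crawler. This is not the BJ_Crawler code. The names collect_page_urls, fetch, max_depth and seen are made up for illustration, and fetch is assumed to be any function you supply that takes a URL and returns the XML text.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}urlset' -> 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def collect_page_urls(sitemap_url, fetch, max_depth=5, seen=None):
    """Return the page URLs reachable from sitemap_url.

    fetch     -- hypothetical callable you provide: URL -> XML text
    max_depth -- how many levels of nested index files to follow
    seen      -- sitemap URLs already visited (guards against cycles)
    """
    if seen is None:
        seen = set()
    # Give up when the depth budget is spent or the file was already visited.
    if max_depth < 0 or sitemap_url in seen:
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))

    # Each direct child is a <sitemap> or <url> entry holding one <loc>.
    locs = []
    for entry in root:
        for child in entry:
            if local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())

    if local_name(root.tag) == "sitemapindex":
        # An index file: every <loc> is another sitemap, so recurse into it.
        pages = []
        for child_url in locs:
            pages.extend(collect_page_urls(child_url, fetch, max_depth - 1, seen))
        return pages

    # A <urlset>: the <loc> values are the actual page URLs.
    return locs
```

You would call it with the top-level sitemap URL and your own download function. The root tag is what tells the two file types apart: sitemapindex means "more sitemaps to follow", urlset means "these are the pages". The default of 5 levels is an arbitrary choice, so pick whatever limit suits your crawler.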

@Dani

How do I close this thread?

You can mark it as solved by toggling the little green switch next to the reply button. I’ll go ahead and do that for you.
