I am still at procedural-style programming, and so OOP code samples confuse me.
I found this OOP-style sample. Any chance you can show me how to convert it to procedural style?
Test the code. It works fine!
https://bytenota.com/parsing-an-xml-sitemap-in-php/

// sitemap url or sitemap file
$sitemap = 'https://bytenota.com/sitemap.xml';

// get sitemap content
$content = file_get_contents($sitemap);

// parse the sitemap content to object
$xml = simplexml_load_string($content);

// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement) {
    // get properties
    $url = $urlElement->loc;
    $lastmod = $urlElement->lastmod;
    $changefreq = $urlElement->changefreq;
    $priority = $urlElement->priority;

    // print out the properties
    echo 'url: '. $url . '<br>';
    echo 'lastmod: '. $lastmod . '<br>';
    echo 'changefreq: '. $changefreq . '<br>';
    echo 'priority: '. $priority . '<br>';

    echo '<br>---<br>';
}


Programmers,

I also need help to convert this to procedural-style PHP:

include_once('simplehtmldom_1_9_1/simple_html_dom.php');
//---
$url = "https://www.rocktherankings.com/post-sitemap.xml";
$html = new simple_html_dom();
$html->load_file($url);
//--
foreach ($html->find("loc") as $link)
{
    echo $link->innertext . "<br>";
}

Once I have managed to convert them to procedural style, with your help, I can then add more code of my own so the scripts save the extracted links to a MySQL database.
Good idea?
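Something like this is the saving step I have in mind, once the links sit in an array like $extracted_urls (a rough sketch; the credentials and table name are made up):

// Rough sketch of the saving step (procedural mysqli; credentials and table are made up)
$con = mysqli_connect('localhost', 'db_user', 'db_pass', 'crawler_db');
$stmt = mysqli_prepare($con, 'INSERT INTO extracted_links (url) VALUES (?)');

foreach ($extracted_urls as $extracted_url) {
    $link = (string) $extracted_url;
    mysqli_stmt_bind_param($stmt, 's', $link);
    mysqli_stmt_execute($stmt);
}

mysqli_close($con);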

@dani

Any chance you can show me how to convert the 2nd script in this thread to procedural-style PHP?
From your converted code, I will see the differences between the two styles and learn the syntax. Then I will try to convert the first code, mentioned in my original post, myself.
Good idea?
That way, once I have learnt the syntax, I won't have to bug you people in this forum anymore and can do the conversions myself whenever I find code samples online.
Good idea?

Nowadays, modern PHP code tends to be very OOP-focused, so I think the long term goal should be to step-by-step learn OOP so that you can come to appreciate the many benefits it offers, and be more comfortable when you end up exposed to it in the wild. (Which will undoubtedly keep happening if you continue with PHP). Please don't be overwhelmed. It's not as scary as it looks!

The concept behind OOP is that, the same way a variable can hold a string or a number, it can also hold an object. Those objects have properties. Those objects also have functions that can act upon them. Classes are used to define types of objects.
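As a tiny made-up illustration (this Sitemap class is invented just for this post, not a built-in PHP class):

// A class defines a type of object
class Sitemap
{
    // A property: every Sitemap object holds a URL
    public $url;

    public function __construct($url)
    {
        $this->url = $url;
    }

    // A function (method) that acts upon the object
    public function load()
    {
        return file_get_contents($this->url);
    }
}

// $sitemap now holds an object of type Sitemap
$sitemap = new Sitemap('https://bytenota.com/sitemap.xml');
echo $sitemap->url;          // access a property with ->
$content = $sitemap->load(); // call one of its functions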

For example, take this code from this other question of yours:

// Create a new variable called $doc that is of type DOMDocument (an instance of the DOMDocument class)
$doc = new DOMDocument();

// DOMDocument objects have different built-in functions
// One such function of DOMDocuments is loadHTML()
// Upon calling it, the $doc object now has full access to the HTML stored in one of its properties
$doc->loadHTML($message);

// Another function of DOMDocuments is getElementsByTagName()
// This function returns a variable of a different type, DOMNodeList
// $links is now a DOMNodeList object
$links = $doc->getElementsByTagName('a');

// One of the properties of DOMNodeLists is that they have a length
// We can fetch the length of the $links DOMNodeList
if ($links->length > 0)
{
    // We can also loop through DOMNodeLists just as if they were arrays
    // Each $link is a DOMNode object
    foreach ($links AS $link)
    {
        // DOMNode objects have a function called setAttribute() with which you can modify their attributes
        $link->setAttribute('class', 'link-style');
    }
}

I know this probably isn't what you want to hear, but not all OOP code can easily be converted to a procedural equivalent. Especially in cases like this where we are dealing with multiple object types, each with properties and class functions.

@dani

Thanks for your first-ever crawler code here:
https://www.daniweb.com/programming/web-development/threads/538867/php-xml-sitemap-crawler-tutorial-sought

But I am afraid the code is a bit too much for me right now, in some places. I will have to look into it once I am a bit more knowledgeable in PHP.
As of now, can you see the crawler code in my original post in this thread? I got that code from a tutorial, and it assumes the sitemap xml file (the starting point of the crawl) lists no further xml files, only html links.

Now, the xml sitemap I was working on had more xml sitemaps listed.
https://www.rocktherankings.com/sitemap_index.xml
And those other xml sitemaps were then listing the html files of the site. That means the code in my original post was not working and was showing a blank page, as I have to write more code for the crawler to go one level deep to find the site's html files. So the crawler should start on an xml file, find more xml files in it, and then visit those xml files to finally find the html links.
Now, look at this modification of the code you see in my original post:

$extracted_urls = array();
$crawl_xml_files = array();

// sitemap url or sitemap file
$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';
//$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.

// get sitemap content
$content = file_get_contents($sitemap);

// parse the sitemap content to object
$xml = simplexml_load_string($content);

// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement) 
{
    echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

    $path = $urlElement;
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    echo 'The extension is: ' .$ext; echo '<br>'; //DELETE IN DEV MODE

    echo __LINE__; echo '<br>'; //DELETE IN DEV MODE
    echo $urlElement; //DELETE IN DEV MODE

    if($ext=='xml') //This means, the links found on the current page are not links to the site's webpages but links to further xml sitemaps, and so the crawler needs to go another level deep to hunt for the site's html pages.
    {
        echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

        $crawl_xml_files[] = $url;
    }
    elseif($ext=='html' || $ext=='htm' || $ext=='shtml' || $ext=='shtm' || $ext=='php' || $ext=='py') //This means, the links found on the current page are the site's html pages and are not links to further xml sitemaps.
    {
        echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

        $extracted_urls[] = $extracted_url;

        // get properties of url (non-xml files)
        $extracted_urls[] = $extracted_url = $urlElement->loc;
        $extracted_last_mods[] = $extracted_lastmod = $urlElement->lastmod;
        $extracted_changefreqs[] = $extracted_changefreq = $urlElement->changefreq;
        $extracted_priorities[] = $extracted_priority = $urlElement->priority;
    }
}

print_r($crawl_xml_files); echo '<br>'; //DELETE IN DEV MODE
echo count($crawl_xml_files); echo '<br>'; //DELETE IN DEV MODE


if(!empty($crawl_xml_files))
{
    foreach($crawl_xml_files AS $crawl_xml_file)
    {
        // Further sitemap url or sitemap file
        $sitemap = "$crawl_xml_file"; //Has more xml files.

        // get sitemap content
        $content = file_get_contents($sitemap);

        // parse the sitemap content to object
        $xml = simplexml_load_string($content);

        // retrieve properties from the sitemap object
        foreach ($xml->url as $urlElement)
        {
            $path = $urlElement;
            $ext = pathinfo($path, PATHINFO_EXTENSION);
            echo 'The extension is: ' .$ext; echo '<br>'; //DELETE IN DEV MODE

            echo $urlElement; //DELETE IN DEV MODE

            if($ext=='xml') //This means, the links found on the current page are not links to the site's webpages but links to further xml sitemaps, and so the crawler needs to go another level deep to hunt for the site's html pages.
            {
                echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

                $crawl_xml_files[] = $url;
            }
            elseif($ext=='html' || $ext=='htm' || $ext=='shtml' || $ext=='shtm' || $ext=='php' || $ext=='py') //This means, the links found on the current page are the site's html pages and are not links to further xml sitemaps.
            {
                echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

                $extracted_urls[] = $extracted_url;

                // get properties of url (non-xml files)
                $extracted_urls[] = $extracted_url = $urlElement->loc;
                $extracted_last_mods[] = $extracted_lastmod = $urlElement->lastmod;
                $extracted_changefreqs[] = $extracted_changefreq = $urlElement->changefreq;
                $extracted_priorities[] = $extracted_priority = $urlElement->priority;
            }
        }
    }
}

echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

//Display all found html links.
print_r($extracted_urls); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
print_r($extracted_last_mods); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
print_r($extracted_changefreqs); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
print_r($extracted_priorities); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE

It does not work. I get this echoed:

48

157

172

188

205

231
The extension is:
237
231
The extension is:
237
[...the same three lines repeat for each of the remaining URLs...]
Array ( )
0
308
Array ( )

Warning: Undefined variable $extracted_last_mods in C:\Program Files\Xampp\htdocs\Work\buzz\Templates\crawler_Test.php on line 313

Warning: Undefined variable $extracted_changefreqs in C:\Program Files\Xampp\htdocs\Work\buzz\Templates\crawler_Test.php on line 315

Warning: Undefined variable $extracted_priorities in C:\Program Files\Xampp\htdocs\Work\buzz\Templates\crawler_Test.php on line 317

Where do you think I went wrong? On which particular lines?
Any chance, if you get the time and do not mind, you could fix it as much as you can? Best to make it do what your first-time crawler does in that other thread of mine: have the crawler keep crawling down infinite levels of xml files until it finds html files or html links. I can then compare your fix with my buggy one and learn from you.

Remember, I am trying to build the crawler on the skeleton of the code you see in my original post, as I understand that code without much trouble.
Skeleton of this tutorial code:
https://bytenota.com/parsing-an-xml-sitemap-in-php/

And so I am working on code that I do understand. I hope you understand.
At least, if someone can point out where I am going wrong, then I reckon I can fix it from then on. Right now, I am scratching my head. I get the feeling it is failing to scrape the found xml links and failing to spot the right extensions of the found links.
Hence the undefined variable errors.

Thanks

The code that you are starting with is procedural. It's simply incapable of doing what you're requesting. The only way to accomplish what you want is with multiple functions and classes. The reason is that you need to break the code up into functions (a section of code with inputs, known as parameters, and an output, known as a return value) in order to use recursion, which is when a function calls itself. You need to use recursion because the logic you need is as follows (a bare-bones sketch appears after the list):

A function that does the following:

  • load file
  • determine if file is a sitemap index or a sitemap file
  • if a sitemap file, process its URLs
  • if a sitemap index, call this function again on the sitemap index (this is called recursion, where a function calls itself)
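
That bare-bones sketch might look like this (the names are illustrative and error handling is minimal; a sketch, not a finished implementation):

// Minimal sketch of the recursive approach
function crawl_sitemap($sitemap_url, array &$extracted_urls)
{
    $content = file_get_contents($sitemap_url);
    if ($content === false) {
        return; // could not load the file
    }

    $xml = simplexml_load_string($content);
    if ($xml === false) {
        return; // not valid XML
    }

    if (isset($xml->sitemap)) {
        // Sitemap index: each <sitemap><loc> points to another sitemap,
        // so the function calls itself on it (recursion)
        foreach ($xml->sitemap as $sitemapElement) {
            crawl_sitemap((string) $sitemapElement->loc, $extracted_urls);
        }
    } else {
        // Regular sitemap file: each <url><loc> is a page URL
        foreach ($xml->url as $urlElement) {
            $extracted_urls[] = (string) $urlElement->loc;
        }
    }
}

$extracted_urls = [];
crawl_sitemap('https://www.rocktherankings.com/sitemap_index.xml', $extracted_urls);
print_r($extracted_urls);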

It is a futile effort to do what you're trying to do. If you are passionate about this project, I suggest that you do some more research on recursion and classes, and then follow the code that I provided, which does exactly what you're asking in the cleanest, most efficient, and simplest way possible.

In order for me to take your latest code and get it working, I would have to add functions, recursion, and classes. There's a reason these concepts exist in programming. The same way you are limited to what your code can do without using variables, or loops, you are limited to what your code can do without using classes and recursion.

You're only willing to learn half the alphabet and then getting frustrated that your sentences aren't making sense. Then you're asking me to rewrite your sentences for you but insisting I only use the half of the alphabet that you've already learned. I'm basically telling you that I can fix your sentences, but only if you allow me to use the complete alphabet.

@dani

I have been testing and re-testing. And I have found out that this line is failing:

// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement)
{
    $path = $urlElement; //THIS LINE IS FAILING TO EXTRACT THE EXTENSION
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    echo 'The extension is: ' .$ext; echo '<br>'; //DELETE IN DEV MODE

Q1.
Do you at least see any errors in my code to extract the file extension? If so, can you kindly fix this part of my previous post's code? Then that code, which is failing now, should work.

Q2.
Can you build on the skeleton of the following code, using the complete alphabet?

//Sitemap Crawler: If the starting url is an xml file listing further xml files, then it will show a blank page and not visit the found xml files to extract links from them.
//Sitemap Protocol: https://www.sitemaps.org/protocol.html

// sitemap url or sitemap file
$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.

// get sitemap content
$content = file_get_contents($sitemap);

// parse the sitemap content to object
$xml = simplexml_load_string($content);

// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement) 
{
    // get properties
    $url = $urlElement->loc;
    $lastmod = $urlElement->lastmod;
    $changefreq = $urlElement->changefreq;
    $priority = $urlElement->priority;

    // print out the properties
    echo 'url: '. $url . '<br>';
    echo 'lastmod: '. $lastmod . '<br>';
    echo 'changefreq: '. $changefreq . '<br>';
    echo 'priority: '. $priority . '<br>';

    echo '<br>---<br>';
}

Be my guest if that code is OOP. I understand the code, hence my fussing over it.
But if it's not OOP, then be my guest and build on this other tutorial code, which I understand and believe is OOP:

//Sitemap Crawler: If the starting url is an xml file listing further xml files, then it will just echo the found xml files and not extract links from them.
//Sitemap Protocol: https://www.sitemaps.org/protocol.html

include_once('simplehtmldom_1_9_1/simple_html_dom.php');

$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.

$html = new simple_html_dom();
$html->load_file($sitemap);

foreach($html->find("loc") as $link)
{
    echo $link->innertext."<br>";
}

If both codes are OOP, then be my guest building on both. That way, we newbies learn to build a crawler in two different ways. Try building them both without using cURL; we can move on to cURL later. That way, we newbies learn four versions from you.
This is a good idea. (You do not have a clapping-hands smiley here for me to post.)

Thanks a bunch!

@dani,

I realise you mentioned that my 1st code in my previous post is not OOP.
If the 2nd code is not OOP either, and both codes won't do for you, then be my guest writing from scratch, but try writing as much beginner-level code as possible so I do not struggle too much to understand it. If you write comments on the lines, then I should not have to bug you to explain the lines I fail to understand.
I'm afraid this one you wrote is over my head, as it's advanced level.
https://www.daniweb.com/programming/web-development/threads/538867/php-xml-sitemap-crawler-tutorial-sought

Thanks!

If you write comments on the lines, then I should not have to bug you to explain the lines I fail to understand.

I attempted to do that here and here

There is no way to accomplish what you want without OOP as well as recursion.

I know you say that one is advanced level, but I would consider it still beginner/introductory-level PHP, although it does use concepts such as recursion and OOP. However, those concepts are still typically taught in a first-semester college programming course, so I wouldn't necessarily consider them "advanced". Any code that would accomplish what you want would need to use those concepts.

I attempted to comment for you, line by line, how the BJ_Crawler class gets used. If you still don't understand, I can attempt to comment further, but I need you to first study up on what classes are, beyond the rudimentary explanation I tried to give the other day.

@dani

Ok. I did read up on classes once or twice, more than 6 months ago. Then other things kept me busy and I forgot to read more on it. I will have to start all over again, as I have forgotten what I learnt.

Please read up on it again, and then go back to my post at https://www.daniweb.com/programming/web-development/threads/538867/php-xml-sitemap-crawler-tutorial-sought#post2288470 where I tried to explain the BJ_Crawler class line by line. If you still don't understand, feel free to ask specific questions.

That class does exactly what you want in terms of processing unlimited nested sitemap index files.

@dani

Tell me one thing, dani.
If var_dump spits out data, then should not the foreach loop do the same for the same item?
OOP or procedural, the var_dump data and the foreach loop data are related. So, can you spot why the var_dump spits out data while each of the foreach loops fails?

Here is the simple code someone gave me as a fix of my code. I just added the foreach loops; his fix ended at the var_dumps:

$sitemap = 'https://bytenota.com/sitemap.xml';
    //$sitemap = 'https://www.daniweb.com/home-sitemap.xml';
    // get sitemap content
    $content = file_get_contents($sitemap);

    // parse the sitemap content to object
    $xml = simplexml_load_string($content);
    var_dump($xml);
    // Init arrays
    $crawl_xml_files = [];
    $extracted_urls = [];
    $extracted_last_mods = [];
    $extracted_changefreqs = [];
    $extracted_priorities = [];

    // retrieve properties from the sitemap object
    //foreach ($xml->url as $urlElement) {
    foreach ($xml->sitemap as $item) {
        // provide path of current xml/html file
        //$path = $urlElement;
        $path = (string)$item->loc;
        // get pathinfo
        $ext = pathinfo($path, PATHINFO_EXTENSION);
        echo 'The extension is: ' . $ext;
        echo '<br>'; //DELETE IN DEV MODE

        echo $item; //DELETE IN DEV MODE

        if ($ext == 'xml') //This means, the links found on the current page are not links to the site's webpages but links to further xml sitemaps, and so the crawler needs to go another level deep to hunt for the site's html pages.
        {
            echo __LINE__;
            echo '<br>'; //DELETE IN DEV MODE

            //$crawl_xml_files[] = $urlElement;
            $crawl_xml_files[] = $path;
    } elseif ($ext == 'html' || $ext == 'htm' || $ext == 'shtml' || $ext == 'shtm' || $ext == 'php' || $ext == 'py') //This means, the links found on the current page are the site's html pages and are not links to further xml sitemaps.
        {
            echo __LINE__;
            echo '<br>'; //DELETE IN DEV MODE

            $extracted_urls[] = $path;

            // get properties of url (non-xml files)
            $extracted_last_mods[] = $item->lastmod;
            $extracted_changefreqs[] = $item->changefreq;
            $extracted_priorities[] = $item->priority;
        }
    }
    var_dump($crawl_xml_files);
    var_dump($extracted_urls);
    var_dump($extracted_last_mods);
    var_dump($extracted_changefreqs);
    var_dump($extracted_priorities);

 // NOTE: The above var_dumps spit out data. In that case, the below foreach loops should too. But they do not!!!
    foreach($crawl_xml_files as $crawl_xml_file)
    {
        echo 'Xml File to crawl: ' .$crawl_xml_file;
    }

    echo __LINE__; //LINE 279
    echo '<br>'; //DELETE IN DEV MODE

    foreach($extracted_urls as $extracted_url)
    {
        echo 'Extracted Url: ' .$extracted_url;
    }

    echo __LINE__; //LINE 287
    echo '<br>'; //DELETE IN DEV MODE

    foreach($extracted_last_mods as $extracted_last_mod)
    {
        echo 'Extracted last Mod: ' .$extracted_last_mod;
    }

    echo __LINE__; //LINE 295
    echo '<br>'; //DELETE IN DEV MODE

    foreach($extracted_changefreqs as $extracted_changefreq)
    {
        echo 'Extracted Change Frequency: ' .$extracted_changefreq;
    }

    echo __LINE__; //LINE 303
    echo '<br>'; //DELETE IN DEV MODE

    foreach($extracted_priorities as $extracted_priority)
    {
        echo 'Extracted Priority: ' .$extracted_priority;
    }

    echo __LINE__; //LINE 307
    echo '<br>'; //DELETE IN DEV MODE

I get echoed a very long list of something like this:
207
object(SimpleXMLElement)#1 (1) { ["url"]=> array(528) { [0]=> object(SimpleXMLElement)#2 (4) { ["loc"]=> object(SimpleXMLElement)#530 (0) { } ["lastmod"]=> object(SimpleXMLElement)#531 (0) { } ["changefreq"]=> object(SimpleXMLElement)#532 (0) { } ["priority"]=> object(SimpleXMLElement)#533 (0) { } } [1]=> object(SimpleXMLElement)#3 (4) { ["loc"]=> object(SimpleXMLElement)#533 (0) { } ["lastmod"]=> object(SimpleXMLElement)#532 (0) { } ["changefreq"]=> object(SimpleXMLElement)#531 (0) { } ["priority"]=> object(SimpleXMLElement)#530 (0) { } } [...the same structure repeats for entries [2] through [527]...]

At the end, look, the foreach loops echo nothing:
[527]=> object(SimpleXMLElement)#529 (4) { ["loc"]=> object(SimpleXMLElement)#533 (0) { } ["lastmod"]=> object(SimpleXMLElement)#532 (0) { } ["changefreq"]=> object(SimpleXMLElement)#531 (0) { } ["priority"]=> object(SimpleXMLElement)#530 (0) { } } } } array(0) { } array(0) { } array(0) { } array(0) { } array(0) { } 279
287
295
303
311

Note position 1 in the var_dump data. It shows "url" there but does not show "url" in the other positions. Why?
Urls do exist in the other positions. You may confirm this manually:
$sitemap = 'https://bytenota.com/sitemap.xml';

Strange! Puzzling!

I'll check this out later today. I'm about to go out biking with my husband.

I’m sorry. I ended up coming down with a bad cold and I’ve been stuck in bed all day yesterday and today.


Strange! Puzzling!

Not at all, but don't despair!

So, can you spot why the var_dump spits out data while each of the foreach loops fails?

The foreach fails because you are trying to traverse $xml->sitemap, which doesn't exist in $xml. That is because $xml in your code IS the sitemap object (in a sense, so is $xml->url...see below).

Here is the simple code someone gave me as a fix of my code. I just added the foreach loops; his fix ended at the var_dumps:

It looks like a good start. Try this as your foreach loop:

    foreach ($xml as $item) {
        // provide path of current xml/html file
        //$path = $urlElement;
        $path = (string)$item->loc;
        // get pathinfo
        $ext = pathinfo($path, PATHINFO_EXTENSION);
        echo $item->loc . ($ext ? ' => The extension is: ' . $ext . '<br>' : '') . '<br>'; // only show extension if one is found

        // echo '<br>'; //DELETE IN DEV MODE

        // echo $item; //DELETE IN DEV MODE
...

It should show something like this:

https://bytenota.com/ => The extension is: com

https://bytenota.com/codeigniter-create-your-first-controller/
https://bytenota.com/learn-codeigniter-tutorials/
https://bytenota.com/codeigniter-creating-a-hello-world-application/
https://bytenota.com/codeigniter-4-how-to-remove-public-from-url/
https://bytenota.com/apache-ant-delete-all-files-in-a-directory-but-not-in-subdirectories/
https://bytenota.com/ruby-how-to-convert-all-folder-subfolders-files-to-lowercase/
https://bytenota.com/solved-typescript-error-property-x-has-no-initializer-and-is-not-definitely-assigned-in-the-constructor/
https://bytenota.com/sovled-typescript-error-object-is-possibly-null-or-undefined/
https://bytenota.com/php-get-different-days-between-two-days/ 
...

You will eventually want to extract your foreach loop to a function that will call itself whenever it finds an .xml extension, thus making it recursive.
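
For instance, that extraction might look something like this (a rough sketch; the function name is made up):

// Rough sketch: the same foreach, extracted into a function that recurses on .xml links
function process_sitemap($sitemap_url, array &$extracted_urls)
{
    $xml = simplexml_load_string(file_get_contents($sitemap_url));

    foreach ($xml as $item) {
        $path = (string)$item->loc;
        $ext = pathinfo($path, PATHINFO_EXTENSION);

        if ($ext === 'xml') {
            // found another sitemap: the function calls itself (recursion)
            process_sitemap($path, $extracted_urls);
        } else {
            $extracted_urls[] = $path;
        }
    }
}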

Note position 1 in the var_dump data. It shows "url" there but does not show "url" in the other positions. Why?

This is because $xml->url IS the sitemap object, made up of various urls (or locations - loc). So it will be the same whether you traverse $xml or $xml->url in your foreach loop.

Urls do exist in the other positions. You may confirm this manually:

Yes, they do exist, but those are locations (loc) under $xml->url. As you can also see, none of them are sitemaps (.xml files).

@dani

Do not worry. I am myself returning here after 3 days, so you did not keep me waiting.
No need for you to rush responding to my threads. Take your time. After all, I am not paying you like an employer.
I hope you are getting well.
Here, the tornado messed up the neighbouring country and then visited 25 of our districts. Luckily, our district did not get the visit, but the wind from neighbouring districts did cause us a little hassle, enough to have the electricity power station blacking us out for 23 hrs.

@gce517

Oh no. Not objects again. It just does not get through to me.
Any chance you can get the above code to work (in its simplest coding form) on any site? Work on any xml sitemap?
You know, I got Ubot Studio. With it, I can build desktop bots (.exe). I can easily build a bot to auto-visit domains, find their sitemaps, and extract links. But that would mean I would have to keep my home pc on 24/7 to crawl the whole web. I'd rather webmasters came to my webform and submitted their xml sitemaps, so my web bot (.php) can then crawl their links. That way, I won't have to keep my pc on 24/7, since the web crawler will be on the vps host's side and not on mine. Hence all the fuss to build a php web crawler.
There are tonnes of php crawlers online. Free ones. But I do not understand their code, and I do not like building my website with other people's code. I get no satisfaction that way. I prefer to learn and build things myself and then use my own little baby. My own built Frankenstein. I get a kick that way.
I think you understand.
Maybe I call my web crawler "Frankenstein Crawler"?
Once you have helped me on that, I will go and try to memorise the code, make slight changes (so it is not an exact copy of your code), and then get going setting my crawler loose on the www. :)
So, if you do not mind, what is your sitemap url? I might as well crawl your website and see if it works there. And I can try crawling daniweb.com too. ;)

I am curious.
If each website's xml sitemap has different tree names, then how come a general xml sitemap crawler extracts all the site links? I mean, the crawler will be programmed to look for certain named parent and child nodes. It won't know what the site's xml tree nodes are called (what the parent is called, what the children are called, etc.) in order to look for those particularly named nodes.
Most likely, the parser has some ai to detect the node names, extract the parent and child names, and then use these to find or extract the site links. That's the bit of code I need.


Oh no. Not objects again. It just does not get through to me.

I'm sorry (not really).

Any chance you can get the above code to work (in its simplest coding form) on any site? Work on any xml sitemap?

As it is, it should work on any xml sitemap that follows the sitemap protocol.

My own built Frankenstein.

I think you mean your "own built Monster." And remember, Frankenstein did not create the Monster from scratch. He used "parts" (snippets?) from different corpses (code?). So using other people's code is not necessarily a bad thing. At first, your end product will look ugly, relatively speaking, but as you learn more, you will improve upon it. Unless it is absolutely necessary (I understand where you are coming from), you don't want to reinvent the wheel. Do you?

Once you have helped me on that, then I go and try to memorise the code

I can't tell you what to do, but it will benefit you more to learn the concepts behind the code (including OOP) rather than the code itself. That way, you can apply those concepts in all your code. Perhaps that is what you meant? Either way, we've given you A LOT of information and resources to study, analyze, and learn from. Because you are not expecting us to code everything for you, right?

make slight changes (so not an exact copy of your code)

If you only make slight changes, it will still be someone else's code.

So, if you do not mind, then what is your sitemap url ?

I don't mind, but I don't have a website to share with you :-(

the parser has some ai to detect the node names

As long as the parser follows the protocol, it should be able to extract everything from the sitemap.
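
To be clear, no AI is needed: the protocol fixes the element names for every site. Trimmed down, the two file types always have these shapes (example.com is used purely as an illustration):

<!-- A sitemap index: fixed names sitemapindex, sitemap, loc -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
</sitemapindex>

<!-- A regular sitemap: fixed names urlset, url, loc -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/some-page/</loc></url>
</urlset>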

Hope this helps!

And remember, that code that was posted the other day (the BJ_Crawler one you were scared of because it was OOP) already correctly handles an infinite depth of sitemap index files and sitemap files. It also handles an unlimited number of sitemap or sitemap index files (from different websites) as starting points.

As far as where to get these starting points, users of your app can submit their sitemaps to you, or you can auto-detect sitemap files as they're often recorded in the website's robots.txt file, which is always located at domain.com/robots.txt
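
For example, a quick way to pull those out of robots.txt (a minimal sketch; daniweb.com is used purely as an illustration):

// Minimal sketch: read a site's robots.txt and pull out any "Sitemap:" lines
$robots = file_get_contents('https://www.daniweb.com/robots.txt');
preg_match_all('/^Sitemap:\s*(\S+)/mi', $robots, $matches);
print_r($matches[1]); // the declared sitemap URLs, if any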

@gce517

I read this a week earlier:
https://www.sitemaps.org/protocol.html

Nevertheless, this is what I learnt to use:

foreach ($xml->url as $urlElement) // Extracts page urls from a regular sitemap.

foreach ($xml->sitemap as $urlElement) // Extracts sitemap urls from a sitemap index.

Nah! I do not want anybody doing all the work for me. Only the bits where I am at a dead end. That is all.
For the last hour or two, I have been fiddling with simplehtmldom. My new thread:
https://www.daniweb.com/programming/web-development/threads/538995/experimenting-with-simplehtmldom

@dani

I did not bother testing the BJ_Crawler, as I know it will do all the job I want, since you built it.
But going through the code makes my head swim. And so I made a note of it in my Notepad++ for future use, for when I am more used to OOP. That's why I have been fiddling with very simple code for the last hour or two. And got stuck:
https://www.daniweb.com/programming/web-development/threads/538995/experimenting-with-simplehtmldom

If I cannot get that simplehtmldom one to work, then I'm afraid I'm gonna have to climb your BJ_Crawler Mount Everest soon. ;)

And you are right, the OOP in it is giving me a heart attack, as OOP makes me dizzy.
But I might as well look back into it every now and then and get over my fears by familiarising myself with each line.

PS - I have been running away from OOP since 2018. :(

I did not bother testing the BJ_Crawler, as I know it will do all the job I want, since you built it.

I did not build it. It was code provided on a webpage that YOU had linked to, so you found it.

@dani

Mmm. Then I either forgot about it or overlooked it.
Silly billy me!
