Folks,

Using DomDocument, I am trying to build a crawler. When I feed it a starting url (the initial url to start crawling & extracting links from), it should navigate to that url and extract all the links found on the page.

<?php

$content = file_get_contents($sitemapUrl); //Fetch the raw XML text of the sitemap.
// parse the fetched content (not the url string) to a SimpleXML object
$xml = simplexml_load_string($content);

$dom = new DOMDocument();
$dom->loadXML($content);
//The root element's name lives on documentElement, not on the document itself.
if ($dom->documentElement->nodeName === 'sitemapindex')
{
    //parse the index
    // retrieve properties from the sitemap object
    foreach ($xml->sitemap as $urlElement) //Extracts Sitemap Urls.
    {
        // get properties (index entries normally carry only loc and lastmod)
        $url = $urlElement->loc;
        $lastmod = $urlElement->lastmod;
        $changefreq = $urlElement->changefreq;
        $priority = $urlElement->priority;

        // print out the properties
        echo 'url: '. $url . '<br>';
        echo 'lastmod: '. $lastmod . '<br>';
        echo 'changefreq: '. $changefreq . '<br>';
        echo 'priority: '. $priority . '<br>';

        echo '<br>---<br>';
    }
}
else if ($dom->documentElement->nodeName === 'urlset')
{
    //parse url set
    // retrieve properties from the sitemap object
    foreach ($xml->url as $urlElement) //Extracts html file urls.
    {
        // get properties
        $url = $urlElement->loc;
        $lastmod = $urlElement->lastmod;
        $changefreq = $urlElement->changefreq;
        $priority = $urlElement->priority;

        // print out the properties
        echo 'url: '. $url . '<br>';
        echo 'lastmod: '. $lastmod . '<br>';
        echo 'changefreq: '. $changefreq . '<br>';
        echo 'priority: '. $priority . '<br>';

        echo '<br>---<br>';
    }
}

Now, how to write code to extract meta tags using DomDocument ?
Where can I find the code here ?
https://www.php.net/domdocument

@dani

Why do I keep coming across programmers who prefer DomDocument over simple_html_dom ?
What are the pros & cons of both ?

Now, how to write code to extract meta tags using DomDocument ?

The simple answer is that you can’t. At least not with the code you have supplied. This code inspects an XML sitemap index and pulls out different properties in the xml file. These files don’t contain meta tags. They aren’t HTML files.

@dani

Oh! Thanks for pointing out the error of my ways.
So, if you do not mind, may I see how you yourself would write very simple, concise code to extract the meta tags using the DomDocument & simple_html_dom parsers ?
I want to compare both and see which one I can memorise better. So bear in mind, I am going to try to memorise your code line for line.

Also, so that I do not always have to depend on you for code snippets, can you show me which page of the DomDocument manual shows how to extract the meta tags ?
And of course, can you also show me which page of the simple_html_dom manual shows how to extract the meta tags ?
I want to go through the docs of both parsers and learn to read & understand them. The docs are very big, and I cannot read them from top to bottom. So it is best to start where they teach how to extract the meta tags. Once I get the hang of reading & understanding the docs, I should be able to learn to walk by myself rather than bug pros each & every time for code snippets.
You see, I just learnt today how to read & understand the function syntaxes on php.net. For 6yrs now, whenever I got pointed to the manual, I never understood the function syntaxes. Instead, I checked the code samples and worked out from there how the functions work: what their params are and what kind of inputs the params take. Today, I bugged some programmers to teach me how to understand the syntaxes, and now I should be able to understand each function's syntax in the manual rather than do guess work. So, today has been my lucky day. Now, I am gonna bug you to do the same and teach me how to read the syntax in the 2 parsers' docs.
You see, in the past, when you and other programmers gave me code for the 2 parsers and pointed me to links in their docs, I gave one glance at the doc pages I was referred to and my head started spinning because I did not understand the syntaxes. I left double quick. That is why I am still at spot 1: I do not understand the parser syntaxes when I read some of their doc pages.
So to start with, show me the links for the 2 parsers that teach how to extract meta tags. And if you think I will still have questions because the pages describe things the hard way, not suitable for a beginner, then please explain those bits in layman's language. That should help me understand the puzzling bits.
Are you understanding what I am blabbering about or am I confusing you ?

@dani

Frankly, I do not understand the function syntaxes. That is what I told a programmer tonight.
Then another replied to read the manual section entitled How to read a function definition (prototype).
https://www.php.net/manual/en/about.prototypes.php#about.prototypes

Now you know. I do not know the proper fundamentals of programming or php. And yet, I have built pages such as reg, login, logout, search db (membership/account) pages using mysqli & prepared statements. Writing code from memory as I memorise the functions. I do not work on projects copy pasting code.
Only know procedural style. No pdo or oop. Ask me what is procedural style and what is oop and I do not know. Don't understand what all this object is. Someone tried explaining about class. All I understand is that a var got some characteristics and that is what they call class and object. Or whatever.

It's just that, back in late 2015, I started learning php from php.net. Then I found it too complicated. I tried tutorial sites instead and learnt the very basics: the basic syntax of the php language.
I enrolled in a php class, but before long the teacher quit his job and had his brother take over his position. The brother was able to teach all the other subjects, like css, but he did not know php, and I found myself in a position where I would have to teach him instead. I quit the school and found myself dropped in the middle of the sea, not knowing which direction to swim in (where the shore is). I put php on hold till early 2017 and have stuck to it since.

So, by myself at home, I started learning php, but not in proper order. That is why I do not know some fundamentals. I got no proper guidance. I just learn by reading tutorials here and there online, and you know very well most tutorials are outdated. I found that out the hard way. Had I known pdo was new then I never would have bothered with mysqli. That is one example.

https://www.php.net/manual/en/about.prototypes.php#about.prototypes

Above link begins with:

"Each function in the manual is documented for quick reference. Knowing how to read and understand the text will make learning PHP much easier. Rather than relying on examples or cut/paste, everyone should know how to read function definitions (prototypes)."

They took the words right out of my mouth! Because that is what I have been doing for 6yrs now. I do not know how to read the function syntax explanations in the manual and always rely on code examples to figure out how the functions work.

This should speak volumes about why you teach me something but I ask the same question again another day in another way, as if I did not understand you in the past. I sometimes think I understood but it turns out I did not. Or I understood it wrong.

So, if you do not mind, may I see how you yourself would write very simple, concise code to extract the meta tags using the DomDocument & simple_html_dom parsers ?

I would use DomDocument as it's built into PHP. However, in your code, you see you're loading an XML file (a sitemap or sitemap index file). Instead, I would do something like this:

// Initiate ability to manipulate the DOM and load that baby up
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);
libxml_clear_errors();

// Fetch all <meta> tags
$meta_tags = $doc->getElementsByTagName('meta');

if ($meta_tags->length > 0)
{
    foreach ($meta_tags as $tag)
    {
         // Do something ...   
    }

}

Oh, and then for each $tag in the loop, you can do:

// e.g. name="robots" and content="noindex"
$name = $tag->getAttribute('name');
$content = $tag->getAttribute('content');

Also, thank you for a little history of your journey with PHP. I have been programming nearly my entire life, but I started with PHP about 22 years ago when I started DaniWeb using phpBB. I slowly started reading over all of the PHP code that powered it and made small changes and adjustments here and there to customize. Then, I switched to vBulletin, and started making smaller modifications there. Then larger modifications. Then, eventually vBulletin was sold and I needed to switch again, but this time I decided to take all that knowledge I had on how phpBB was written and how vBulletin was written, and write my own PHP-based forum platform from scratch.

@dani

Thank you too for bothering twice to relate your programming history.
You see, I may be a complete newbie in programming but I have some business ideas. I am not daft enough to try to build a different script for each of the following:

CLONES
google/bing/yahoo
twitter
facebook/myspace
reddit/digg
yahoomail/hotmail/gmail

I have just spent these 6yrs learning the basics of an interactive or dynamic website (db interaction to dump & extract data). The plan is then to build a membership site: a website that gives users accounts. Build the reg, login, logout, site search, etc. pages. The usual pages a membership site has. And use that membership script as a template/skeleton.
So, if I now want to run a searchengine, I just tweak a few lines and turn the membership template into a searchengine. And when I want to run a social network (SN), I again use the template and tweak here & there.
That is my goal. Hence, I am busy building the membership pages.
It took me time to build the search page (pagination).
Now I am busy building the crawler. Once that last one is finished, I can launch my websites online one after the other.
I have a motto: if you want to compete with the big dogs online, make sure you add features for your users to earn money from. That will make them think of your new websites.

Anyway, I might as well build a forum too one day. A money making forum for:

new moms (those on maternity leave)
widows
unemployed
house wives

That should attract people to my websites.
Anyway, I have got to get that crawler finished and end these 6yrs of php-ing. Tired of php now.
Once the crawler is finished, I am quitting php for a much easier & newer lang that is taught to 12-13yr olds in UK & USA schools: Python. I should struggle less with that one than with php.

Is it not amazing that I do not know oop, not even what an "argument" or "object" means, and yet have 95% completed my membership template ? Lol! No programmer would believe me. But, you will.
The templates I am building: I wrote their code from scratch & from memory. No copy & paste. I only copy code during the learning process. Once I have learnt it, I write it in a new file from memory and make that one of the TEMPLATE pages.

@dani

What do you mean by:
// e.g. name="robots" and content="noindex"

If the robots file says not to index the page (eg. daniweb.com/home.php) then why should I be extracting metas from the page (eg. daniweb.com/home.php) ?

@dani

<?php

$url = "https://www.daniweb.com/programming/web-development/threads/540013/how-to-find-does-not-contain-or-does-contain";

$html = file_get_contents($url);

// Initiate ability to manipulate the DOM and load that baby up
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);
libxml_clear_errors();

// Fetch all <meta> tags
$meta_tags = $doc->getElementsByTagName('meta');

if ($meta_tags->length > 0)
{
    foreach ($meta_tags as $tag)
    {
        // e.g. name="robots" and content="noindex"
        echo $name = $tag->getAttribute('name'); echo '<br>';
        echo $content = $tag->getAttribute('content');  echo '<br>';
    }
}

?>

I am scratching my head. This is what the crawler extracted from your page:

Meta Name:
Meta Description:
Meta Name: viewport
Meta Description: width=device-width, initial-scale=1
Meta Name:
Meta Description: IE=edge
Meta Name: keywords
Meta Description: web development,php,does,contain,does,contain,forum,community,discussion,message board,help,question,Q&A
Meta Name: description
Meta Description: Ladies & Gentlemen, I got this array: $test = array('id','date_and_time','kw_1','kw_1_point','kw_2','kw_2_point','kw_3','kw_3_point','kw_4','kw_4_point'); ...
Meta Name:
Meta Description: How To Find DOES NOT CONTAIn or DOES CONTAIN ?
Meta Name:
Meta Description: Ladies & Gentlemen, I got this array: $test = array('id','date_and_time','kw_1','kw_1_point','kw_2','kw_2_point','kw_3','kw_3_point','kw_4','kw_4_point'); ...
Meta Name:
Meta Description: article
Meta Name:
Meta Description: https://www.daniweb.com/programming/web-development/threads/540013/how-to-find-does-not-contain-or-does-contain
Meta Name:
Meta Description: https://static.daniweb.com/connect/images/anonymous.png
Meta Name: twitter:image
Meta Description: https://static.daniweb.com/connect/images/anonymous.png
Meta Name: twitter:card
Meta Description: summary
Meta Name:
Meta Description: DaniWeb
Meta Name: twitter:site
Meta Description: @DaniWeb
Meta Name: twitter:creator
Meta Description: @DaniWeb
Meta Name: twitter:via
Meta Description: DaniWeb
Meta Name: twitter:url
Meta Description: https://www.daniweb.com/programming/web-development/threads/540013/how-to-find-does-not-contain-or-does-contain
Meta Name: twitter:title
Meta Description: How To Find DOES NOT CONTAIn or DOES CONTAIN ?
Meta Name: twitter:description
Meta Description: Ladies & Gentlemen, I got this array: ```` $test = array('id','date_and_time','kw_1','kw_1_point','kw_2','kw_2_point','kw_3','kw_3_point','kw_4','kw_4_point');
Meta Name:
Meta Description: 1
Meta Name:
Meta Description: 2
Meta Name:
Meta Description: 3
Meta Name:
Meta Description: php - How To Find DOES NOT CONTAIn or DOES CONTAIN ...
Meta Name:
Meta Description: 4
Meta Name:
Meta Description: How To Find DOES NOT CONTAIn or DOES CONTAIN ?
Meta Name:
Meta Description: active
Meta Name:
Meta Description: 80
Meta Name:
Meta Description: 80
Meta Name:
Meta Description: Re: How To Find DOES NOT CONTAIn or DOES CONTAIN ?
Meta Name:
Meta Description: 80
Meta Name:
Meta Description: 80
Meta Name:
Meta Description: Re: How To Find DOES NOT CONTAIn or DOES CONTAIN ?
Meta Name:
Meta Description: 80
Meta Name:
Meta Description: 80
Meta Name:
Meta Description: Re: How To Find DOES NOT CONTAIn or DOES CONTAIN ?
Meta Name:
Meta Description: 80
Meta Name:
Meta Description: 80
Meta Name:
Meta Description: Re: How To Find DOES NOT CONTAIn or DOES CONTAIN ?
Meta Name:
Meta Description: 80
Meta Name:
Meta Description: 80
Meta Name:
Meta Description: Re: How To Find DOES NOT CONTAIn or DOES CONTAIN ?
Meta Name:
Meta Description: 80
Meta Name:
Meta Description: 80

Do most of those lines look valid to you ? I see the same lines repeated. Why the repetition ?
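
A possible explanation for the blank names, offered as an assumption rather than something confirmed in this thread: not every <meta> tag carries a name attribute. Open Graph tags use property and microdata tags use itemprop, so getAttribute('name') returns an empty string for them. The repeated values may simply be the same text published under several schemes (description, Open Graph, Twitter). The sketch below uses hypothetical sample markup and checks several label attributes in turn:

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><head>
<meta name="description" content="Plain description">
<meta property="og:title" content="Open Graph title">
<meta itemprop="position" content="1">
<meta charset="utf-8">
</head><body></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$found = [];
foreach ($doc->getElementsByTagName('meta') as $tag)
{
    // Try each attribute that can act as the tag's label.
    $label = '';
    foreach (['name', 'property', 'itemprop', 'http-equiv'] as $attribute)
    {
        if ($tag->hasAttribute($attribute))
        {
            $label = $attribute . '=' . $tag->getAttribute($attribute);
            break;
        }
    }

    // Skip tags such as <meta charset> that carry no label/content pair.
    if ($label === '' || !$tag->hasAttribute('content'))
    {
        continue;
    }

    $found[$label] = $tag->getAttribute('content');
}

print_r($found);
```

Printing the label next to the content makes it easier to see which scheme each value came from.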

@dani

Do ignore the post before my previous post. I now understand what you meant in your "eg":
// e.g. name="robots" and content="noindex".

@dani

Can you show me the DomDocument links where you got the following meta tag extractor code from ?

<?php

$url = "https://www.daniweb.com/programming/web-development/threads/540013/how-to-find-does-not-contain-or-does-contain";

$html = file_get_contents($url);

// Initiate ability to manipulate the DOM and load that baby up
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);
libxml_clear_errors();

// Fetch all <meta> tags
$meta_tags = $doc->getElementsByTagName('meta');

if ($meta_tags->length > 0)
{
    foreach ($meta_tags as $tag)
    {
        // e.g. name="robots" and content="noindex"
        echo $name = $tag->getAttribute('name'); echo '<br>';
        echo $content = $tag->getAttribute('content');  echo '<br>';
    }
}

?>

I am guessing one link did not teach the whole code I see above but 3 links. One link taught you how to write error reporting for the parser, another link how to extract from different elements (how to extract from different tags) and another link how to create a new dom parser project.
I do not need the final one (where you got the code: $doc = new DOMDocument()), but I need the links to the other 2. So I can read the appropriate links.

In short:
Q1. Which link in the DOCs did you get this code from ?

libxml_use_internal_errors(true);
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);
libxml_clear_errors();

I need to learn what these error reporting lines mean and how many alternatives there are. Hence all the fuss.

Q2. And, which link in the DOCs did you get this code from ?

$doc->getElementsByTagName('meta');

// https://www.php.net/manual/en/function.file-get-contents
$html = file_get_contents($url);

$doc = new DOMDocument();

// https://www.php.net/manual/en/function.libxml-use-internal-errors.php
libxml_use_internal_errors(true);

// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

// https://www.php.net/manual/en/function.libxml-clear-errors.php
libxml_clear_errors();

// https://www.php.net/manual/en/domdocument.getelementsbytagname.php
$meta_tags = $doc->getElementsByTagName('meta');

// https://www.php.net/manual/en/class.domnodelist.php (length property)
if ($meta_tags->length > 0)
{
    // https://www.php.net/manual/en/class.domnodelist.php
    foreach ($meta_tags as $tag)
    {
        // https://www.php.net/manual/en/domelement.getattribute.php
        echo $name = $tag->getAttribute('name'); echo '<br>';
        echo $content = $tag->getAttribute('content');  echo '<br>';
    }
}

@dani

Thank you.
And how would you extract the title ?
All of these fail:

$title_tags = $doc->getElementsByTagName('title');
if ($title_tags->length > 0)
{
    foreach ($title_tags as $tag)
    {
        echo 'Title: ' .$name = $tag->getAttribute('text'); echo '<br>';
    }
}
$title_tags = $doc->getElementsByTagName('title');
if ($title_tags->length > 0)
{
    foreach ($title_tags as $tag)
    {
        echo 'Title: ' .$name = $tag->getAttribute('textContent'); echo '<br>';
    }
}
$title_tags = $doc->getElementsByTagName('title');
if ($title_tags->length > 0)
{
    // https://www.php.net/manual/en/class.domnodelist.php
    foreach ($title_tags as $tag)
    {
        // https://www.php.net/manual/en/domnodelist.item.php
        echo 'Title: ' .$name = $tag->getAttribute('innertext'); echo '<br>';
    }
}

@dani

Found this:
https://stackoverflow.com/questions/5869925/grabbing-title-of-a-website-using-dom

$title = '';
$dom = new DOMDocument();

if($dom->loadHTMLFile($urlpage)) {
    $list = $dom->getElementsByTagName("title");
    if ($list->length > 0) {
        $title = $list->item(0)->textContent;
    }
}

It works, but not in your structure that fetched the meta tags.
So I modified your meta tag fetching code, doing guess work:

$title_tags = $doc->getElementsByTagName('title');
if ($title_tags->length > 0)
{
    // https://www.php.net/manual/en/class.domnodelist.php
    foreach ($title_tags as $tag)
    {
        // https://www.php.net/manual/en/domnodelist.item.php
        echo 'Title: ' .$name = $tag->getAttribute('title[0]'); echo '<br>';
    }
}

Note the index I added. I simply copied the idea from the code I found, as it had an array index.
However, I am still in the dark as to why adding (0) worked.

In the code from Stack Overflow, line 5 $list = $dom->getElementsByTagName("title"); says to fetch a list of elements that match the criteria where <title>...</title> appears in the HTML. The same way there can be many <div>, or multiple <meta ...>, there can theoretically be multiple <title> too.

Line 6, if ($list->length > 0) says to check if there is at least one item in the list of elements.

Line 7, $title = $list->item(0)->textContent; says to look at item(0) aka the first item in the list, and then fetch its content.

Now if we look at the example you wrote, line 1 says to fetch a list of elements that have the <title> tag. Line 2 says to check if this list has at least 1 element in it. Line 5 says to loop through the list of <title> tags in the HTML page. Line 8 says that, for each <title> tag, get the attribute called title[0]. That doesn't make sense. Essentially, it would be looking for HTML code that looks like this: <title title[0]="Page Title">, which is not how a title is written.

Instead, if you want to modify your code, you would make line 8 look like this: echo 'Title: ' .$name = $tag->textContent; echo '<br>';

However, using a loop doesn't make much sense here because we can expect the page to only have one title tag in all of the HTML, so it's fine to just fetch the first (and probably only) occurrence with item(0).
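
To make the difference concrete, here is a minimal sketch on a hypothetical one-line page. The title text sits between the tags, so it is read with textContent, while getAttribute() only sees what is written inside the opening tag:

```php
<?php
// Hypothetical sample page; a real crawler would fetch this with file_get_contents().
$html = '<html><head><title>Page Title</title></head><body><p>Hello</p></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$title_tags = $doc->getElementsByTagName('title');

$title = '';
$from_attribute = '';
if ($title_tags->length > 0)
{
    $first = $title_tags->item(0);

    // The text between <title> and </title>.
    $title = trim($first->textContent);

    // <title> has no attribute called "text", so this is an empty string.
    $from_attribute = $first->getAttribute('text');
}

echo 'Title: ' . $title . '<br>';
```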

@dani

I tested again the latter code that I mentioned in my previous post. And you are right. It does not work. I thought it worked and asked you why, since I thought it was not making any sense. The reason I got the title echoed is that I had the stackoverflow code beneath it; that was the one echoing the title, and I thought my buggy code was.
Anyway, I am trying to fix my buggy code but failing. So far it looks like this:

Fail

$url = "https://www.daniweb.com/programming/web-development/threads/540013/how-to-find-does-not-contain-or-does-contain";

// https://www.php.net/manual/en/function.file-get-contents
$html = file_get_contents($url);

//https://www.php.net/manual/en/domdocument.construct.php
$doc = new DOMDocument();

// https://www.php.net/manual/en/function.libxml-use-internal-errors.php
libxml_use_internal_errors(true);

// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

// https://www.php.net/manual/en/function.libxml-clear-errors.php
libxml_clear_errors();

$title_tag = $doc->getElementsByTagName('title');
if ($title_tag->length > 0)
{
    echo 'Title: ' .$title = $title_tag[0]->getAttribute('title'); echo '<br>';
}

So, switched it to:

 echo 'Title: ' .$title = $title_tag->getAttribute('title'); echo '<br>';

Fail too.
And this is a fail too:

 echo 'Title: ' .$title = $title_tag->getnodeValue('title'); echo '<br>';

Note, I am not familiar with the -> as I am not into oop yet.
The first one should have worked, based on logic.
Trying to keep the code to your structure. The code that extracted the meta tags.

I think these lines need to be changed since we are not dealing with xml files anymore but html pages: .php, .htm, .html, etc. I will be extracting titles from title tags on such pages, not from pages that have an xml extension.

echo 'Title: ' .$title = $title_tag->item(0)->textContent; echo '<br>';

to get the first title item, and then get its textContent.

Note, I am not familiar with the -> as I am not into oop yet.

It means to get a property or perform a function on an object. In this case, $doc is a DOM document. $title_tag is a list of DOM node elements. $title_tag->item(0) is the first DOM node element in the list.
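
As a small illustration of that explanation (the sample markup is made up), the arrow without parentheses reads a property, and the arrow with parentheses calls a function that belongs to the object:

```php
<?php
libxml_use_internal_errors(true);

// $doc is an object: a DOM document.
$doc = new DOMDocument();

// -> followed by parentheses calls a function (method) on the object.
$doc->loadHTML('<html><head><title>Demo</title></head><body></body></html>');
libxml_clear_errors();

// This method returns another object: a list of DOM node elements.
$title_tag = $doc->getElementsByTagName('title');

// -> without parentheses reads a property of the object.
$count = $title_tag->length;

// The calls can be chained: first item in the list, then its text.
$first = $title_tag->item(0);
$text = $first->textContent;

echo $count . ' title tag found: ' . $text . '<br>';
```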

@dani

<?php   
$url = "https://www.daniweb.com/programming/web-development/threads/540013/how-to-find-does-not-contain-or-does-contain";

// https://www.php.net/manual/en/function.file-get-contents
$html = file_get_contents($url);

//https://www.php.net/manual/en/domdocument.construct.php
$doc = new DOMDocument();

// https://www.php.net/manual/en/function.libxml-use-internal-errors.php
libxml_use_internal_errors(true);

// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

// https://www.php.net/manual/en/function.libxml-clear-errors.php
libxml_clear_errors();

$title_tag = $doc->getElementsByTagName('title');
if ($title_tag->length>0)
{
    echo 'Title: ' .$title = $title_tag[0]->textContent; echo '<br>';
}
die;

Thanks!
But are there any lines that should not be there ?
Since this part of the code will not be dealing with xml files, is the above xml error reporting code necessary ?
Maybe I should add some other error reporting code that deals with failure to extract from regular html pages ?

I was under the impression that loadHTML() (and not just loadXML()) uses libxml. I might be wrong.

Just confirmed that's the case. If you go to:

https://www.php.net/manual/en/domdocument.loadhtml.php

it says:

While malformed HTML should load successfully, this function may generate E_WARNING errors when it encounters bad markup. libxml's error handling functions may be used to handle these errors.

So that's where I got the idea to use the libxml error handling functions.
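
If you ever want to look at those parser complaints instead of discarding them, libxml_get_errors() returns them as a list once internal error handling is switched on. A rough sketch with deliberately sloppy, made-up markup (note that the LIBXML_NOERROR and LIBXML_NOWARNING flags are left out here, since they suppress the reports):

```php
<?php
// Deliberately malformed sample markup (hypothetical).
$html = '<html><body><p>Unclosed paragraph<div></span></body></html>';

// Collect parser errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line and message.
$errors = libxml_get_errors();
foreach ($errors as $error)
{
    echo 'Line ' . $error->line . ': ' . trim($error->message) . '<br>';
}

// Empty the internal buffer so old errors do not pile up between pages.
libxml_clear_errors();
```

Which messages appear depends on the libxml version installed, so treat the output as informational only.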

@dani

Mmm. You must have had to read through the whole document in your student life to pick up a thing or two like that and remember it.
I do not bother nowadays to read whole docs, as I found I forget the first half by the time I finish reading the 2nd half. And so, I just stick to reading those parts that are relevant to the code or function I am working on. I have a short memory. You have a long one. That is good as I can make use of your memory. And other users can too. Thank God for that! Thank God you are helpful. I see no other women programmers across the internet but you. You should be all over CNN by now!

Anyway, if I did not misunderstand you, you are advising me to leave the code as it is. And so, I will stick to the code you see in my latest post above.

Thanks!

Yes, I too am constantly reading php.net docs each time something relevant to what I'm working on comes up. It's a very handy reference :) It's pretty much always a tab in my browser. However, after over 20 years of PHP programming, I think I must have read the entire docs ten times over already.

On line 22, $title_tag[0] and $title_tag->item(0) should be doing the same thing, but this is untested. If it works, great :)
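
A quick way to check this yourself is to read the same node both ways and compare. This is only a sketch on made-up markup, and it assumes a PHP version where array-style access on a DOMNodeList is supported (I believe that has been the case for a long time, but have not verified the exact version):

```php
<?php
$html = '<html><head><title>Same node</title></head><body></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$title_tag = $doc->getElementsByTagName('title');

// Documented form: the item() method.
$via_item = $title_tag->item(0)->textContent;

// Array-style form: assumed to be supported on the PHP version in use.
$via_index = $title_tag[0]->textContent;

var_dump($via_item === $via_index);
```

If the array form is not available on an older setup, item(0) is the safe choice.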

@dani

Since my whole searchengine is based on procedural style and mysqli, I do not want my crawler to be mixed with oop code.
Let's replace the oop -> with its procedural style equivalent. How do I do this:

$url = "https://www.daniweb.com/programming/web-development/threads/540013/how-to-find-does-not-contain-or-does-contain";

// https://www.php.net/manual/en/function.file-get-contents
$html = file_get_contents($url);

//https://www.php.net/manual/en/domdocument.construct.php
$doc = new DOMDocument();

// https://www.php.net/manual/en/function.libxml-use-internal-errors.php
libxml_use_internal_errors(true);

// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

// https://www.php.net/manual/en/function.libxml-clear-errors.php
libxml_clear_errors();

$title_tag = $doc->getElementsByTagName('title');
if ($title_tag->length>0)
{
    echo 'Title: ' .$title = $title_tag[0]->textContent; echo '<br>';
}

Since I have got your attention, I might as well finish the crawler, as it's taking too long to complete. I might as well bug you just a little longer tonight and finish it. Then, once that is out of the way, I can jump to building the .exe crawler for Windows os.
Let me know what kind of features you reckon the php crawler should have as a minimum. Right now, I can't think of any.

You know what ? If you are not too busy, can you make a long list of the minimum features you think the crawler should have ? Then, I can try building these features.

@dani

Does this forum not allow private messaging ?

It’s not possible to rewrite this code not using OOP. Functionality such as DomDocument has no non-oop equivalent.
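
One workaround, if the goal is just to keep the rest of the script looking procedural, is to tuck the object calls inside a plain function and call that function everywhere else. A minimal sketch (get_page_title() is a made-up helper name, not something built into PHP):

```php
<?php
// Hypothetical helper: hides the DOMDocument calls behind a plain function.
function get_page_title(string $html): string
{
    $doc = new DOMDocument();

    // Remember the previous setting so the helper has no side effects.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $title_tags = $doc->getElementsByTagName('title');

    return $title_tags->length > 0 ? trim($title_tags->item(0)->textContent) : '';
}

// The calling code stays procedural: no -> in sight.
$html = '<html><head><title>My Page</title></head><body></body></html>';
$title = get_page_title($html);
echo 'Title: ' . $title . '<br>';
```

The OOP is still there inside the function; it is only kept out of the main script.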

You can message me privately by clicking on my username/avatar to go to my profile and then clicking the “Continue Chat” button towards the bottom.

Unfortunately I’m not home right now and have a busy day ahead of me.

Web Gurus,

I thought I had finished the crawler, but I get an error that $html_page_urls is not defined.

If you look at the first few lines of the whole script:

<?php

//Preparing Crawler & Session: Initialising Variables.

//Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
//Data Scraped from SiteMaps or Xml Files.
$sitemaps  = []; //This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods  = []; //This will list dates of SiteMap pages last modified - found on Sitemap.
$sitemaps_change_freqs  = []; //This will list SiteMap pages frequencies of page updates - found on Sitemaps.
$sitemaps_priorities  = []; //This will list SiteMap pages priorities - found on Sitemaps.

//Data Scraped from SiteMaps or Xml Files.
$html_page_urls  = []; //This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).

Then you see this has been defined. Look:

$html_page_urls  = []; //Same as: $html_page_urls  = array();

And so, I do not understand why I get an error that this is not defined.
I get the error on this line:

function scrape_page_data()
{
    if(array_count_values($html_page_urls)>0)
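
A likely cause, assuming the function is called as shown: in PHP a function has its own variable scope, so a variable defined at the top of the script is not visible inside scrape_page_data() unless it is passed in as a parameter (or imported with the global keyword). Separately, array_count_values() returns an array, so count() is probably what was meant. A sketch with made-up sample urls:

```php
<?php
// Hypothetical sample data standing in for urls collected from a sitemap.
$html_page_urls = ['https://example.com/a.html', 'https://example.com/b.html'];

// The array is handed to the function as a parameter, so it is defined inside.
function scrape_page_data(array $html_page_urls): int
{
    $scraped = 0;

    // count() gives the number of elements in the array.
    if (count($html_page_urls) > 0)
    {
        foreach ($html_page_urls as $url)
        {
            // Real scraping of $url would go here; this sketch only counts.
            $scraped++;
        }
    }

    return $scraped;
}

$scraped = scrape_page_data($html_page_urls);
echo 'Pages to scrape: ' . $scraped . '<br>';
```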

CONTEXT

<?php

//Preparing Crawler & Session: Initialising Variables.

//Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
//Data Scraped from SiteMaps or Xml Files.
$sitemaps  = []; //This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods  = []; //This will list dates of SiteMap pages last modified - found on Sitemap.
$sitemaps_change_freqs  = []; //This will list SiteMap pages frequencies of page updates - found on Sitemaps.
$sitemaps_priorities  = []; //This will list SiteMap pages priorities - found on Sitemaps.

//Data Scraped from SiteMaps or Xml Files.
$html_page_urls  = []; //This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).
$html_page_last_mods  = []; //This will list dates of html pages last modified - found on Sitemap.
$html_page_change_freqs  = []; //This will list html pages frequencies of page updates - found on Sitemaps.
$html_page_priorities  = []; //This will list html pages priorities - found on Sitemaps.

//Preparing $ARRAYS For Step 2: To Deal with html pages meant for Human Visitors only.
//Data Scraped from Html Files. Not Xml SiteMap Files.
$html_page_titles  = []; //This will list crawled pages Titles - found on html pages.
$html_page_meta_names  = []; //This will list crawled pages Meta Tag Names - found on html pages.
$html_page_meta_descriptions  = []; //This will list crawled pages Meta Tag Descriptions - found on html pages.

// -----

//Step 1: Initiate Session - Feed Xml SiteMap Url. Crawing Starting Point.
//Crawl Session Starting Page/Initial Xml Sitemap.
$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.

$xml_content = file_get_contents($sitemap); //Fetch the raw Xml text of the SiteMap.
// parse the fetched content (not the url string) to a SimpleXML object
$xml = simplexml_load_string($xml_content);

$dom = new DOMDocument();
$dom->loadXML($xml_content);
$root_name = $dom->documentElement->nodeName; //The root element's name lives on documentElement, not on the document itself.

//Trigger following IF/ELSEs on each Crawled Page to check for link types. Whether Links lead to more SiteMaps (.xml) or webpages (.html, .htm, .php, etc.).
if ($root_name === 'sitemapindex')  //Current Xml SiteMap Page lists more Xml SiteMaps. Lists links to Xml links. Not lists links to html links.
{
    //parse the index
    // retrieve properties from the sitemap object
    foreach ($xml->sitemap as $urlElement) //Extracts Sitemap Urls.
    {
        // get properties (cast to string so the arrays hold plain text, not objects)
        $sitemaps[] = $sitemap_url = (string) $urlElement->loc;
        $sitemaps_last_mods[] = $last_mod = (string) $urlElement->lastmod;
        $sitemaps_change_freqs[] = $change_freq = (string) $urlElement->changefreq;
        $sitemaps_priorities[] = $priority = (string) $urlElement->priority;

        // print out the properties
        echo 'url: '. $sitemap_url . '<br>';
        echo 'lastmod: '. $last_mod . '<br>';
        echo 'changefreq: '. $change_freq . '<br>';
        echo 'priority: '. $priority . '<br>';

        echo '<br>---<br>';
    }
}
else if ($root_name === 'urlset')  //Current Xml SiteMap Page lists no more Xml SiteMap links. Lists only html links.
{
    //parse url set
    // retrieve properties from the sitemap object
    foreach ($xml->url as $urlElement) //Extracts html file urls.
    {
        // get properties (cast to string so the arrays hold plain text, not objects)
        $html_page_urls[] = $html_page_url = (string) $urlElement->loc;
        $html_page_last_mods[] = $last_mod = (string) $urlElement->lastmod;
        $html_page_change_freqs[] = $change_freq = (string) $urlElement->changefreq;
        $html_page_priorities[] = $priority = (string) $urlElement->priority;

        // print out the properties
        echo 'url: '. $html_page_url . '<br>';
        echo 'lastmod: '. $last_mod . '<br>';
        echo 'changefreq: '. $change_freq . '<br>';
        echo 'priority: '. $priority . '<br>';

        echo '<br>---<br>';
    }
}
else
{
    //Scrape Webpage Data as current page is an html page for visitors and not an Xml SiteMap page for Crawlers.
    //scrape_page_data(); //Scrape Page Title & Meta Tags.
}

echo 'SiteMaps Crawled: ---';echo '<br><br>'; 
//count() returns the number of items. array_count_values() returns an array, so comparing it with 0 is always true.
if(count($sitemaps)>0)
{   
    print_r($sitemaps);
    echo '<br>';
}
if(count($sitemaps_last_mods)>0)
{   
    print_r($sitemaps_last_mods);
    echo '<br>';
}
if(count($sitemaps_change_freqs)>0)
{   
    print_r($sitemaps_change_freqs);
    echo '<br>';
}
if(count($sitemaps_priorities)>0)
{   
    print_r($sitemaps_priorities);
    echo '<br><br>'; 
}

echo 'Html Pages Crawled: ---'; echo '<br><br>'; 

if(count($html_page_urls)>0)
{   
    print_r($html_page_urls);
    echo '<br>';
}
if(count($html_page_last_mods)>0)
{   
    print_r($html_page_last_mods);
    echo '<br>';
}
if(count($html_page_change_freqs)>0)
{   
    print_r($html_page_change_freqs);
    echo '<br>';
}
if(count($html_page_priorities)>0)
{   
    print_r($html_page_priorities);
    echo '<br>';
} 

scrape_page_data($html_page_urls); //Scrape Page Title & Meta Tags.

function scrape_page_data(array $html_page_urls)
{
    //Variables defined outside a function are not visible inside it, so the url list is passed in as a parameter.
    foreach($html_page_urls AS $url)
    {
        //Extract Page's Meta Data & Title.
        // https://www.php.net/manual/en/function.file-get-contents
        $html = file_get_contents((string) $url);
        if ($html === false || $html === '')
        {
            continue; //Skip pages that could not be downloaded.
        }

        //https://www.php.net/manual/en/domdocument.construct.php
        $doc = new DOMDocument();

        // https://www.php.net/manual/en/function.libxml-use-internal-errors.php
        libxml_use_internal_errors(true);

        // https://www.php.net/manual/en/domdocument.loadhtml.php
        $doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

        // https://www.php.net/manual/en/function.libxml-clear-errors.php
        libxml_clear_errors();

        // https://www.php.net/manual/en/domdocument.getelementsbytagname.php
        $meta_tags = $doc->getElementsByTagName('meta');

        // https://www.php.net/manual/en/class.domnodelist.php
        foreach ($meta_tags as $tag)
        {
            // https://www.php.net/manual/en/domelement.getattribute.php
            $name = $tag->getAttribute('name');
            $content = $tag->getAttribute('content');
            echo 'Name: ' . $name . '<br>';
            echo 'Content: ' . $content . '<br>';
        }

        //EXAMPLE 1: Extract Title
        $title_tag = $doc->getElementsByTagName('title');
        if ($title_tag->length>0)
        {
            // https://www.php.net/manual/en/domnodelist.item.php
            $title = $title_tag->item(0)->textContent;
            echo 'Title: ' . $title . '<br>';
        }

        //EXAMPLE 2: Extract Title
        $title_tag = $doc->getElementsByTagName('title');

        for ($i = 0; $i < $title_tag->length; $i++) {
            echo $title_tag->item($i)->nodeValue . "\n";
        }
    }
}
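
A note on the sitemap check in the script, since it is easy to trip over: on a DOMDocument object, nodeName is the name of the document node itself ('#document'), while the name of the root element (sitemapindex or urlset) is on documentElement. Below is a minimal sketch of that on an inline, made-up sitemap index, so nothing is fetched. The example.com URLs and the names $document_name, $root_name and $child_sitemaps are assumptions for illustration only.

```php
<?php
// Made-up sitemap index for illustration; the URLs are placeholders.
$xml_string = <<<XML
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc><lastmod>2023-01-01</lastmod></sitemap>
  <sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml_string);

$document_name = $dom->nodeName;                  // name of the document node itself
$root_name     = $dom->documentElement->nodeName; // name of the root element

// SimpleXML parses the XML text (not a URL); a sitemap index holds <sitemap> children.
$xml = simplexml_load_string($xml_string);
$child_sitemaps = [];
foreach ($xml->sitemap as $entry) {
    $child_sitemaps[] = (string) $entry->loc; // cast so the array holds plain strings
}

echo $document_name, "\n";
echo $root_name, "\n";
print_r($child_sitemaps);
```

In a urlset file the children would be url elements instead of sitemap elements, so the loop would read $xml->url.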

Test the code on your localhost and see!
Puzzling!
It's 3:06am here, and I am not sleepy enough to have made a typo in the $var name!
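
One thing worth ruling out when a variable name looks right but PHP still complains about it: function scope. A function body does not see variables defined outside it unless they are passed in as arguments (or imported with the global keyword), so a name can be defined at the top level and still be undefined inside a function. This is only a guess at the cause. A minimal sketch, with hypothetical names ($urls and the two count_urls_* functions are made up for illustration):

```php
<?php
// Hypothetical names throughout; this only illustrates PHP function scope.
$urls = ['https://example.com/a.html', 'https://example.com/b.html'];

// Looks for $urls in its own local scope, where it was never defined.
function count_urls_without_param(): int
{
    // isset() is false here: the top-level $urls is not visible inside the function.
    return isset($urls) ? count($urls) : 0;
}

// Receives the array explicitly as an argument.
function count_urls_with_param(array $urls): int
{
    return count($urls);
}

echo count_urls_without_param(), "\n"; // the function sees no $urls at all
echo count_urls_with_param($urls), "\n"; // the function sees the two entries passed in
```

Passing the array in as a parameter keeps the function independent of whatever happens to be defined at the top level, which is usually easier to debug than the global keyword.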

@pritaeas

If you do not mind me asking, what do you make of the nonsensical error you see above?

@jawass

Are you a PHP developer?
