@dani

I'm checking out these two working code snippets of yours. I have some basic questions.

1

<?php

ini_set('display_errors',1);
ini_set('display_startup_errors',1);
error_reporting(E_ALL);

//Dan's Code.
//Code from: https://www.daniweb.com/programming/web-development/threads/538868/simplehtmldom-failing#post2291972
//Sitemap Protocol: https://www.sitemaps.org/protocol.html

// Initiate ability to manipulate the DOM and load that baby up
$doc = new DOMDocument();

$message = file_get_contents('https://www.daniweb.com/programming/web-development/threads/538868/simplehtmldom-failing#post2288453');

// https://www.php.net/manual/en/function.libxml-use-internal-errors.php
libxml_use_internal_errors(true);

// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($message, LIBXML_NOENT|LIBXML_COMPACT);

// https://www.php.net/manual/en/function.libxml-clear-errors.php
libxml_clear_errors();

// Fetch all <a> tags
$links = $doc->getElementsByTagName('a');

// If <a> tags exist ...
if ($links->length > 0)
{
    // For each <a> tag ...
    foreach ($links AS $link)
    {
        $link->setAttribute('class', 'link-style');
    }
}
// Because we are actually manipulating the DOM, DOMDocument will add complete <html><body> tags we need to strip out
$message = str_replace(array('<body>', '</body>'), '', $doc->saveHTML($doc->getElementsByTagName('body')->item(0)));

?>

2

<?php

ini_set('display_errors',1);
ini_set('display_startup_errors',1);
error_reporting(E_ALL);

//Dan's Code.
//CODE FROM: https://www.daniweb.com/programming/web-development/threads/540121/how-to-extract-meta-tags-using-domdocument
$url = "https://www.daniweb.com/programming/web-development/threads/540013/how-to-find-does-not-contain-or-does-contain";

// https://www.php.net/manual/en/function.file-get-contents
$html = file_get_contents($url);

//https://www.php.net/manual/en/domdocument.construct.php
$doc = new DOMDocument();

// https://www.php.net/manual/en/function.libxml-use-internal-errors.php
libxml_use_internal_errors(true);

// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

// https://www.php.net/manual/en/function.libxml-clear-errors.php
libxml_clear_errors();

//EXTRACT METAS
// https://www.php.net/manual/en/domdocument.getelementsbytagname.php
$meta_tags = $doc->getElementsByTagName('meta');

// https://www.php.net/manual/en/domnodelist.item.php
if ($meta_tags->length > 0)
{
    // https://www.php.net/manual/en/class.domnodelist.php
    foreach ($meta_tags as $tag)
    {
        // https://www.php.net/manual/en/domnodelist.item.php
        echo 'Name: ' .$name = $tag->getAttribute('name'); echo '<br>';
        echo 'Content: ' .$content = $tag->getAttribute('content');  echo '<br>';
    }
}

//EXAMPLE 1: EXTRACT TITLE
//CODE FROM: https://www.daniweb.com/programming/web-development/threads/540121/how-to-extract-meta-tags-using-domdocument
$title_tag = $doc->getElementsByTagName('title');
if ($title_tag->length>0)
{
    echo 'Title: ' .$title = $title_tag[0]->textContent; echo '<br>';
}

?>

Q1.
In the first snippet you call new DOMDocument() before file_get_contents(), while in the second you do the reverse. My reasoning says the order should not matter, but which order is best practice for letting the PHP interpreter handle the job faster?

Q2.
In both snippets, you wrote ...

// https://www.php.net/manual/en/function.libxml-use-internal-errors.php
libxml_use_internal_errors(true);

// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

// https://www.php.net/manual/en/function.libxml-clear-errors.php
libxml_clear_errors();

... after both new DOMDocument() and file_get_contents().

Does it have to be in this order, or can I add these three error-handling lines before new DOMDocument() and file_get_contents()?
My reasoning says the order should not matter, but which placement is best practice for letting the PHP interpreter handle the job faster?

Actually, I would prefer to add them at the top instead. Is this OK?
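For what it's worth, here is a minimal sketch (my own arrangement, not from the thread) with the libxml switch moved to the top; inline HTML stands in for file_get_contents():

```php
<?php
// Sketch: libxml_use_internal_errors() is a process-wide switch, so it can
// be flipped once at the top of the script. It only needs to run before
// loadHTML(); its order relative to new DOMDocument() or
// file_get_contents() makes no difference.
libxml_use_internal_errors(true);

$html = '<p>Hello';                    // stand-in for file_get_contents()
$doc  = new DOMDocument();
$doc->loadHTML($html, LIBXML_COMPACT); // parse; malformed HTML is tolerated

libxml_clear_errors();                 // discard whatever the parse recorded

echo $doc->getElementsByTagName('p')->length; // 1
```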

Q3.
In the first snippet, you load the HTML with these flags ...

// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($message, LIBXML_NOENT|LIBXML_COMPACT);

... while in the second snippet, with these ...

// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

... why did you do it this way? What is the significance of the difference?

Q3A. What issue will I face if I swap the two flag sets?
Q3B. What is the reasoning behind the way you did things?
Q3C. What is the real difference between the two sets of flags?
Q3D. What do LIBXML_NOENT and LIBXML_COMPACT mean?
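Not an answer from the thread, but a hedged sketch of what the flags do: LIBXML_COMPACT is a memory/speed optimization for small text nodes, LIBXML_NOENT substitutes entities during parsing, and LIBXML_NOERROR / LIBXML_NOWARNING stop libxml from recording parse errors and warnings at all, so libxml_get_errors() stays empty:

```php
<?php
// Sketch of the flag difference. <foo> is an invalid HTML tag, so parsing
// it normally records a libxml error ("Tag foo invalid").
libxml_use_internal_errors(true);

$bad = '<p>hi</p><foo>bar</foo>';

$doc1 = new DOMDocument();
$doc1->loadHTML($bad, LIBXML_COMPACT);       // error is recorded
$recorded = count(libxml_get_errors());
libxml_clear_errors();

$doc2 = new DOMDocument();
$doc2->loadHTML($bad, LIBXML_COMPACT | LIBXML_NOERROR | LIBXML_NOWARNING);
$suppressed = count(libxml_get_errors());    // nothing recorded this time
libxml_clear_errors();

var_dump($recorded > 0);      // bool(true)
var_dump($suppressed === 0);  // bool(true)
```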

Q4. Anything else I need to know ?

All 9 Replies

@dani

<?php

ini_set('display_errors',1);
ini_set('display_startup_errors',1);
error_reporting(E_ALL);
?>

<?php

//START OF SCRIPT FLOW.

//Preparing Crawler & Session: Initialising Variables.

//Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
//Data Scraped from SiteMaps or Xml Files.
$sitemaps  = []; //This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods  = []; //This will list dates of SiteMap pages last modified - found on Sitemaps.
$sitemaps_change_freqs  = []; //This will list SiteMap page update frequencies - found on Sitemaps.
$sitemaps_priorities  = []; //This will list SiteMap pages priorities - found on Sitemaps.

//Data Scraped from SiteMaps or Xml Files.
$html_page_urls  = []; //This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).
$html_page_last_mods  = []; //This will list dates of html pages last modified - found on Sitemap.
$html_page_change_freqs  = []; //This will list html page update frequencies - found on Sitemaps.
$html_page_priorities  = []; //This will list html pages priorities - found on Sitemaps.

//Preparing $ARRAYS For Step 2: To Deal with html pages meant for Human Visitors only.
//Data Scraped from Html Files. Not Xml SiteMap Files.
$html_page_meta_names  = []; //This will list crawled pages Meta Tag Names - found on html pages.
$html_page_meta_descriptions  = []; //This will list crawled pages Meta Tag Descriptions - found on html pages.
$html_page_titles  = []; //This will list crawled pages Titles - found on html pages.
// -----

//Step 1: Initiate Session - Feed Xml SiteMap Url. Crawling Starting Point.
//Crawl Session Starting Page/Initial Xml Sitemap. (NOTE: Has to be .xml Sitemap).
$initial_url = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.

//$xmls = file_get_contents($initial_url); //Should I stick to this line or below line ?
//Parse the sitemap content to object
//$xml = simplexml_load_string($xmls); //Should I stick to this line or above line ?
$xml = simplexml_load_string(file_get_contents($initial_url)); //Code from Dani: https://www.daniweb.com/programming/web-development/threads/540168/what-to-lookout-for-to-prevent-crawler-traps

$dom = new DOMDocument();
$dom->loadXML($xml); //LINE: 44

echo __LINE__; echo '<br>'; //LINE: 46

extract_links($xml);

echo __LINE__; echo '<br>';  //LINE: 50

foreach($sitemaps AS $sitemap)
{
    echo __LINE__; echo '<br>';
    extract_links($sitemap); //Extract Links on page.
}

foreach($html_page_urls AS $html_page_url)
{
    echo __LINE__; echo '<br>';
    extract_links($html_page_url); //Extract Links on page.
}

scrape_page_data(); //Scrape Page Title & Meta Tags.

//END OF SCRIPT FLOW.

//FUNCTIONS BEYOND THIS POINT.

//Links Extractor.
function extract_links()
{
    echo __LINE__; echo '<br>';  //LINE: 73

    GLOBAL $dom;
    //Trigger following IF/ELSEs on each Crawled Page to check for link types. Whether Links lead to more SiteMaps (.xml) or webpages (.html, .htm, .php, etc.).
    if ($dom->nodeName === 'sitemapindex')  //Current Xml SiteMap Page lists more Xml SiteMaps. Lists links to Xml links. Not lists links to html links.
    {
        echo __LINE__; echo '<br>';

        //parse the index
        // retrieve properties from the sitemap object
        foreach ($xml->sitemapindex as $urlElement) //Extracts xml sitemap urls.
        {
            // get properties
            $sitemaps[] = $sitemap_url = $urlElement->loc;
            $sitemaps_last_mods[] = $last_mod = $urlElement->lastmod;
            $sitemaps_change_freqs[] = $change_freq = $urlElement->changefreq;
            $sitemaps_priorities[] = $priority = $urlElement->priority;

            // print out the properties
            echo 'url: '. $sitemap_url . '<br>';
            echo 'lastmod: '. $last_mod . '<br>';
            echo 'changefreq: '. $change_freq . '<br>';
            echo 'priority: '. $priority . '<br>';

            echo '<br>---<br>';
        }
    } 
    else if ($dom->nodeName === 'urlset')  //Current Xml SiteMap Page lists no more Xml SiteMap links. Lists only html links.
    {
        echo __LINE__; echo '<br>';

        //parse url set
        // retrieve properties from the sitemap object
        foreach ($xml->urlset as $urlElement) //Extracts Sitemap Urls.
        {
            // get properties
            $html_page_urls[] = $html_page_url = $urlElement->loc;
            $html_page_last_mods[] = $last_mod = $urlElement->lastmod;
            $html_page_change_freqs[] = $change_freq = $urlElement->changefreq;
            $html_page_priorities[] = $priority = $urlElement->priority;

            // print out the properties
            echo 'url: '. $html_page_url . '<br>';
            echo 'lastmod: '. $last_mod . '<br>';
            echo 'changefreq: '. $change_freq . '<br>';
            echo 'priority: '. $priority . '<br>';

            echo '<br>---<br>';
        }
    } 

    GLOBAL $sitemaps;
    GLOBAL $sitemaps_last_mods;
    GLOBAL $sitemaps_change_freqs;
    GLOBAL $sitemaps_priorities;

    GLOBAL $html_page_urls;
    GLOBAL $html_page_last_mods;
    GLOBAL $html_page_change_freqs;
    GLOBAL $html_page_priorities;

    echo 'SiteMaps Crawled: ---'; echo '<br><br>'; 
    if(array_count_values($sitemaps)>0)
    {   
        print_r($sitemaps);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_last_mods)>0)
    {   
        print_r($sitemaps_last_mods);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_change_freqs)>0)
    {   
        print_r($sitemaps_change_freqs);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_priorities)>0)
    {   
        print_r($sitemaps_priorities);
        echo '<br><br>'; 
    }

    echo 'Html Pages Crawled: ---'; echo '<br><br>'; 

    if(array_count_values($html_page_urls)>0)
    {   
        print_r($html_page_urls);
        echo '<br>';
    }
    if(array_count_values($html_page_last_mods)>0)
    {   
        print_r($html_page_last_mods);
        echo '<br>';
    }
    if(array_count_values($html_page_change_freqs)>0)
    {   
        print_r($html_page_change_freqs);
        echo '<br>';
    }
    if(array_count_values($html_page_priorities)>0)
    {   
        print_r($html_page_priorities);
        echo '<br>';
    }
}

//Meta Data & Title Extractor.
function scrape_page_data()
{
    GLOBAL $html_page_urls;
    if(array_count_values($html_page_urls)>0)
    {       
        foreach($html_page_urls AS $url)
        {
            // https://www.php.net/manual/en/function.file-get-contents
            $html = file_get_contents($url);

            //https://www.php.net/manual/en/domdocument.construct.php
            $doc = new DOMDocument();

            // https://www.php.net/manual/en/function.libxml-use-internal-errors.php
            libxml_use_internal_errors(true);

            // https://www.php.net/manual/en/domdocument.loadhtml.php
            $doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

            // https://www.php.net/manual/en/function.libxml-clear-errors.php
            libxml_clear_errors();

            // https://www.php.net/manual/en/domdocument.getelementsbytagname.php
            $meta_tags = $doc->getElementsByTagName('meta');

            // https://www.php.net/manual/en/domnodelist.item.php
            if ($meta_tags->length > 0)
            {
                // https://www.php.net/manual/en/class.domnodelist.php
                foreach ($meta_tags as $tag)
                {
                    // https://www.php.net/manual/en/domnodelist.item.php
                    echo 'Meta Name: ' .$meta_name = $tag->getAttribute('name'); echo '<br>';
                    echo 'Meta Content: ' .$meta_content = $tag->getAttribute('content');  echo '<br>';
                    $html_page_meta_names[] = $meta_name;
                    $html_page_meta_descriptions[] = $meta_content;
                }
            }

            //EXAMPLE 1: Extract Title
            $title_tag = $doc->getElementsByTagName('title');
            if ($title_tag->length>0)
            {
                echo 'Title: ' .$title = $title_tag[0]->textContent; echo '<br>';
                $html_page_titles[] = $title;
            }

            //EXAMPLE 2: Extract Title
            $title_tag = $doc->getElementsByTagName('title');

            for ($i = 0; $i < $title_tag->length; $i++) {
                echo 'Title: ' .$title = $title_tag->item($i)->nodeValue . "\n";
                $html_page_titles[] = $title;
            }
        }
    }
}

if(array_count_values($html_page_meta_names)>0)
{   
    print_r($html_page_meta_names);
    echo '<br>';
}

if(array_count_values($html_page_meta_descriptions)>0)
{   
    print_r($html_page_meta_descriptions);
    echo '<br>';
}

if(array_count_values($html_page_titles)>0)
{   
    print_r($html_page_titles);
    echo '<br>';
}

//END OF FUNCTIONS.
die;

?>

I am getting this error:

( ! ) Warning: DOMDocument::loadXML(): Start tag expected, '&lt;' not found in Entity, line: 6 in C:\wamp64\www\Work\buzz\Templates\crawler_Test.php on line 44
Call Stack

Time Memory Function Location

1 0.0077 363688 {main}( ) ...\crawler_Test.php:0
2 3.2055 366384 loadXML( $source = class SimpleXMLElement { public $sitemap = [0 => class SimpleXMLElement { ... }, 1 => class SimpleXMLElement { ... }, 2 => class SimpleXMLElement { ... }, 3 => class SimpleXMLElement { ... }] } ) ...\crawler_Test.php:44
46
73
SiteMaps Crawled: ---

Array ( )
Html Pages Crawled: ---

Array ( )
Array ( )
Array ( )
Array ( )
50
Array ( )
Array ( )
Array ( )

Q5A.
Why am I seeing this error?
I reckon I should add appropriate error-reporting lines. If so, then ...
where?
Shall I add it just after:

$xml = simplexml_load_string(file_get_contents($initial_url)); //Code from Dani: https://www.daniweb.com/programming/web-development/threads/540168/what-to-lookout-for-to-prevent-crawler-traps

$dom = new DOMDocument();
$dom->loadXML($xml); //LINE: 44

Or,

shall I add it inside:

function extract_links()
{

Q5B.
And which error-reporting lines should I add?

Q6.
Or is the error due to something else, like bad coding on my part? If so, how do I fix it? The error happens here:

$dom->loadXML($xml); //LINE: 44

Thank you for your aid!

@reverend_jim

Do you use simple_html_dom() or DOMDocument for HTML parsing?

@pritaeas

Is simplexml_load_string() an XML file parser, just like simple_html_dom() and DOMDocument are HTML parsers? Of the latter two, which one do you prefer, and why?

Thank you

Is simplexml_load_string() an XML file parser, just like simple_html_dom() and DOMDocument are HTML parsers? Of the latter two, which one do you prefer, and why?

Kinda, yeah. Instead of creating a DOMDocument, it creates a SimpleXML element. I don't personally have experience traversing SimpleXML elements, so I can't comment on which I prefer for XML files.
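To illustrate the difference, here is a small sketch (my own, using a namespace-free sitemap fragment to keep it simple) that parses the same string both ways:

```php
<?php
// Same XML parsed two ways: SimpleXML exposes children as object
// properties, DOMDocument exposes a DOM tree you query by tag name.
$raw = '<urlset>'
     .   '<url><loc>http://localhost/test/url_1a.html</loc></url>'
     .   '<url><loc>http://localhost/test/url_1b.html</loc></url>'
     . '</urlset>';

// SimpleXML: $sx already *is* the <urlset> root, so iterate ->url directly.
$sx = simplexml_load_string($raw);
$simple_urls = [];
foreach ($sx->url as $url) {
    $simple_urls[] = (string) $url->loc;   // cast SimpleXMLElement to string
}

// DOMDocument: loadXML() takes a raw XML string, then query by tag name.
$dom = new DOMDocument();
$dom->loadXML($raw);
$dom_urls = [];
foreach ($dom->getElementsByTagName('loc') as $loc) {
    $dom_urls[] = $loc->textContent;
}

var_dump($simple_urls === $dom_urls); // bool(true): both find the two URLs
```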

I’m sure he will like that it’s procedural :)

commented: You are 100% right there! Checking it out now.

Gurus,

I do not know why my crawler is failing to crawl pages on my localhost XAMPP.
The crawl-initiating URL is: http://localhost/test/0.xml
Now, look at the contents of the 0.xml file ....

Contents of all the files the crawler is trying to spider:
0.xml

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://localhost/test/1.xml</loc>
        <lastmod>2023-05-22T08:33:17+00:00</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://localhost/test/2.xml</loc>
        <lastmod>2023-05-22T08:33:17+00:00</lastmod>
    </sitemap>
</sitemapindex>

1.xml

<?xml version="1.0" encoding="UTF-8"?> 
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://localhost/test/1a.xml</loc>
        <lastmod>2023-05-22T08:33:17+00:00</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://localhost/test/1b.xml</loc>
        <lastmod>2023-05-22T08:33:17+00:00</lastmod>
    </sitemap>
</sitemapindex>

1a.xml

<?xml version="1.0" encoding="UTF-8"?>

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

   <url>

      <loc>http://localhost/test/url_1a.html</loc>

      <lastmod>2023-05-22</lastmod>

      <changefreq>hourly</changefreq>

      <priority>0.8</priority>

   </url>
   <url>

      <loc>http://localhost/test/url_1b.html</loc>

      <lastmod>2023-05-22</lastmod>

      <changefreq>hourly</changefreq>

      <priority>0.8</priority>

   </url>

</urlset> 

1b.xml

<?xml version="1.0" encoding="UTF-8"?>

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

   <url>

      <loc>http://localhost/test/url_1c.html</loc>

      <lastmod>2023-05-22</lastmod>

      <changefreq>hourly</changefreq>

      <priority>0.8</priority>

   </url>
   <url>

      <loc>http://localhost/test/url_1d.html</loc>

      <lastmod>2023-05-22</lastmod>

      <changefreq>hourly</changefreq>

      <priority>0.8</priority>

   </url>

</urlset> 

2.xml

<?xml version="1.0" encoding="UTF-8"?> 
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://localhost/test/2a.xml</loc>
        <lastmod>2023-05-22T08:33:17+00:00</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://localhost/test/2b.xml</loc>
        <lastmod>2023-05-22T08:33:17+00:00</lastmod>
    </sitemap>
</sitemapindex>

2a.xml

<?xml version="1.0" encoding="UTF-8"?>

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

   <url>

      <loc>http://localhost/test/url_2a.html</loc>

      <lastmod>2023-05-22</lastmod>

      <changefreq>hourly</changefreq>

      <priority>0.8</priority>

   </url>
   <url>

      <loc>http://localhost/test/url_2b.html</loc>

      <lastmod>2023-05-22</lastmod>

      <changefreq>hourly</changefreq>

      <priority>0.8</priority>

   </url>

</urlset> 

2b.xml

<?xml version="1.0" encoding="UTF-8"?>

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

   <url>

      <loc>http://localhost/test/url_2c.html</loc>

      <lastmod>2023-05-22</lastmod>

      <changefreq>hourly</changefreq>

      <priority>0.8</priority>

   </url>
   <url>

      <loc>http://localhost/test/url_2d.html</loc>

      <lastmod>2023-05-22</lastmod>

      <changefreq>hourly</changefreq>

      <priority>0.8</priority>

   </url>

</urlset> 

url_1a.html

<html>
 <head>
  <meta charset="UTF-8">
  <meta name="description" content="Content 1a">
  <meta name="keywords" content="kw_01a,kw_1a">
  <meta name="author" content="John Doe 1">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
</body>
</html>

url_1b.html

<html>
 <head>
  <meta charset="UTF-8">
  <meta name="description" content="Content 1b">
  <meta name="keywords" content="kw_01b,kw_1b">
  <meta name="author" content="John Doe 1">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
</body>
</html>

url_1c.html

<html>
 <head>
  <meta charset="UTF-8">
  <meta name="description" content="Content 1c">
  <meta name="keywords" content="kw_01c,kw_1c">
  <meta name="author" content="John Doe 1">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
</body>
</html>

url_1d.html

<html>
 <head>
  <meta charset="UTF-8">
  <meta name="description" content="Content 1d">
  <meta name="keywords" content="kw_1,kw_d">
  <meta name="author" content="John Doe 1">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
</body>
</html>

url_2a.html

<html>
 <head>
  <meta charset="UTF-8">
  <meta name="description" content="Content 2a">
  <meta name="keywords" content="kw_2,kw_a">
  <meta name="author" content="John Doe 2">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
</body>
</html>

url_2b.html

<html>
 <head>
  <meta charset="UTF-8">
  <meta name="description" content="Content 2b">
  <meta name="keywords" content="kw_2,kw_b">
  <meta name="author" content="John Doe 2">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
</body>
</html>

url_2c.html

<html>
 <head>
  <meta charset="UTF-8">
  <meta name="description" content="Content 2c">
  <meta name="keywords" content="kw_2,kw_c">
  <meta name="author" content="John Doe 2">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
</body>
</html>

url_2d.html

<html>
 <head>
  <meta charset="UTF-8">
  <meta name="description" content="Content 2d">
  <meta name="keywords" content="kw_2,kw_d">
  <meta name="author" content="John Doe 2">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
</body>
</html>

Do let me know why the crawling fails.
Checking above, you will see that the XML files have the right XML tags and the HTML files have the right HTML tags, so I do not understand why the crawling fails.

And, if I replace this line:

$result = @$dom->loadXML($xml); //LINE: 42

with this:

$dom->loadXML($xml); //LINE: 42

I get the following echoed, along with an error.
I have bolded the error ...

( ! ) Warning: DOMDocument::loadXML(): Start tag expected, '&lt;' not found in Entity, line: 4 in C:\wamp64\www\test\crawler_Test.php on line 42
Call Stack
# Time Memory Function Location
1 0.0032 361856 {main}( ) ...\crawler_Test.php:0
2 0.0121 363976 loadXML( $source = class SimpleXMLElement { public $sitemap = [0 => class SimpleXMLElement { ... }, 1 => class SimpleXMLElement { ... }] } ) ...\crawler_Test.php:42

48
70
SiteMaps Crawled: ---

Array ( )
Html Pages Crawled: ---

Array ( )
Array ( )
Array ( )
Array ( )
49
Array ( )
Array ( )
Array ( )

Why am I getting this XML tag error when the crawled pages have no XML tags missing? See the XML code above for yourself.
What is going on here? This is very fishy, as I have been going round in circles for nearly a week, unable to fix this issue no matter what I try.
Note the echoed content above and the empty Array ( ) output. It means nothing is getting extracted into the arrays: no XML or HTML links.

And, even if I suppress the error by replacing this:

$dom->loadXML($xml); //LINE: 42

to this:

$result = @$dom->loadXML($xml); //LINE: 42

No luck! Same result: the arrays are empty, holding no extracted data.
I am puzzled!

The crawler:

<?php

ini_set('display_errors',1);
ini_set('display_startup_errors',1);
error_reporting(E_ALL);

//START OF SCRIPT FLOW.

//Preparing Crawler & Session: Initialising Variables.

//Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
//SiteMaps Details Scraped from SiteMaps or Xml Files.
$sitemaps  = []; //This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods  = []; //This will list dates of SiteMap pages last modified - found on Sitemaps.
$sitemaps_change_freqs  = []; //This will list SiteMap page update frequencies - found on Sitemaps.
$sitemaps_priorities  = []; //This will list SiteMap pages priorities - found on Sitemaps.

//Webpage Details Scraped from SiteMaps or Xml Files.
$html_page_urls  = []; //This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).
$html_page_last_mods  = []; //This will list dates of html pages last modified - found on Sitemap.
$html_page_change_freqs  = []; //This will list html page update frequencies - found on Sitemaps.
$html_page_priorities  = []; //This will list html pages priorities - found on Sitemaps.

//Preparing $ARRAYS For Step 2: To Deal with html pages meant for Human Visitors only.
//Data Scraped from Html Files. Not Xml SiteMap Files.
$html_page_meta_names  = []; //This will list crawled pages Meta Tag Names - found on html pages.
$html_page_meta_descriptions  = []; //This will list crawled pages Meta Tag Descriptions - found on html pages.
$html_page_titles  = []; //This will list crawled pages Titles - found on html pages.
// -----

//Step 1: Initiate Session - Feed Xml SiteMap Url. Crawling Starting Point.
//Crawl Session Starting Page/Initial Xml Sitemap. (NOTE: Has to be .xml Sitemap).
$initial_url = "http://localhost/test/0.xml";

//$xmls = file_get_contents($initial_url); //Should I stick to this line or below line ?
//Parse the sitemap content to object
//$xml = simplexml_load_string($xmls); //Should I stick to this line or above line ?
$xml = simplexml_load_string(file_get_contents($initial_url)); //Code from Dani: https://www.daniweb.com/programming/web-development/threads/540168/what-to-lookout-for-to-prevent-crawler-traps

$dom = new DOMDocument();
$dom->loadXML($xml); //LINE: 42
//$result = @$dom->loadXML($xml); //LINE: 42

echo __LINE__; echo '<br>'; //LINE: 45

extract_links($xml);

echo __LINE__; echo '<br>';  //LINE: 4

foreach($sitemaps AS $sitemap)
{
    echo __LINE__; echo '<br>';
    extract_links($sitemap); //Extract Links on page.
}

foreach($html_page_urls AS $html_page_url)
{
    echo __LINE__; echo '<br>';
    scrape_page_data($html_page_url); //Scrape Page Title & Meta Tags.
}

//END OF SCRIPT FLOW.

//FUNCTIONS BEYOND THIS POINT.

//Links Extractor.
function extract_links()
{
    echo __LINE__; echo '<br>';  //LINE: 73

    GLOBAL $dom;
    //Trigger following IF/ELSEs on each Crawled Page to check for link types. Whether Links lead to more SiteMaps (.xml) or webpages (.html, .htm, .php, etc.).
    if ($dom->nodeName === 'sitemapindex')  //Current Xml SiteMap Page lists more Xml SiteMaps. Lists links to Xml links. Not lists links to html links.
    {
        echo __LINE__; echo '<br>';

        //parse the index
        // retrieve properties from the sitemap object
        foreach ($xml->sitemapindex as $urlElement) //Extracts xml file urls.
        {
            // get properties
            $sitemaps[] = $sitemap_url = $urlElement->loc;
            $sitemaps_last_mods[] = $last_mod = $urlElement->lastmod;
            $sitemaps_change_freqs[] = $change_freq = $urlElement->changefreq;
            $sitemaps_priorities[] = $priority = $urlElement->priority;

            // print out the properties
            echo 'url: '. $sitemap_url . '<br>';
            echo 'lastmod: '. $last_mod . '<br>';
            echo 'changefreq: '. $change_freq . '<br>';
            echo 'priority: '. $priority . '<br>';

            echo '<br>---<br>';
        }
    } 
    else if ($dom->nodeName === 'urlset')  //Current Xml SiteMap Page lists no more Xml SiteMap links. Lists only html links.
    {
        echo __LINE__; echo '<br>';

        //parse url set
        // retrieve properties from the sitemap object
        foreach ($xml->urlset as $urlElement) //Extracts Sitemap Urls.
        {
            // get properties
            $html_page_urls[] = $html_page_url = $urlElement->loc;
            $html_page_last_mods[] = $last_mod = $urlElement->lastmod;
            $html_page_change_freqs[] = $change_freq = $urlElement->changefreq;
            $html_page_priorities[] = $priority = $urlElement->priority;

            // print out the properties
            echo 'url: '. $html_page_url . '<br>';
            echo 'lastmod: '. $last_mod . '<br>';
            echo 'changefreq: '. $change_freq . '<br>';
            echo 'priority: '. $priority . '<br>';

            echo '<br>---<br>';
        }
    } 

    GLOBAL $sitemaps;
    GLOBAL $sitemaps_last_mods;
    GLOBAL $sitemaps_change_freqs;
    GLOBAL $sitemaps_priorities;

    GLOBAL $html_page_urls;
    GLOBAL $html_page_last_mods;
    GLOBAL $html_page_change_freqs;
    GLOBAL $html_page_priorities;

    echo 'SiteMaps Crawled: ---'; echo '<br><br>'; 
    if(array_count_values($sitemaps)>0)
    {   
        print_r($sitemaps);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_last_mods)>0)
    {   
        print_r($sitemaps_last_mods);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_change_freqs)>0)
    {   
        print_r($sitemaps_change_freqs);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_priorities)>0)
    {   
        print_r($sitemaps_priorities);
        echo '<br><br>'; 
    }

    echo 'Html Pages Crawled: ---'; echo '<br><br>'; 

    if(array_count_values($html_page_urls)>0)
    {   
        print_r($html_page_urls);
        echo '<br>';
    }
    if(array_count_values($html_page_last_mods)>0)
    {   
        print_r($html_page_last_mods);
        echo '<br>';
    }
    if(array_count_values($html_page_change_freqs)>0)
    {   
        print_r($html_page_change_freqs);
        echo '<br>';
    }
    if(array_count_values($html_page_priorities)>0)
    {   
        print_r($html_page_priorities);
        echo '<br>';
    }
}

//Meta Data & Title Extractor.
function scrape_page_data()
{
    GLOBAL $html_page_urls;
    if(array_count_values($html_page_urls)>0)
    {       
        foreach($html_page_urls AS $url)
        {
            // https://www.php.net/manual/en/function.file-get-contents
            $html = file_get_contents($url);

            //https://www.php.net/manual/en/domdocument.construct.php
            $doc = new DOMDocument();

            // https://www.php.net/manual/en/function.libxml-use-internal-errors.php
            libxml_use_internal_errors(true);

            // https://www.php.net/manual/en/domdocument.loadhtml.php
            $doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

            // https://www.php.net/manual/en/function.libxml-clear-errors.php
            libxml_clear_errors();

            // https://www.php.net/manual/en/domdocument.getelementsbytagname.php
            $meta_tags = $doc->getElementsByTagName('meta');

            // https://www.php.net/manual/en/domnodelist.item.php
            if ($meta_tags->length > 0)
            {
                // https://www.php.net/manual/en/class.domnodelist.php
                foreach ($meta_tags as $tag)
                {
                    // https://www.php.net/manual/en/domnodelist.item.php
                    echo 'Meta Name: ' .$meta_name = $tag->getAttribute('name'); echo '<br>';
                    echo 'Meta Content: ' .$meta_content = $tag->getAttribute('content');  echo '<br>';
                    $html_page_meta_names[] = $meta_name;
                    $html_page_meta_descriptions[] = $meta_content;
                }
            }

            //EXAMPLE 1: Extract Title
            $title_tag = $doc->getElementsByTagName('title');
            if ($title_tag->length>0)
            {
                echo 'Title: ' .$title = $title_tag[0]->textContent; echo '<br>';
                $html_page_titles[] = $title;
            }

            //EXAMPLE 2: Extract Title
            $title_tag = $doc->getElementsByTagName('title');

            for ($i = 0; $i < $title_tag->length; $i++) {
                echo 'Title: ' .$title = $title_tag->item($i)->nodeValue . "\n";
                $html_page_titles[] = $title;
            }
        }
    }
}

// array_count_values() returns an array, so comparing it against 0 is not a
// meaningful emptiness test; !empty() checks what was intended and also avoids
// an undefined-variable warning when nothing was collected.
if (!empty($html_page_meta_names))
{
    print_r($html_page_meta_names);
    echo '<br>';
}

if (!empty($html_page_meta_descriptions))
{
    print_r($html_page_meta_descriptions);
    echo '<br>';
}

if (!empty($html_page_titles))
{
    print_r($html_page_titles);
    echo '<br>';
}

//END OF FUNCTIONS.


?>
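The loop above prints every `<meta>` tag on the page. If only one specific tag is needed (the description, say), DOMXPath can target it directly instead of iterating. A minimal sketch; the HTML string here is a made-up stand-in for a fetched page:

```php
<?php
// Sketch: pull a single named <meta> tag with DOMXPath instead of looping
// over every tag. The HTML below is a stand-in for a downloaded page.
$html = '<html><head><title>Demo</title>'
      . '<meta name="description" content="A demo page."></head><body></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html, LIBXML_COMPACT | LIBXML_NOERROR | LIBXML_NOWARNING);
libxml_clear_errors();

// https://www.php.net/manual/en/class.domxpath.php
$xpath = new DOMXPath($doc);

// Matches <meta> elements whose name attribute equals "description".
$nodes = $xpath->query('//meta[@name="description"]');

if ($nodes->length > 0) {
    echo $nodes->item(0)->getAttribute('content'); // A demo page.
}
```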

ISSUE RESOLVED.

I replaced this:

$dom->loadXML($xml);

with this:

$dom->loadXML($xml->asXML());

with the help of Bing AI and ChatGPT, trying out both.
It's working now. No more errors.
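For anyone following along, here is a self-contained sketch of why that change works: DOMDocument::loadXML() expects an XML *string*, while simplexml_load_file() returns a SimpleXMLElement *object*, so the object has to be serialized back to a string with asXML() first. The sitemap snippet below is made up for illustration:

```php
<?php
// simplexml_load_string()/simplexml_load_file() return a SimpleXMLElement
// object, not a string, so passing it straight to loadXML() fails.
$xml = simplexml_load_string('<urlset><url><loc>https://example.com/</loc></url></urlset>');

$dom = new DOMDocument();
$dom->loadXML($xml->asXML()); // works: asXML() serializes the object back to an XML string

echo $dom->documentElement->tagName; // urlset
```

An alternative that avoids re-parsing entirely is dom_import_simplexml(), which maps a SimpleXMLElement onto a DOMElement directly.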

Full code:

<?php
ini_set('display_errors', 1);
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);

// Preparing Crawler & Session: Initializing Variables.

// Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
// SiteMaps Details Scraped from SiteMaps or Xml Files.
$sitemaps = []; // This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods = []; // This will list dates of SiteMap pages last modified - found on Sitemaps.
$sitemaps_change_freqs = []; // This will list how frequently each SiteMap is updated - found on Sitemaps.
$sitemaps_priorities = []; // This will list SiteMap pages priorities - found on Sitemaps.

// Webpage Details Scraped from SiteMaps or Xml Files.
$html_page_urls = []; // This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).
$html_page_last_mods = []; // This will list dates of html pages last modified - found on Sitemap.
$html_page_change_freqs = []; // This will list dates of html pages frequencies of page updates - found on Sitemaps.
$html_page_priorities = []; // This will list html pages priorities - found on Sitemaps.

// Step 1: Initiate Session - Feed Xml SiteMap URL. Crawling Starting Point.
$initial_url = "http://localhost/Work/buzz/Templates/0.xml";
$xml = simplexml_load_file($initial_url);
$dom = new DOMDocument();
$dom->loadXML($xml->asXML());

echo __LINE__ . '<br>';

crawl_sitemaps($xml);

foreach ($html_page_urls as $html_page_url) {
    echo __LINE__ . '<br>';
    scrape_page_data($html_page_url); // Extract Meta Data and Title from HTML page.
}

// END OF SCRIPT FLOW.

// FUNCTIONS BEYOND THIS POINT.

// Crawl SiteMaps.
function crawl_sitemaps($xml)
{
    // Every array written below must be declared global; otherwise the
    // lastmod/changefreq/priority lists stay local to this call and are lost.
    global $sitemaps, $sitemaps_last_mods, $sitemaps_change_freqs, $sitemaps_priorities;
    global $html_page_urls, $html_page_last_mods, $html_page_change_freqs, $html_page_priorities;

    if ($xml->getName() === 'sitemapindex') {
        foreach ($xml->sitemap as $urlElement) {
            $sitemaps[] = $sitemap_url = (string)$urlElement->loc;
            $sitemaps_last_mods[] = $last_mod = (string)$urlElement->lastmod;
            $sitemaps_change_freqs[] = $change_freq = (string)$urlElement->changefreq;
            $sitemaps_priorities[] = $priority = (string)$urlElement->priority;

            echo 'sitemap_url: ' . $sitemap_url . '<br>';
            echo 'last_mod: ' . $last_mod . '<br>';
            echo 'change_freq: ' . $change_freq . '<br>';
            echo 'priority: ' . $priority . '<br>';

            echo '<br>---<br>';

            $sitemap_xml = simplexml_load_file($sitemap_url);
            if ($sitemap_xml !== false) {
                crawl_sitemaps($sitemap_xml); // Recursively crawl nested sitemaps.
            }
        }
    } elseif ($xml->getName() === 'urlset') {
        foreach ($xml->url as $urlElement) {
            $html_page_urls[] = $html_page_url = (string)$urlElement->loc;
            $html_page_last_mods[] = $last_mod = (string)$urlElement->lastmod;
            $html_page_change_freqs[] = $change_freq = (string)$urlElement->changefreq;
            $html_page_priorities[] = $priority = (string)$urlElement->priority;

            echo 'html_page_url: ' . $html_page_url . '<br>';
            echo 'last_mod: ' . $last_mod . '<br>';
            echo 'change_freq: ' . $change_freq . '<br>';
            echo 'priority: ' . $priority . '<br>';

            echo '<br>---<br>';
        }
    }

    // Note: this summary runs on every recursive call, so it prints once per sitemap crawled.
    echo 'SiteMaps Crawled: ---<br><br>';
    print_r($sitemaps);
    echo '<br><br>';

    echo 'HTML Pages Crawled: ---<br><br>';
    print_r($html_page_urls);
    echo '<br><br>';
}

// Meta Data & Title Extractor.
function scrape_page_data($html_page_url)
{
    $html = file_get_contents($html_page_url);
    if ($html === false) {
        return; // Skip pages that failed to download.
    }

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html, LIBXML_COMPACT | LIBXML_NOERROR | LIBXML_NOWARNING);
    libxml_clear_errors();

    $meta_tags = $doc->getElementsByTagName('meta');
    if ($meta_tags->length > 0) {
        foreach ($meta_tags as $tag) {
            // Assign first, then echo: with the assignment embedded in the
            // echo, precedence appended '<br>' to $meta_name/$meta_content too.
            $meta_name = $tag->getAttribute('name');
            $meta_content = $tag->getAttribute('content');
            echo 'Meta Name: ' . $meta_name . '<br>';
            echo 'Meta Content: ' . $meta_content . '<br>';
        }
    }

    $title_tag = $doc->getElementsByTagName('title');
    if ($title_tag->length > 0) {
        $title = $title_tag->item(0)->textContent;
        echo 'Title: ' . $title . '<br>';
    }
}
?>
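As a design note, crawl_sitemaps() and scrape_page_data() above communicate through global arrays and echo as they go. A minimal sketch of the same extraction returning its results instead; the function name scrape_page_meta() is mine, not from the thread, and it takes an HTML string so it can run without fetching a live page:

```php
<?php
// Hypothetical variant of scrape_page_data() that returns the scraped values
// rather than echoing them and writing to globals.
function scrape_page_meta($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html, LIBXML_COMPACT | LIBXML_NOERROR | LIBXML_NOWARNING);
    libxml_clear_errors();

    $result = ['title' => null, 'meta' => []];

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $result['title'] = $titles->item(0)->textContent;
    }

    // Key each meta tag's content by its name attribute.
    foreach ($doc->getElementsByTagName('meta') as $tag) {
        $result['meta'][$tag->getAttribute('name')] = $tag->getAttribute('content');
    }

    return $result;
}

$data = scrape_page_meta(
    '<html><head><title>Demo</title>'
    . '<meta name="description" content="A demo page."></head><body></body></html>'
);
print_r($data);
```

Returning an array keeps the function reusable (for saving to a database, say) and leaves the printing to the caller.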

@dani

You may close this thread.
