
Hi,
I have an XML file that has the following HTML appended to the end of it:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

    <html xmlns="http://www.w3.org/1999/xhtml" >
    <head><title>

    </title></head>
    <body>
        <form name="form1" method="post" action="GetProductsXML.aspx?username=UASERNAME&amp;password=PASSWORD" id="form1">
    <div>
    <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTE2MTY2ODcyMjlkZC/1D4iGqP0urqyxWR+2OEQ90eHf" />
    </div>

        <div>

        </div>
        </form>
    </body>
    </html>

I am using this function:

    $xml_url = 'http://b2b.domain.com/GetProductsXML.aspx?username=USERNAME&password=PASSWORD';
    $xml = simplexml_load_file(utf8_encode($xml_url), 'SimpleXMLElement', LIBXML_NOCDATA);

How can I filter the extra content out of this XML? When I open it in a web browser, I get an HTML page with text.

Was there an error when you loaded the XML?

There is no error when I load it in the browser, but I get an HTML document instead of XML because of that code at the end of the XML file.
The file is generated with that code at the end, and because of that I cannot read it as XML. I need to strip that part somehow so I can read it as XML, if that is possible.

I think you need to adjust your $xml_url, because it is not letting you read the XML.

My syntax is exactly the same as in the article you posted. The problem is not in reading the XML; it is that this XML has extra HTML in it, as I posted above.
Example:

</ProductDescription>
<ImageLarge>http://domain.com/images/products/KOMNET201_inf.jpg</ImageLarge>
<ImageSmall>http://domain.com/images/products/KOMNET201_kat.jpg</ImageSmall>
<BarCode>6935364052034</BarCode>
<ProducerWebPage>http://www.tp-link.com/en</ProducerWebPage>
<ProductWebPage>http://www.tp-link.com/en/products/prodetail.aspx?mid=0103030106&amp;id=541</ProductWebPage>
<Warranty>12 mj.</Warranty><CategoryName>Antene i dodatna oprema</CategoryName>
<ParentCategoryName>Mrežna oprema</ParentCategoryName>
<RowNumber>436</RowNumber><NetoPrice>95,93</NetoPrice>
<ProductDescriptionShort>ohms nominal, VSWR: 1.92 max., cable 1m, SMA</ProductDescriptionShort>
<AvailableQuantity>0</AvailableQuantity>
<InfoWindowLink>http://domain.com/ProductInfo.aspx?ProductID=53299</InfoWindowLink>
<Producer>TP-LINK</Producer></Product><Product>
<IsActiveRetail>true</IsActiveRetail>
<SortOrderRetail>16506</SortOrderRetail>
<SortOrderHomePageRetail>100</SortOrderHomePageRetail>
<ProductID>370770</ProductID>
<ProductCode>KOMNET272</ProductCode>
<ProductName>ANTENA TL-ANT2412D</ProductName>
<ProductDescription />



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" >
<head><title>

</title></head>
<body>
    <form name="form1" method="post" action="GetProductsXML.aspx?username=domain.com&amp;password=89" id="form1">
<div>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTE2MTY2ODcyMjlkZOOJeh0Tms5Udbf1jSVwRpTz4gUg" />
</div>

    <div>

    </div>
    </form>
</body>
</html>

How can I exclude the HTML from the XML?

You mentioned that you can't read the XML, but now you can?

You just don't want the HTML tags to appear?

I don't get it.

The XML file is a separate file.
The HTML file reads the XML.
You don't put XML and HTML together in one file.

The problem is that this is the "XML" that was given to me, but it has HTML tags in it, so I cannot read it as XML. I need to find a way to exclude those tags when reading this so-called XML.

This:

$xml_url= 'http://b2b.domain.com/GetProductsXML.aspx?username=USERNAME&password=PASSWORD';
$xml = simplexml_load_file(utf8_encode($xml_url), 'SimpleXMLElement', LIBXML_NOCDATA);

Take all of that out and use only:

$xml = simplexml_load_file('GetProductsXML.xml');

I want to know whether you can load GetProductsXML.xml without any issue.

If you can then there's no issue with reading the file.

Then the issue has something to do with this:

$xml_url= 'http://b2b.domain.com/GetProductsXML.aspx?username=USERNAME&password=PASSWORD';

If there is an issue reading GetProductsXML.xml, that tells you the problem is with the file itself.

That is the problem: it cannot be read as XML because of the extra HTML data in it.

Honestly, I'm not sure. The person who built it said that it is OK and that it should look like that :). For him this is good.
This is what I have, and I have to find a way to deal with it :)

I know that, but I cannot change anything in this case. Can I just remove it, if possible?

I would, but you could, however, read the file into a string, remove the HTML part, and pass the remainder to simplexml_load_string().
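
A minimal sketch of that idea, assuming the unwanted part always begins with "<!DOCTYPE html" and that everything before it is complete, well-formed XML. The function name loadXmlWithoutTrailingHtml, the <Products> root element, and the sample string are made up for illustration; they are not taken from the real feed.

```php
<?php
// Hypothetical helper: keep only the part of the response that comes
// before the trailing "<!DOCTYPE html" block, then parse it as XML.
function loadXmlWithoutTrailingHtml(string $raw)
{
    // Assumption: the appended HTML always starts with "<!DOCTYPE html".
    $pos = stripos($raw, '<!DOCTYPE html');
    if ($pos !== false) {
        $raw = substr($raw, 0, $pos);
    }
    // Returns a SimpleXMLElement, or false if the remainder is not well-formed.
    return simplexml_load_string(trim($raw), 'SimpleXMLElement', LIBXML_NOCDATA);
}

// Made-up sample standing in for the real response.
$sample = '<Products><Product><ProductID>370770</ProductID></Product></Products>'
        . "\n\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\">\n"
        . "<html><head><title></title></head><body></body></html>";

$xml = loadXmlWithoutTrailingHtml($sample);
echo $xml->Product->ProductID, "\n";
```

With the real feed, the string would probably come from file_get_contents($xml_url) instead of $sample. If the function returns false, the part before the DOCTYPE is not valid XML on its own, and cutting off the HTML will not be enough.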

I have tried to strip it out of the string, but with no luck. Something from the HTML is always left over:

return preg_replace('~<(?:!DOCTYPE|/?(?:html|body))[^>]*>\s*~i', '',$retValue);

Also something like this:

$nedozvoljeno1 = array('<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"">      <html xmlns=""http://www.w3.org/1999/xhtml"" >     <head><title>      </title></head>     <body>         <form name=""form1"" method=""post"" action=""GetProductsXML.aspx?username=UASERNAME&amp;password=PASSWORD"" id=""form1"">     <div>     <input type=""hidden"" name=""__VIEWSTATE"" id=""__VIEWSTATE"" value=""/wEPDwULLTE2MTY2ODcyMjlkZC/1D4iGqP0urqyxWR+2OEQ90eHf"" />     </div>          <div>          </div>         </form>     </body>     </html>     <!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"">      <html xmlns=""http://www.w3.org/1999/xhtml"" >     <head><title>      </title></head>     <body>         <form name=""form1"" method=""post"" action=""GetProductsXML.aspx?username=UASERNAME&amp;password=PASSWORD"" id=""form1"">     <div>     <input type=""hidden"" name=""__VIEWSTATE"" id=""__VIEWSTATE"" value=""/wEPDwULLTE2MTY2ODcyMjlkZC/1D4iGqP0urqyxWR+2OEQ90eHf"" />     </div>          <div>          </div>         </form>     </body>     </html>');

                return str_replace($nedozvoljeno1, "", $retValue);

Maybe some new ideas?
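
One thing to note about the preg_replace() attempt: that pattern only matches the DOCTYPE and the html and body tags, so the head, title, form, div and input tags are left behind. A rough alternative, assuming the HTML is always appended after the XML and starts with a DOCTYPE, is to delete everything from the DOCTYPE to the end of the string in one go. The function name stripTrailingHtml and the sample data here are hypothetical.

```php
<?php
// Hypothetical helper: remove the DOCTYPE and everything that follows it.
function stripTrailingHtml(string $raw): string
{
    // "s" lets "." match newlines, "i" ignores case,
    // and "\z" anchors the match at the very end of the string.
    $clean = preg_replace('~<!DOCTYPE\s+html\b.*\z~is', '', $raw);
    // preg_replace() returns null on error, so cast before trimming.
    return trim((string) $clean);
}

// Made-up sample: a small XML document followed by an HTML page.
$sample = "<Products><Product><ProductCode>KOMNET272</ProductCode></Product></Products>\n"
        . "<!DOCTYPE html PUBLIC \"x\">\n<html><head><title></title></head>\n"
        . "<body><form id=\"form1\"><div></div></form></body></html>\n";

$clean = stripTrailingHtml($sample);
$xml = simplexml_load_string($clean);
echo $xml->Product->ProductCode, "\n";
```

Because the match runs to the end of the string, it does not matter how many HTML blocks follow or how their whitespace is laid out, which is where an exact str_replace() match tends to break.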
