
I'm building a large XML-to-MySQL data importer/parser that pulls XML files from an external server.

I need an example of what to do with big files. This is what I have now:

$xml = simplexml_load_file('plane.xml');
if ($xml) {
	for ($lvl1 = 0; $lvl1 < count($xml->Trip); $lvl1++) {
		for ($lvl3 = 0; $lvl3 < count($xml->Trip[$lvl1]->Hotels->Hotel); $lvl3++) {
			// Fill an array with the travel data
			$element['trip_id'] 	= (string)$xml->Trip[$lvl1]['id'];
			$element['description'] = (string)$xml->Trip[$lvl1]->Description;
			$element['hotel_name'] 	= (string)$xml->Trip[$lvl1]->Hotels->Hotel[$lvl3]->HotelName;
			$element['hotel_stars'] = (string)$xml->Trip[$lvl1]->Hotels->Hotel[$lvl3]->HotelStars;
		}
	}
}

Everything works fine when I import a 1-3 MB XML file.

The problem is that I have to import five or six ~80-90 MB XML files into SQL. So, first of all, the website lags while it is downloading the file, and the server gets overloaded when the whole file is cached in memory, as I understand it?

The website I have to import the XML from also provides .xml.gz files, which are only ~2-3 MB each instead of ~80-90 MB.

But I don't know how to rewrite my script for that, and could use an example of reading even one object from a .gz file.
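Something like this is what I'm imagining - a rough sketch (the URL and file names are made up) that fetches the remote .xml.gz and stream-decompresses it straight to a local file, so the full 80-90 MB document never has to sit in memory:

```php
<?php
// Sketch: fetch a remote .xml.gz and decompress it to disk on the fly
// using the zlib stream wrapper. URL and paths here are made up.
$src = fopen('compress.zlib://http://example.com/feeds/plane.xml.gz', 'rb');
$dst = fopen('/tmp/plane.xml', 'wb');

// Copies in chunks; the wrapper decompresses as the data streams through,
// so memory use stays small regardless of file size.
stream_copy_to_stream($src, $dst);

fclose($src);
fclose($dst);
```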

Php.net provides some .gz decompression examples (reading a given number of bytes, or line by line as strings):

<?php
// Get the contents of a gz file into a string.
// Note: gzread() returns at most the requested number of bytes,
// so loop until EOF to read the whole file.
$filename = "/usr/local/something.txt.gz";
$zd = gzopen($filename, "r");
$contents = '';
while (!gzeof($zd)) {
    $contents .= gzread($zd, 10000);
}
gzclose($zd);
?>

I also found an example of partial reading on IBM.com:

// set up the request
    $request = $HTTP_RAW_POST_DATA;
    error_reporting(E_ERROR | E_WARNING | E_PARSE);
    // create the reader
    $reader = new XMLReader();
    $reader->setRelaxNGSchema("request.rng");
    $reader->XML($request);

    $input = "";
    while ($reader->read()) {
      if ($reader->name == "double" && $reader->nodeType == XMLReader::ELEMENT) {

          while ($reader->read()) {
            if ($reader->nodeType == XMLReader::TEXT
              || $reader->nodeType == XMLReader::CDATA
              || $reader->nodeType == XMLReader::WHITESPACE
              || $reader->nodeType == XMLReader::SIGNIFICANT_WHITESPACE) {
               $input .= $reader->value;
            }
            else if ($reader->nodeType == XMLReader::END_ELEMENT
              && $reader->name == "double") {
                break;
            }
          } 
          break;
      }
    } 
    $reader->close();

But how do I make that work with my code, and what should the source look like if I need to read an xml.gz file?

Thanks for any help!


I've written some code that manages to parse the ~80 MB XML file:

$counter2 = $counter3 = 0;
$element = array();
$reader = new XMLReader();
$reader->open('prices.xml');
while ($reader->read()) {
	// XMLReader::read() never reports ATTRIBUTE nodes on its own, and the
	// original switch relied on case fall-through, so both checks belong
	// under the ELEMENT node type.
	if ($reader->nodeType == XMLReader::ELEMENT) {
		if ($reader->depth == 1 && $reader->localName == "Hotel") {
			// Remember the id of the Hotel element we are currently inside
			$element['hotel_id'] = $reader->getAttribute("id");
			$counter2++;
		}
		if ($reader->depth == 3 && $reader->localName == "Price") {
			// Expand just this small node into DOM, then SimpleXML
			$node = $reader->expand();
			$dom = new DomDocument();
			$child = $dom->importNode($node, true);
			$dom->appendChild($child);
			$simple_xml = simplexml_import_dom($child);
			$element['price_id'] = (string)$simple_xml['id'];
			$element['date']     = (string)$simple_xml->Date;
			$element['price']    = (string)$simple_xml->Value;
			$counter3++;
			// NB: values should be escaped before being interpolated into SQL
			do_query("INSERT INTO prices (hotel_id,price_id,date,price) VALUES ('".$element['hotel_id']."','".$element['price_id']."', '".$element['date']."', '".$element['price']."')");
		}
	}
}
$reader->close();
echo "<br />ACTS: $counter2 $counter3";

But the problem is THAT I HAD TO edit php.ini and raise max_execution_time to 120 - which means two minutes of a fully loaded CPU. Thank goodness that with this kind of code (using a pull parser instead of caching everything in RAM) it doesn't need ~1 GB or more of memory, but, as I said, it's not normal that I have to raise max_execution_time.

I also had the idea of writing some AJAX code and keeping the XML file pointer alive between requests, which would mean the page dynamically reloads ~30 times, with no single long execution.

But first of all, that would still have to be programmed - and second: IS THAT THE ONLY WAY TO SOLVE THIS PROBLEM???

Because I'll need to write a script that parses FIVE ~80 MB files - about ~400 MB total - and it fully loads my server for ~12-15 minutes to finish the process, even if it's written with AJAX.

Is there any way to make this run as a background process on the server (using no more than 10-15% of the VPS's CPU)?

Or maybe there are other solutions to this problem?


Try benchmarking the xml parsing process:

You can simply use:

$start = microtime(true);
// ... do a few parsing iterations here ...
$lapse = microtime(true) - $start;

and:

$mem = memory_get_usage(true);

To get the memory used and execution time. Stop after just a few loops so you can analyze the numbers.
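Put together, a throwaway benchmark could look something like this (the loop body is just a stand-in for your parsing code):

```php
<?php
// Minimal benchmarking sketch: measure wall time and memory for a
// bounded number of iterations, then print the numbers and stop.
$start = microtime(true);

for ($i = 0; $i < 1000; $i++) {
    // ... a few iterations of the real parsing loop would go here ...
    $dummy = str_repeat('x', 100); // stand-in work
}

$lapse = microtime(true) - $start; // seconds elapsed
$mem   = memory_get_usage(true);   // bytes currently allocated by PHP

printf("time: %.4f s, memory: %d bytes\n", $lapse, $mem);
```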

How well does XMLReader manage memory?

The other thing is the overhead of importing XMLReader nodes into DomDocument and then into SimpleXML. I'm sure you can do everything in XMLReader, so you avoid that overhead.

XMLReader shouldn't use a lot of memory, but, if it does, you may try "XML Parser" and compare performance. http://www.php.net/manual/en/book.xml.php

As for reading directly from gzip, you can do that with the gzip stream wrapper, compress.zlib://, but I think that could affect performance adversely.
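For example (a quick sketch, file name made up), XMLReader can open the compressed file directly through that wrapper, so there's no separate decompression step:

```php
<?php
// XMLReader reading straight from a gzip file via the zlib stream
// wrapper; the file name is made up. Everything inside the loop works
// exactly as it would for an uncompressed file.
$reader = new XMLReader();
$reader->open('compress.zlib://prices.xml.gz');
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT) {
        // handle elements here, same as with the plain .xml version
    }
}
$reader->close();
```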

Is this process being invoked from a PHP page load? If so, then you probably want to use cron instead.
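For instance, a crontab entry along these lines (all paths made up) would run the importer nightly from the CLI, detached from any web request, and `nice` keeps it from hogging the CPU:

```shell
# Run the import script every night at 03:00 via PHP CLI.
# nice -n 19 gives it the lowest scheduling priority so the VPS
# stays responsive; output goes to a log file instead of a browser.
0 3 * * * nice -n 19 /usr/bin/php /var/www/import_prices.php >> /var/log/import.log 2>&1
```

The CLI also has no max_execution_time limit by default, which sidesteps the php.ini problem mentioned above.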
