
I have a CSV that I can parse no problem at 10am.

But by 5pm it grows to over 100,000 lines, which causes the script to hang.

I have to parse the CSV into an array that I can then shape.

If anyone has any idea how I can parse this without the timeouts, or a better solution than fgetcsv, that would be appreciated.

thank you,

Below is around where it usually starts to hang. I'm sure I'm just doing this very inefficiently, and there is probably a better solution.

require_once('../../Connections/dom/simple_html_dom.php');
#url :   http://mis.nyiso.com/public/csv/realtime/20111212realtime_gen.csv
$url = "http://mis.nyiso.com/public/csv/realtime/";  // The simple url
$name = "realtime_gen.csv";                        // The parameter name
$date = date("Ymd");

$newUrl = $url . "$date$name";        // appending the values
 
echo $newUrl;

// begin indexed array

	$data=array();
	$fp = fopen($newUrl, "r"); 
	$names = array_flip(fgetcsv($fp, 5000000)); 
	while (($values = fgetcsv($fp, 5000000)) !== FALSE) { 
	 #   print "The ".$values[$names["Name"]]." says ".$values[$names["Time Stamp"]].".\n"; 
	 $data[]=$values; 
	
	} 
	fclose($fp); 
# ----------- Explode the timestamp into date/time parts
	$theData = array();
	$c = count($data);

	for ($i = 0; $i < $c; $i++) {
		$parts = explode(" ", $data[$i][0]);   // split "date time" on the space
		unset($data[$i][0]);
		$theData[] = array_merge($parts, $data[$i]);
		unset($data[$i]);                      // free each row as we go
	}
	unset($data);

# ----------- Section off generators into their own zone arrays
	$nyc = array();
	$millwood = array();
	$hudvl = array();
	$dunwood = array();

	foreach ($theData as $i => $v) {
		if (preg_match("/^(23512|23513|23515|23516|23517|23518|23519|23520|23523|23524|23533|23534|23535|23538|23540|23541|23657|23660|23729|23770|23786|23810|23816|23817|23820|24077|24078|24079|24080|24084|24094|24095|24096|24097|24098|24099|24100|24101|24102|24103|24104|24105|24106|24107|24108|24110|24111|24112|24113|24114|24115|24116|24117|24118|24119|24120|24121|24122|24123|24124|24125|24126|24127|24128|24129|24130|24131|24132|24133|24134|24135|24136|24137|24138|24149|24152|24155|24156|24157|24158|24159|24160|24161|24162|24163|24168|24195|24196|24202|24225|24226|24227|24228|24229|24230|24231|24232|24233|24234|24235|24236|24237|24238|24239|24240|24241|24242|24243|24244|24245|24246|24247|24248|24249|24250|24251|24252|24253|24254|24255|24256|24257|24258|24259|24260|24261|" .
		"323558|323559|323566|323567|323568|323569|323581|323582|323595|323610|323651|323677|323678|923512|923533|923568|924077|924094|924106|924149|924156|924157|924158|924160|924162|924228)$/", $theData[$i][3]))
		{
			#$nyc[$i][0]="N.Y.C.";
			$nyc[] = array($theData[$i][0], $theData[$i][1], $theData[$i][4]);
		}
		elseif (preg_match("/^(23530|23531|23653|23659|23776|24019|24139|24193|24198|323649)$/", $theData[$i][3]))
		{
			#$millwood[$i][0]="MILWD";
			$millwood[] = array($theData[$i][0], $theData[$i][1], $theData[$i][4]);
		}
		elseif (preg_match("/^(23526|23586|23587|23588|23589|23590|23591|23592|23593|23595|23607|23608|23609|23610|23611|23612|23632|23639|23640|23641|23642|23654|23754|23765|23769|24000|24148|24192|323565|323602|323613|323627|323648|923586|923587)$/", $theData[$i][3]))
		{
			#$hudvl[$i][0]="HUD VL";
			$hudvl[] = array($theData[$i][0], $theData[$i][1], $theData[$i][4]);
		}
		elseif (preg_match("/^(23655|24194|323650)$/", $theData[$i][3]))
		{
			#$dunwood[$i][0]="DUNWOD";
			$dunwood[] = array($theData[$i][0], $theData[$i][1], $theData[$i][4]);
		}
		else {
			unset($theData[$i]);   // row belongs to none of the zones above
		}

	}

There is more to the code, but I guess it doesn't matter if I can't get past this point without a timeout and/or running out of memory. Yes, I've tried increasing the memory limit; it doesn't help.



Where does the data for the CSV come from? Is it user input via a form? Do you generate it manually from a spreadsheet?
How is the CSV uploaded? Or is it created on the server?


What about splitting the file into smaller ones, e.g.
http://stackoverflow.com/questions/5039824/php-script-to-split-large-text-file-into-multiple-files

Then use Ajax to extract each one in turn; using Ajax may prevent a timeout. I'm just reaching here, so I may be well off the mark.
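A rough sketch of that splitting step (untested against the real feed; the function name, chunk size and file-naming pattern are my own):

```php
<?php
// Rough sketch: split a large CSV into numbered chunk files so each one
// can be fetched and parsed separately (e.g. via Ajax). The function
// name, naming pattern and chunk size are illustrative only.
function split_csv($path, $chunkSize, $dir)
{
    $in = fopen($path, 'r');
    $part = 0;
    $count = 0;
    $out = null;
    $files = array();
    while (($line = fgets($in)) !== false) {
        if ($count % $chunkSize === 0) {      // start a new chunk file
            if ($out) fclose($out);
            $files[] = $name = sprintf('%s/part_%03d.csv', $dir, $part++);
            $out = fopen($name, 'w');
        }
        fwrite($out, $line);
        $count++;
    }
    if ($out) fclose($out);
    fclose($in);
    return $files;                            // paths of the chunk files
}
```

Each chunk can then be requested one at a time, so no single request runs long enough to hit the timeout.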



http://mis.nyiso.com/public/csv/realtime/20111212realtime_gen.csv

It's real-time data for all the generators. I then need to group the generators by the proper zone and average them to get the correct number.

What I was thinking is that I don't need all of the data. There are 3 columns I can skip right off the bat. I'm wondering: if I ignore those columns, this should parse faster, no?
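Something like this sketch is what I have in mind: keep only the needed columns while reading, instead of storing whole rows and unsetting fields later (the column indexes in `$keep` are placeholders):

```php
<?php
// Sketch: keep only the wanted columns as each row is read, rather than
// storing whole rows and unsetting fields later. The indexes passed in
// $keep are placeholders -- use whichever columns are actually needed.
function parse_needed_columns($path, array $keep)
{
    $rows = array();
    $fp = fopen($path, 'r');
    fgetcsv($fp);                      // skip the header row
    while (($values = fgetcsv($fp)) !== false) {
        $row = array();
        foreach ($keep as $i) {
            $row[] = $values[$i];      // copy only the columns we keep
        }
        $rows[] = $row;
    }
    fclose($fp);
    return $rows;
}
```

fgetcsv still has to read every field, so the speed gain is modest, but the memory saved per row is roughly proportional to the columns dropped.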

I'm just trying to figure out how to do it :)


I think you're on the right track; the problem is this CSV updates every 5 minutes, and all the intervals are very important if I am to report the data accurately.


DON'T CLICK THE LINK!!!
Being a curious George, I ended up downloading a file in excess of 8 MB!

Anyway, I noticed that the entries were every 5 minutes.
How about running a script (cron job?) every 30 minutes to deal with the file, which then gets emptied?

//STOP

I just noticed entries at 7 and 9 minutes past... so that won't work.

I'm going to bed, but will come back tomorrow. Anybody else?


Lol, sorry about that. Now, it seemed to parse if I unset the values I didn't need before I threw them into an array. I'm wondering if I can just skip those columns altogether and not parse them. In theory this would cut the data being pulled in half, in turn greatly improving the processing time. I'm still a novice and this is all theory, so please correct me if my thought pattern is wrong.


Does the data get checked before inserting into the file? If so, you could give it 30-minute increments:

csv_1000.csv, csv_1030.csv, csv_1100.csv etc.

Then at a predefined time, possibly using a cron job, you could load the data file by file and, once loaded, either delete them or empty them.

The way I see it:

1. Get a feed line - check the timestamp.
2. If the timestamp is between certain times, append to a particular file.
3. At certain times of the day, e.g. 03.00, 06.00, 09.00, 12.00... run a cron job to load the files (03.00 will load 00.00, 00.30, 01.00, 01.30, 02.00, 02.30) etc.

I'm sure the above doesn't fit your idea perfectly, but you get the idea.

Again, this is not something with which I have a great deal of experience, so there may be a much easier solution of which I am unaware.
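Step 2 above might look something like this (a rough sketch; the timestamp format is assumed from the feed, and the naming follows the csv_1000.csv convention):

```php
<?php
// Sketch of step 2: append a raw feed line to a half-hour bucket file
// based on its timestamp. Assumes timestamps like "12/12/2011 09:37:00"
// in the first column, and the csv_HHMM.csv naming suggested above.
function bucket_line($line, $dir)
{
    list(, $time) = explode(' ', $line, 2);   // drop the date part
    list($h, $m)  = explode(':', $time);      // hour and minute
    $half = ((int)$m < 30) ? '00' : '30';     // round down to the half hour
    $file = sprintf('%s/csv_%02d%s.csv', $dir, (int)$h, $half);
    file_put_contents($file, $line . "\n", FILE_APPEND);
    return $file;
}
```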


Yes, this could work. There is a separation of 3 files, where I could run an if statement on the time of day to pick the desired URL to get, then continue to parse. Then at the end of the day, just truncate the table.

I'm also working with another fgetcsv statement; I think the current one I'm using is also creating a problem.

I'll come back in a few after I try this hypothetical solution.


I'm not sure if it is possible in your situation, but if your CSV URL supports fseek, you could theoretically parse the CSV line by line (fgets); you just have to remember at what position you are (ftell).
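A minimal sketch of that idea, assuming the CSV is first saved locally (plain http:// streams are generally not seekable); the function and variable names here are hypothetical:

```php
<?php
// Sketch: resume parsing from where the previous run stopped, so each
// run only handles the lines added since. $offset would be persisted
// (file, DB, session) between runs; names here are hypothetical.
function parse_from_offset($path, $offset)
{
    $f = fopen($path, 'r');
    fseek($f, $offset);                    // skip everything already parsed
    $rows = array();
    while (($line = fgets($f)) !== false) {
        $rows[] = str_getcsv(rtrim($line, "\r\n"));
    }
    $offset = ftell($f);                   // remember where we stopped
    fclose($f);
    return array($rows, $offset);
}
```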


I made a few tests. The first one is faster than the second, but still eats a lot of memory (~170MB in ~14 seconds):

<?php
$file = '20111212realtime_gen.csv';

$f = fopen($file, "r");
$ln= 0;

$nyc = array();
$millwood = array();
$hudv1 = array();
$dunwood = array();


$arr = array('nyc' => array(23512,23513,23515,23516,23517,23518,23519,23520,23523,23524,23533,23534,23535,23538,23540,23541,23657,23660,23729,23770,23786,23810, 23816,23817,23820,24077,24078,24079,24080,24084,24094,24095,24096,24097,24098,24099,24100,24101,24102,24103,24104,24105,24106,24107, 24108,24110,24111,24112,24113,24114,24115,24116,24117,24118,24119,24120,24121,24122,24123,24124,24125,24126,24127,24128,24129,24130, 24131,24132,24133,24134,24135,24136,24137,24138,24149,24152,24155,24156,24157,24158,24159,24160,24161,24162,24163,24168,24195,24196, 24202,24225,24226,24227,24228,24229,24230,24231,24232,24233,24234,24235,24236,24237,24238,24239,24240,24241,24242,24243,24244,24245, 24246,24247,24248,24249,24250,24251,24252,24253,24254,24255,24256,24257,24258,24259,24260,24261,323558,323559,323566,323567,323568,323569,323581,323582,323595,323610,323651,323677,323678,923512,923533,923568,924077,924094,924106,924149,924156,924157,924158,924160,924162,924228),
'millwood' => array(23530,23531,23653,23659,23776,24019,24139,24193,24198,323649),
'hudv1' => array(23526,23586,23587,23588,23589,23590,23591,23592,23593,23595,23607,23608,23609,23610,23611,23612,23632,23639,23640,23641,23642,23654,23754,23765,23769,24000,24148,24192,323565,323602,323613,323627,323648,923586,923587),
'dunwood' => array(23655,24194,323650));

#while($line = fgets($f)) { # both give same results and similar performance
while($line = stream_get_line($f,10000,"\n")) {
    $ln++;
    $b = explode(',',$line);
    
    if(in_array($b[2],$arr['millwood']))
    {
        $millwood[] = array($b[0],$b[1],$b[2],$b[3],$b[4],$b[5]);
    }
    
    elseif(in_array($b[2],$arr['hudv1']))
    {
        $hudv1[] = array($b[0],$b[1],$b[2],$b[3],$b[4],$b[5]);
    }
    
    elseif(in_array($b[2],$arr['dunwood']))
    {
        $dunwood[] = array($b[0],$b[1],$b[2],$b[3],$b[4],$b[5]);
    }
    
    else
    {
        $nyc[] = array($b[0],$b[1],$b[2],$b[3],$b[4],$b[5]);
    }
    
    if($ln == 7500)
    {
        $ln = 0;
        usleep(500000);
    }
}
fclose($f);
echo "done\n";
echo memory_get_peak_usage(); # you can remove this
?>

Here I "split" the parsing sequence so the script can run using less memory (~60MB in ~48 seconds):

<?php
ini_set('memory_limit', '64M'); # you can remove this

function my_parser($f,$codes,$var)
{
    $array = array();
    $ln = 0;
    while($line = stream_get_line($f,10000,"\n")) {
        $ln++;
        $b = explode(',',$line);
        
        if(in_array($b[2],$codes[$var]))
        {
            $array[] = array($b[0],$b[1],$b[2],$b[3],$b[4],$b[5]);
            #$array[$b[0]][$b[1]][$b[2]] = array($b[3],$b[4],$b[5]); # ~50MB instead of 70MB
        }
        
        if($ln == 7500)
        {
            $ln = 0;
            usleep(500000); # half second break each 7500 lines, be kind with CPU
        }
    }
    
    return $array;
}

$file = '20111212realtime_gen.csv';
$arr = array('nyc' => array(23512,23513,23515,23516,23517,23518,23519,23520,23523,23524,23533,23534,23535,23538,23540,23541,23657,23660,23729,23770,23786,23810, 23816,23817,23820,24077,24078,24079,24080,24084,24094,24095,24096,24097,24098,24099,24100,24101,24102,24103,24104,24105,24106,24107, 24108,24110,24111,24112,24113,24114,24115,24116,24117,24118,24119,24120,24121,24122,24123,24124,24125,24126,24127,24128,24129,24130, 24131,24132,24133,24134,24135,24136,24137,24138,24149,24152,24155,24156,24157,24158,24159,24160,24161,24162,24163,24168,24195,24196, 24202,24225,24226,24227,24228,24229,24230,24231,24232,24233,24234,24235,24236,24237,24238,24239,24240,24241,24242,24243,24244,24245, 24246,24247,24248,24249,24250,24251,24252,24253,24254,24255,24256,24257,24258,24259,24260,24261,323558,323559,323566,323567,323568,323569,323581,323582,323595,323610,323651,323677,323678,923512,923533,923568,924077,924094,924106,924149,924156,924157,924158,924160,924162,924228),
'millwood' => array(23530,23531,23653,23659,23776,24019,24139,24193,24198,323649),
'hudv1' => array(23526,23586,23587,23588,23589,23590,23591,23592,23593,23595,23607,23608,23609,23610,23611,23612,23632,23639,23640,23641,23642,23654,23754,23765,23769,24000,24148,24192,323565,323602,323613,323627,323648,923586,923587),
'dunwood' => array(23655,24194,323650)
);

$a = array('nyc','millwood','hudv1','dunwood');
foreach($a as $row)
{
    $f = fopen($file, "r");
    ${$row} = my_parser($f, $arr, $row);
    fclose($f);
}

# ${$row} creates the $nyc, $millwood, $hudv1 and $dunwood array variables from the names in $a.

echo 'nyc: ' . count($nyc) . "\n";
echo 'millwood: ' . count($millwood) . "\n";
echo 'hudv1: ' . count($hudv1) . "\n";
echo 'dunwood: ' . count($dunwood) . "\n";
echo memory_get_peak_usage(); # you can remove this
echo "\n";
?>

Hope it helps, bye :)



wow, nice. Input is much appreciated, from everyone.
Thanks a ton.

@prit > I don't know if that's what I'm going for, but I'll definitely check it (thanks for the suggestion ;))

Cereal > what you have done is incredible; now I must spend the proper time to fully understand it.

I have also been tinkering some with SplFileObject, as it does seem to parse 10x faster.
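For example, something like this is the SplFileObject shape I've been trying (a sketch only; the PTID column index follows cereal's layout above):

```php
<?php
// Sketch: iterate the CSV row by row with SplFileObject instead of
// building one giant array first. Column 2 (the PTID) follows the
// layout used in cereal's code above.
function count_matching_rows($path, array $ptids)
{
    $file = new SplFileObject($path);
    $file->setFlags(SplFileObject::READ_CSV | SplFileObject::SKIP_EMPTY);
    $count = 0;
    foreach ($file as $row) {
        if (isset($row[2]) && in_array($row[2], $ptids)) {
            $count++;   // row belongs to one of the zones of interest
        }
    }
    return $count;
}
```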

My desired final product will only consist of a few values. I just realized, after looking at cereal's code, that I have been going about this all wrong.

Thanks to ardav for putting me on the right track, and to cereal for making me realize that I'm going about this all wrong :0

Anyway, I'll come back in a bit. Hopefully my next post is a final product that is working in full.
I am running this code now on a FULL day's CSV, which is 139k lines (starts at about 500?).

thanks again,


Just wanted to drop in and supply an update. I was able to find the data in 5-minute intervals, so I can pull that no problem.

If I average the values before I put them in the database, I don't have to worry about having 200k records to truncate nightly.

Cereal > your code is great; it does what it needs to do. I just made a few minor adjustments to preg_split on the first space and every comma instead of exploding, so that I can have the date and time broken into different parts, since I need to take an average of times 0:00-6:55 and 23:00-0:00, as well as hours 7:00-22:55, to determine the on/off-peak averages.
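The adjustment amounts to something like this (sketched here with explode/array_splice, which has the same effect as the preg_split I described):

```php
<?php
// Sketch: split a raw CSV line on every comma, then split only the first
// field on its first space, so date and time end up in separate slots.
function split_date_time_row($line)
{
    $fields = explode(',', $line);                     // every comma
    list($date, $time) = explode(' ', $fields[0], 2);  // first space only
    array_splice($fields, 0, 1, array($date, $time));  // replace field 0
    return $fields;  // date, time, then the remaining fields
}
```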

Anyway, thank you to everyone who contributed. My headache is gone now :)


For some reason my posts are not posting? Can you see this?

edit*** nm, I didn't realize there was a page 2; I was just thinking it was one giant thread.

