1.11M Members

Parsing large CSV

 
0
 

Hi,

I have quite a large CSV file (Around 20,000 rows with about 20 columns) that I am trying to manipulate. Initially I am looking for a way to get out the first 10 or so records after they have been sorted in ascending order on one of the numeric fields. I've thought of loading the file into memory and then sorting it from there using one of PHPs sort functions but as expected, the memory is simply exhausted. My current thoughts on this would be to read line by line attempting to place each record into an array (Which will be limited to 10), by comparing it to the "sort field" of the other records in the array. Before getting too far into writing all of this, I thought it would be wise to see if anyone else had any suggestions or a better approach to this problem.

Cheers.

 
0
 

There are probably better ways, but I would make a temporary database with mysql and store them in there for easier manipulation. I would write a script to insert a few hundred at a time, and add the records gradually to avoid PHP execution limit, with an echo for every insert to show me that it is still running.
Then I would run the query:
"SELECT * FROM table ORDER BY column ASC, limit 10"

 
0
 

If you have shell like ksh or bash you can manipulate your file with sort and head or tail commands ...
You can also use awk !

 
0
 

Are you using file_get_contents()? If so, have you tried using fopen() followed by fgets() or fgetcsv()? I haven't tried it myself, but from what I've read fopen() doesn't actually load the file into memory, and you can use fgets() or fgetcsv() to get one line at a time:

When you call fopen, nothing of the file is read into memory, it only creates a descriptor of the file you want to work on. Included in this descriptor is a pointer to a position in the file which is an integer type, and must be able to point to any byte in the file. So if the file has more bytes than the integer can address, fopen fails because it cannot address the entire file.

From http://ubuntuforums.org/archive/index.php/t-586975.html (see post #5)

 
0
 

Hi,

Creating a MySQL table with the CSV file is something I had considered before but didn't think it would be the best approach. The CSV file can change multiple times per day, so unless I can figure out a way of making sure the database table is always up to date with the file.

I'll take a look at using the shell and I'll have a look at AWK too.

I am using fopen(), which just opens a stream to the file and them I am using fgetcsv() to read line by line. I know this doesn't load the file into memory, what I meant was that in order to sort the file I had initially thought of loading it into memory and using PHPs sorting functions but it wasn't an option because of the size of the file.

Thanks for your responses. If anyone has any other suggestions please let me know!

Cheers.

 
0
 

I'd go with the DB myself. Why is this CSV being bandied about though? How is the data being used? Where is the CSV being stored? What advantage does storing the data in csv format give you? Just questions - not accusations! Sorting 20K rows with MySQL is easy. Slightly more of an issue with files!

 
0
 

My situation is that a third party will generate a CSV flat file which is essentially a cache of live product data (The third party only offers this in CSV format) and upload it to a location on my webserver via FTP multiple times per day. What I am attempting to create is a web application that allows a user to search, filter and sort through this CSV file for speedy access to the data (Accessing the live data is quite a slow process). Of course, given the choice, I would prefer this cached data to go straight into a database but it's not something the third party supports.

It does seem like the database would be the best option if it is possible to keep the data within the file synchronised with the database. Do you have any suggestions on how this could be achieved?

Thanks for your responses!

 
0
 

Do you think the following approach would be wise to consider?

  • Before the main bulk of the application begins, check the timestamp of the CSV file using filemtime().
  • Compare this timestamp with the contents of another file (dbtimestamp.txt) which will be used to hold the "last modified time" of the database table.
  • If these two timestamps are the same, we can continue to use the database as normal, since the files will be in sync.
  • If these two timestamps are different, empty the database and re-fill it with the contents of the CSV file
  • Update the dbtimestamp.txt file with the timestamp returned by filemtime() which will tell us in the future that the file and database table are in sync.

Thanks

 
0
 

Ideally this process should be automated, so as a file arrives in a folder, a trigger sets the ball rolling and does all these updates for you. I can't imagine the time spent on fussing with this :(
You have a few options here - you could get a daemon to monitor your upload folder if runnign Unix. Also, you could run a cron job every minute or so - depends on the resources used. php can also have support for inotify - maybe worth a look.

With regard to your 20,000 record update. Sounds major.

Would having the new records be inserted to a new table be better?

  1. create a new table with the various fields (SQL query) called new_table
  2. once complete, rename table to old_table
  3. rename new_table to table
  4. drop old_table

I'm reaching here as I've never had to do massive inserts like this. I'll have a look for alternatives, unless somebody else jumps in (hopefully!).

/?EDIT - the inserts should be via LOAD INFILE as opposed to manual insert: http://dev.mysql.com/doc/refman/5.1/en/load-data.html

 
1
 

Forget my nonsense with the 1-4 steps. I replicated a 20000 record 20 field (ints) csv and tried the LOAD FILE. It completed in about a second. Obviously it will be longer for more complex data.

$q = mysql_query("DELETE FROM csvholder");
$q = mysql_query("LOAD DATA INFILE 'c:/xampp/htdocs/treialon/big.csv' INTO TABLE csvholder FIELDS TERMINATED BY ','")or die('it broke!');

That did it for my Windows setup.

 
0
 

I once put all the UK postcode blocks with their accompanying longitudes/latitudes and northings/westings into a MySQL database, I dont remember exactly how many records, but I think it was more than 20,000. I set up a script that would insert 10 to each query, and 100 queries, and the page would then reload and go on with the next 100X10 block. The whole thing took over 15 mins to complete, but it was all there at the end of it.
Would you really need to recreate the whole database everytime? Could you not only take the new entries to the CSV file since your last update? Maybe check DB for last entry, search CSV for that entry, and grab all previous lines and insert them to DB? How many entries are likely to have backlogged between each user visit? Could the user visit itself not trigger the update process? How often is the CSV file written to? Is data being overwritten or appended?
I love challenges like this, ones you can get your teeth into, rather than making a few bits of text dance :D

 
0
 

Hi and thanks again for all the suggestions.

I had thought of doing a cron job before but really wanted to keep the full solution in php with no reliance on external functions in order to be as portable as possible. The same thing applies with the suggestion of using a daemon to monitor the folder and trigger an update. Of course I will look at implementing these if need be.

I have had a little mess around using LOAD DATA and that seems to be pretty useful. At the moment I am trying to get my third party to change the formats of some of the fields, particularly the datetime fields as they are currently not in a SQL friendly format and therefore not being correctly inserted into the database when the LOAD DATA statement is executed. Should I manage to get what I need from my third party I am looking into writing a solution that will, as I described above, keep a "timestamp" file with a record of the last database update and compare it to the last modified attribute of the CSV file to decide whether or not to update the table. I am hoping to use DELETE FROM and LOAD DATA to import the file into the database should it need to be updated. It did, as you mentioned only take around 1 second to import the full file into the database which is much better than I expected (In my initial post I made a typo when I wrote 20,000 records. The actual number of records is close to 200,000)!

Adam: unfortonately the nature of the data I am working with isn't as simple as a bunch of products. I am working with packaged holidays of which different parts come from different suppliers and as you might imagine, each product can be quite complex and the data can change considerably in each update. Therefore updating only the records that have changed would be a difficult task, especially considering there are no primary keys in the data! I'd imagine this would take probably longer as well to do as much comparing and processing than the LOAD DATA function would take to insert the whole lot. As I mentioned in the last paragraph, I am hoping to make a visit to the page trigger the update process if I can manage to sort it. The CSV file will be uploaded to my server at regular intervals on a daily basis, I think about 3-4 times. The file will be completely overwritten. You're not wrong there, nothing like a good challenge, I get sick of writing HTML day in day out!

I'll post back here to let you know how it goes, and if it goes well possibly post some code up for anyone who might need help with such a situation in the future.

Cheers!

 
0
 

Hi again,

I've managed to get something working using the method I described above (Checking for timestamps and the last modified time attribute of the CSV and using LOAD DATA to insert data from the CSV into the data table). See below for an example of the code:

<?php

date_default_timezone_set("Europe/London");

$start = time();

define("CSVFILENAME", dirname(__FILE__) . DIRECTORY_SEPARATOR . "data" . DIRECTORY_SEPARATOR . "datafile.csv");
define("TIMESTAMPFILENAME", dirname(__FILE__) . DIRECTORY_SEPARATOR . "data" . DIRECTORY_SEPARATOR . "timestamp.txt");

define("DBHOST", "localhost");
define("DBNAME", "databasename");
define("DBUSER", "username");
define("DBPASS", "password");


// Connect to the database.
$db = new mysqli(DBHOST, DBUSER, DBPASS, DBNAME);
if ($db->connect_errno) {
    die("Database connection failed: " - $db->connect_error);
}


// Get the database table timestamp from the timestamp file.
$timestamp = 0;
if (file_exists(TIMESTAMPFILENAME)) {
    $fp = fopen(TIMESTAMPFILENAME, "r");
    $timestamp = fread($fp, 100);
    fclose($fp);
}
echo "Database timestamp: " . $timestamp . "<br>";


// Get the last modified attribute of the CSV file.
$csvLastModified = 1;
if (file_exists(CSVFILENAME)) {
    $csvLastModified = filemtime(CSVFILENAME);
}
else {
    die("No data file");
}
echo "CSV last modified: " . $csvLastModified . "<br>";


// The last database update (timestamp) and the last modified time of the CSV
// are not the same. Update the database and then update the timestamp file
// with the last modified time of the CSV file.
if ($timestamp != $csvLastModified) {    
    echo "Updating database<br>";
    if ($db->query("DELETE FROM `table`;") === TRUE) {
        echo "Deleted all records<br>";
        if ($db->query("LOAD DATA INFILE '" . CSVFILENAME . "' INTO TABLE `table` FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"';")) {
            echo "Loaded file into table<br>";
            if (($fp = fopen(TIMESTAMPFILENAME, "w")) !== FALSE) {
                echo "Updating timestamp file<br>";
                fwrite($fp, strval($csvLastModified));            
                fclose($fp);
            }
            else {
                echo "Problem opening timestamp file<br>";
            }
        }
        else {
            echo "Problem loading file into table<br>";
            echo $db->error . "<br>";
        }
    }
}
else {
    echo "Using current datatable<br>";
}


echo time() - $start . " seconds<br>";

?>

I did run into a problem getting LOAD DATA to work because the user of the database did not have the global FILE permission. Other than that, it works pretty well, the database is only updated when the last modified time of the CSV file changes and thanks to LOAD DATA it doesn't take long at all update.

Thanks a lot for your help and suggestions!

Question Answered as of 1 Year Ago by diafol, adam.adamski.96155, EvolutionFallen and 1 other
You
This question has already been solved: Start a new discussion instead
Post:
Start New Discussion
View similar articles that have also been tagged: