
The Fastest way to read a .txt File

I am reading large comma-delimited .txt files (about 50 MB).
Currently I use the method below to step through the lines in the file.
Another application reads exactly the same .txt file.
That application reaches the end of the file in 5 seconds, while my method below takes 50 seconds. (I don't know what method that application uses.)

So I wonder whether there is a more efficient way to read the file than mine.
I have heard and read that opening the file in binary mode is more efficient, but I don't know how to do this or what data I will get from ifstream.
The lines in the file look like this:

Monday,1,2
Tuesday,2,3
Wednesday,3,4

std::string Text1;
double Number1 = 0;
double Number2 = 0;
char Comma;

	ifstream LargeFile("C:\\LargeFile.txt");

	while( getline(LargeFile, Text1, ',') )		
	{						
	     LargeFile >> Number1;		        
	     LargeFile >> Comma;                   
	     LargeFile >> Number2;			
	     LargeFile.get();	
	}
	MessageBox::Show("File has Reached End");
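
On the binary-mode question, here is a minimal sketch, assuming the whole file fits in memory, of reading it with a single ifstream::read call. The helper name read_whole_file is hypothetical and does not come from the thread. In binary mode the bytes arrive unchanged (on Windows that includes the '\r' of each "\r\n" line ending), so the lines still have to be split afterwards.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: read an entire file into a std::string.
// Returns an empty string if the file cannot be opened or is empty.
std::string read_whole_file(const std::string& path)
{
    // std::ios::ate opens the stream positioned at the end of the file.
    std::ifstream in(path.c_str(), std::ios::binary | std::ios::ate);
    if (!in)
        return std::string();

    std::streamoff size = in.tellg();   // position at the end == file size
    if (size <= 0)
        return std::string();

    std::string data(static_cast<std::size_t>(size), '\0');
    in.seekg(0, std::ios::beg);
    in.read(&data[0], size);            // one bulk read instead of many >> calls
    data.resize(static_cast<std::size_t>(in.gcount()));
    return data;
}
```

Whether this beats the formatted >> loop on a given compiler is something to measure rather than assume.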
Jennifer84
Posting Pro
564 posts since Feb 2008
Reputation Points: 10
Solved Threads: 1
 

The complete example should look like this, rather than the version in my previous post.
I both read from and write to a .txt file.

std::string Text1;
double Number1 = 0;
double Number2 = 0;
char Comma;

ofstream OutPut;
OutPut.open("C:\\OutPut.txt");

	ifstream LargeFile("C:\\LargeFile.txt");

	while( getline(LargeFile, Text1, ',') )		
	{						
	     LargeFile >> Number1;		        
	     LargeFile >> Comma;                   
	     LargeFile >> Number2;			
	     LargeFile.get();	

              OutPut << Text1 << ',' << Number1 << ',' << Number2 << '\n';
	}

	MessageBox::Show("File has Reached End");
Jennifer84
 

Try a C-style implementation. I don't know whether it's faster or slower, so you will have to test it with your huge file.

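#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
using namespace std;

int main ()
{
    char text[80];
    long n1, n2;
    FILE* fp = fopen("..\\TextFile1.txt", "r");
    if(fp)
    {
        while( fgets(text, sizeof(text), fp) )
        {
            char* p = strtok(text, ",");    // first field; text now ends at the first comma
            p = strtok(NULL, ",");
            if(!p) continue;                // skip malformed lines
            n1 = atol(p);
            p = strtok(NULL, ",");
            if(!p) continue;
            n2 = atol(p);
            cout << text << " " << n1 << " " << n2 << "\n";
        }
        fclose(fp);                         // only close a stream that was actually opened
    }
    return 0;
}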
Ancient Dragon
Retired & Loving It
Team Colleague
30,049 posts since Aug 2005
Reputation Points: 5,662
Solved Threads: 2,343
 

Since you are not doing anything with those integers except output them to another file, there is no reason to convert them from char* to int.

int main ()
{
    char text[80];
    FILE* fp = fopen("..\\TextFile1.txt", "r");
    if(fp)
    {
        while( fgets(text, sizeof(text), fp) )
        {
            size_t len = strlen(text);
            if(len > 0 && text[len-1] == '\n')
                text[len-1] = 0;            // strip the trailing newline
            char* p1 = strtok(text, ",");
            char* p2 = strtok(NULL, ",");
            char* p3 = strtok(NULL, ",");
            if(p1 && p2 && p3)              // skip lines with fewer than three fields
                cout << p1 << "," << p2 << "," << p3 << "\n";
        }
        fclose(fp);                         // only close a stream that was actually opened
    }
    return 0;
}
Ancient Dragon
 

What about using fread() to read a large buffer and then write it out ?

stilllearning
Posting Whiz
309 posts since Oct 2007
Reputation Points: 161
Solved Threads: 43
 

Yes, that can be done in a loop. If you are on MS-Windows, just call the Win32 API function CopyFile().

Ancient Dragon
 

The only addition to Ancient Dragon's C stream library method: add a setvbuf call after fopen:

const size_t BSZ = 1024*32; // or more
...
FILE* fp = fopen("..\\TextFile1.txt", "r");
if (fp) 
{
    setvbuf(fp,0,_IOFBF,BSZ); // No need to free buffers explicitly
    ...

The default stream buffer size is too small for huge files, so you should get much faster file reading. As usual, in VC++ the C streams and data conversions are faster than the C++ ones.
It's possible to accelerate C++ streams with a proper streambuf setup, but that's another story, and VC++'s slow getline absorbs the effect...
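
A minimal sketch of the suggestion above, filled out into a complete function so it can be timed with different buffer sizes. The helper name count_lines and the 256-byte line array are assumptions for illustration; how much a larger buffer helps depends on the C library, which is exactly what the following posts argue about.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper: count the lines of a file with fgets on a stream
// whose buffer was enlarged with setvbuf. Returns -1 on failure.
long count_lines(const char* path, std::size_t buffer_size)
{
    std::FILE* fp = std::fopen(path, "r");
    if (!fp)
        return -1;

    // setvbuf has to be called before any other operation on the stream.
    // Passing NULL lets the library allocate and later release the buffer.
    if (std::setvbuf(fp, NULL, _IOFBF, buffer_size) != 0) {
        std::fclose(fp);
        return -1;
    }

    char line[256];
    long lines = 0;
    bool unfinished = false;    // true while inside a line with no '\n' seen yet
    while (std::fgets(line, sizeof line, fp)) {
        std::size_t len = std::strlen(line);
        if (len > 0 && line[len - 1] == '\n') {
            ++lines;            // a long line arrives in pieces; count it once
            unfinished = false;
        } else {
            unfinished = true;
        }
    }
    if (unfinished)
        ++lines;                // final line without a trailing newline

    std::fclose(fp);
    return lines;
}
```

Calling it as count_lines("C:\\LargeFile.txt", 32 * 1024) and again with a much smaller size gives a direct comparison on one machine.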

ArkM
Postaholic
2,001 posts since Jul 2008
Reputation Points: 1,234
Solved Threads: 348
 

Hmm, I don't think calling setvbuf has any effect on reading a file. If you loop fgets calls, the file is simply read line by line and the data is placed directly in the buffer you provide as a parameter (I think).
As far as I know, setvbuf sets the output buffer when writing a file, and as far as I know all file output operations are block-buffered by default, with the buffer set to the optimum size...
Your speed problem is that you read the file line by line. For optimum speed you should fread chunks of 512 or 1024 bytes (the optimum size would be your disk's cluster or sector size, or whatever) and do the processing in memory.

kux
Junior Poster
119 posts since Jan 2008
Reputation Points: 66
Solved Threads: 11
 

ok, I was actually curious to see the effect of setvbuf over fgets

#include <stdio.h>
#include <iostream>
#include <assert.h>
using namespace std;

int main(int argc, char* argv[])
{
	const char* fname = "d:\\test162MB.txt";
	FILE *fp = fopen( fname, "rb" );
	assert( fp != NULL );	// check the stream before passing it to setvbuf

	int x = setvbuf(fp, (char *)NULL, _IOFBF, 512);
	assert( x == 0 );

	char mysmallbuf[20];
	while ( fgets( mysmallbuf, 20, fp ) )
	{
	}

/*
	char mybigbuff[1024];
	while ( fread( mybigbuff, 1024, 1, fp ) )
	{
	}
*/
	fclose(fp);
	return 0;
}


Running the code above with fgets, the 162 MB file is read in 8 seconds.
Running it with fread, it is read in 2 seconds. So my conclusion is that no intermediate 512-byte buffer is used for reading large chunks of the file. (I thought that the first fgets call would read 512 bytes and store them in a buffer, and that the following fgets calls would read from that intermediate buffer rather than directly from the file, giving the same speed as the fread version. It seems not: setvbuf just has no effect on fgets; fgets is line-buffered by default.)

kux
 

The fread sounds interesting. I am used to VC++, so some of the calls here are new to me. First I will show exactly what the lines in the file look like:

Monday,1.1,1.2,1.3,1.4,1.5,1.6,1.7
Tuesday,1.1,1.2,1.3,1.4,1.5,1.6,1.7
Wednesday,1.1,1.2,1.3,1.4,1.5,1.6,1.7

Some questions:
In fread(), I understand that fp points to the file that will be read.
mybigbuff: I am not really sure what it stands for, but I think it is a buffer where the data is stored?
The 1024 that follows should be how many bytes are read each time?
I have put the number 8 next because I read 8 comma-delimited values, but I don't know what this number stands for.
I tried the number 1 as in the example in the previous post, but the program failed with an error message that said: "Expression: nptr != NULL"
Also, I don't know what "rb" stands for in fopen(fname, "rb"); I know "r" stands for reading.
The second argument should be the mode.


However, if I use the code below to read this huge file (130 MB), the message box shows after less than 0.5 seconds, which is very fast.
I try to use ofstream OutPut to write some values to a file, but nothing is written.
I wonder if I am doing this correctly. I find this really interesting, as I will read thousands and thousands of these files.

ofstream OutPut;
OutPut.open("C:\\out.txt");

double n1, n2, n3, n4, n5, n6, n7;
	
const char* fname = "C:\\test130mb.txt";
FILE* fp = fopen(fname, "rb");


    char mybigbuff[1024];
    while( fread(mybigbuff, 1024, 8, fp) )
    {
        char* p = strtok(text,",");
        p = strtok(NULL, ",");
        n1 = atol(p);
        p = strtok(NULL, ",");
        n2 = atol(p);
        p = strtok(NULL, ",");
        n3 = atol(p);
        p = strtok(NULL, ",");
        n4 = atol(p);
        p = strtok(NULL, ",");
        n5 = atol(p);
        p = strtok(NULL, ",");
        n6 = atol(p);
        p = strtok(NULL, ",");
        n7 = atol(p);

OutPut << text << " " << n1 << "\n";  //This does not give any OutPut
    }

  fclose(fp);
  MessageBox::Show("File has Reached End");
Jennifer84
 

If you want to understand fread and setvbuf, the best option is to check their man pages. You can do so by searching for "man fread" on Google, or by typing man fread in a Linux terminal with the C standard library documentation installed. Anyway, to make it short, you can forget setvbuf as it won't help you much. As for fread:

size_t
fread(void *restrict ptr, size_t size, size_t nitems,
FILE *restrict stream);

ptr is the buffer in which to store the data read, size is the size in bytes of each item, and nitems is the number of items of that size to read, so you actually read size * nitems bytes. What you did above is wrong: you actually read 8 * 1024 bytes in one shot. You should just stick to fread(buffer, 1024, 1, fp).

OK, now assuming your file is ASCII encoded, each character is one byte wide. You read exactly 1024 bytes, which means that after one fread you get A LOT of lines in your buffer. The downside of this approach is that it's up to you to interpret the contents of that buffer correctly, by looping successive strtok calls to get the data in your desired format. Another downside is that the file position indicator after one read can point anywhere, most likely somewhere inside a line rather than at its end. So if you interpret the buffer contents line by line and the last line is not "complete", you have to complete it with the content from the beginning of the next buffer you read from the file.
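
The "complete the last line from the next buffer" step can be sketched as follows. This is only an illustration under stated assumptions: the helper name read_lines_chunked and the carry string are hypothetical, and the sketch passes size = 1 and nitems = chunk size to fread (instead of 1024 and 1) so that the return value is a byte count and a short final chunk is not dropped.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: read a file in fixed-size chunks with fread and
// return its lines. A line cut off at the end of one chunk is carried
// over and completed from the start of the next chunk.
std::vector<std::string> read_lines_chunked(const char* path,
                                            std::size_t chunk_size = 1024)
{
    std::vector<std::string> lines;
    if (chunk_size == 0)
        return lines;
    std::FILE* fp = std::fopen(path, "rb");
    if (!fp)
        return lines;

    std::vector<char> chunk(chunk_size);
    std::string carry;                  // unfinished tail of the previous chunk
    std::size_t got;
    while ((got = std::fread(&chunk[0], 1, chunk.size(), fp)) > 0) {
        std::size_t start = 0;
        for (std::size_t i = 0; i < got; ++i) {
            if (chunk[i] != '\n')
                continue;
            carry.append(&chunk[0] + start, i - start);
            if (!carry.empty() && carry[carry.size() - 1] == '\r')
                carry.erase(carry.size() - 1);      // "\r\n" line endings
            lines.push_back(carry);
            carry.clear();
            start = i + 1;
        }
        // Whatever follows the last newline is an incomplete line.
        carry.append(&chunk[0] + start, got - start);
    }
    std::fclose(fp);

    if (!carry.empty())
        lines.push_back(carry);         // final line without a trailing newline
    return lines;
}
```

Collecting every line in a vector keeps the sketch short; code that cares about speed would more likely parse each line as soon as it is complete and avoid the copies.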

good luck

kux
 

> running the following code with fgets the 162 mb file is read in 8 secs
> running with fread it is read in 2 secs.
Careful, make sure the second one isn't benefitting from the first one causing information to be cached somewhere.
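
One way to see whether caching is distorting a comparison is to time the same read several times and compare the first run with the later ones. A rough sketch, assuming a C++11 compiler; the helper name time_runs is hypothetical, and this only exposes warm-cache effects, it does not flush the cache.

```cpp
#include <chrono>
#include <vector>

// Hypothetical helper: run a callable several times and return the
// duration of each run in seconds. If the first run is much slower than
// the rest, the later runs are probably being served from a cache.
template <typename F>
std::vector<double> time_runs(F work, int runs)
{
    std::vector<double> seconds;
    for (int i = 0; i < runs; ++i) {
        std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
        work();
        std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    return seconds;
}
```

Running the fgets version and the fread version through the same harness, in both orders, gives a fairer picture than a single run of each.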

The problem with using fread() for line oriented data is that you have a hell of a repair job with the last line.
Eg, suppose the two consecutive fread() calls look like this red/green
Monday,1.1,1.2,1.3,1.4,1.5,1.6,1.7
Tuesday,1.1,1.2,1.3,1.4,1.5,1.6,1.7
Wednesday,1.1,1.2,1.3,1.4,1.5,1.6,1.7


Nor (as your example so aptly shows the lack of) will fread() automatically append a \0 to the buffer to stop things like strtok from wandering off into the weeds.

Salem
Posting Sage
Team Colleague
11,531 posts since Dec 2005
Reputation Points: 5,862
Solved Threads: 953
 

Yep, Salem is right, but if you need speed...
Another way would be to use memory-mapped files.

kux
 

>> that means after one fread u get A LOT of lines in your buffer

Now I understand what the 1 stands for, and 8 is of course wrong there. I also see how it processes 1024 bytes each time and that it will often read into the middle of a line.
Leaving this for a moment: now you have got me interested in what memory-mapped files are and how they work.
Could this be as fast as the fread() method?

Jennifer84
 

Haven't tried it yet :P The idea is that you map a file into memory and no longer have to use file I/O operations on it. It is supposed to be very fast for small files because you don't have to make system calls to read from the file, but for large files it can be slower than the fread version because of the numerous page faults that can occur.
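
For readers who have not seen a memory-mapped file, here is a rough sketch using the POSIX calls open, fstat, mmap and munmap. That platform choice is an assumption made for illustration: the thread is about VC++ on Windows, where the corresponding Win32 calls are CreateFileMapping and MapViewOfFile. The helper name count_newlines_mapped is hypothetical.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper (POSIX only): map a file read-only and count its
// newline characters without any explicit read calls.
// Returns -1 on failure.
long count_newlines_mapped(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {              // mmap rejects a zero-length mapping
        close(fd);
        return 0;
    }

    std::size_t size = static_cast<std::size_t>(st.st_size);
    void* addr = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          // the mapping stays valid after close
    if (addr == MAP_FAILED)
        return -1;

    const char* data = static_cast<const char*>(addr);
    long count = 0;
    for (std::size_t i = 0; i < size; ++i)   // pages are faulted in on first touch
        if (data[i] == '\n')
            ++count;

    munmap(addr, size);
    return count;
}
```

The file then looks like one long character array, so the split-line problem of chunked fread does not arise; whether it is faster than fread for files of this size is something to measure.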

kux
 

Not being able to test things at the moment, I would have thought something like this may have been useful.

FILE* fp = fopen(fname, "r");
char buff[BUFSIZ];  // for fgets
int x = setvbuf(fp, (char *)NULL, _IOFBF, BUFSIZ*10);  // underlying reader reads large amounts
while ( fgets( buff, sizeof buff, fp ) != NULL ) {
  // do something with a line
}


On a byte for byte basis, fgets() is more expensive than fread() because it is parsing the stream looking for newlines. But if you decide to fread(), then you need to replicate some of that behaviour yourself (for the split last line I mentioned previously).
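
The "do something with a line" step could look like the sketch below for lines of the form Monday,1.1,1.2,... shown earlier in the thread. The Record type and the parse_line name are hypothetical. The sketch uses strtod rather than atol because the values have fractional parts, which atol would drop.

```cpp
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical record type for one line such as "Monday,1.1,1.2,1.3".
struct Record {
    std::string day;
    std::vector<double> values;
};

// Parse one comma-delimited line in place (strtok modifies the buffer).
// Returns false if there is no day field or a number fails to convert.
bool parse_line(char* line, Record& out)
{
    line[std::strcspn(line, "\r\n")] = '\0';    // drop the line ending, if any

    char* field = std::strtok(line, ",");
    if (!field)
        return false;
    out.day = field;
    out.values.clear();

    while ((field = std::strtok(NULL, ",")) != NULL) {
        char* end = NULL;
        double v = std::strtod(field, &end);
        if (end == field || *end != '\0')       // nothing parsed, or trailing junk
            return false;
        out.values.push_back(v);
    }
    return true;
}
```

Inside the fgets loop this would be called as parse_line(buff, record), with record declared once outside the loop so its vector storage is reused.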

Salem
 

>> But if you decide to fread(), then you need to replicate some of that behaviour yourself (for the split last line I mentioned previously).

This method (fgets) is a good one compared to ifstream. It is actually 10 times faster: it reads a file in 8 seconds instead of 80 seconds, even though it is searching for newlines ('\n').
As I was not sure, I tested fread() as well, and it reads the same file in less than 1 second.
What I wonder is whether the "repair job" that has to be done for each chunk of 1024 bytes will make the process much slower than that.
Perhaps this is difficult to say, though? (I am not sure yet how to do this repair, i.e. how to find the '\n'.)

Jennifer84
 

Whilst 80 to 8 seems like a good win with minimal effort, I'm not sure that 8 seconds down to 1 or 2 is of any further benefit, given the sudden jump in code complexity (time to write, time to debug, time to maintain).

If it takes you 8 hours to do that, that's 28800 seconds.
At a saving of say 6 seconds a run, that's 4800 runs before you break even.

Salem
 

Ohhh, come on, it can't take you EIGHT hours... :)

kux
 

Including maintenance effort for the next how many years (say 20), that's exactly what I'm saying (and that's probably very light).

The author leaves, some undiscovered bug lies hidden for a couple of further years and then some poor maintenance guy is left with "WTF!!!???" and the hours tick by....

Sure, you could probably lash something together in about 10 minutes, but that's nowhere near the total cost.

Every time someone reads the code, rather than skipping over some intuitive and familiar fgets() call, they spend a few seconds or minutes wondering whether the tricky fread() and buffer manipulation is doing what it's supposed to (more time on the clock).

Salem
 
