I am reading large comma-delimited .txt files (about 50 MB).
Currently I use the method below to step through the lines in the file.
I have one other application that reads the exact same .txt file that I do.
That application reaches the end of the text file in 5 seconds, while my method below takes 50 seconds. (I don't know what method that application uses.)

So what I wonder is whether there is a more effective way to read the txt file than the one I use.
I have heard and read that opening the file in binary mode is more efficient, but I don't know how to do this or what data I would get from ifstream.
The lines in the txt file that I read look like this:

Monday,1,2
Tuesday,2,3
Wednesday,3,4

std::string Text1;
double Number1 = 0;
double Number2 = 0;
char Comma;

ifstream LargeFile("C:\\LargeFile.txt");

while ( getline(LargeFile, Text1, ',') )
{
    LargeFile >> Number1;
    LargeFile >> Comma;
    LargeFile >> Number2;
    LargeFile.get();    // consume the newline
}
MessageBox::Show("File has Reached End");


The complete example should look like this instead of the previous post.
I both read from and write to a .txt file.

std::string Text1;
double Number1 = 0;
double Number2 = 0;
char Comma;

ofstream OutPut;
OutPut.open("C:\\OutPut.txt");

ifstream LargeFile("C:\\LargeFile.txt");

while ( getline(LargeFile, Text1, ',') )
{
    LargeFile >> Number1;
    LargeFile >> Comma;
    LargeFile >> Number2;
    LargeFile.get();    // consume the newline

    OutPut << Text1 << ',' << Number1 << ',' << Number2 << '\n';
}

MessageBox::Show("File has Reached End");

Try a C-style implementation. I don't know if it's faster or slower, so you will have to test it with your huge file.

#include <cstdio>
#include <cstring>
#include <cstdlib>
#include <iostream>
using namespace std;

int main()
{
    char text[80];
    long n1, n2;
    FILE* fp = fopen("..\\TextFile1.txt", "r");
    if ( fp )
    {
        while ( fgets(text, sizeof(text), fp) )
        {
            char* p = strtok(text, ",");   // first token (the day name) stays in text
            p = strtok(NULL, ",");
            n1 = atol(p);
            p = strtok(NULL, ",");
            n2 = atol(p);
            cout << text << " " << n1 << " " << n2 << "\n";
        }
        fclose(fp);
    }
    return 0;
}

Since you are not doing anything with those integers except outputting them to another file, there is no reason to convert them from char* to int.

#include <cstdio>
#include <cstring>
#include <iostream>
using namespace std;

int main()
{
    char text[80];
    FILE* fp = fopen("..\\TextFile1.txt", "r");
    if ( fp )
    {
        while ( fgets(text, sizeof(text), fp) )
        {
            if ( text[strlen(text)-1] == '\n' )   // strip the trailing newline
                text[strlen(text)-1] = 0;
            char* p1 = strtok(text, ",");
            char* p2 = strtok(NULL, ",");
            char* p3 = strtok(NULL, ",");
            cout << p1 << "," << p2 << "," << p3 << "\n";
        }
        fclose(fp);
    }
    return 0;
}

What about using fread() to read a large buffer and then writing it out?


Yes, that can be done in a loop. If you are on MS-Windows, you can just call the Win32 API function CopyFile().
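
For the record, a minimal sketch of that call; this assumes <windows.h> is available, and the paths are placeholders:

#include <windows.h>
#include <iostream>

int main()
{
    // third argument FALSE = overwrite the destination if it already exists
    if ( !CopyFile(TEXT("C:\\LargeFile.txt"), TEXT("C:\\Copy.txt"), FALSE) )
        std::cout << "copy failed, error " << GetLastError() << "\n";
    return 0;
}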

The only addition to Ancient Dragon's C stream library method: add a setvbuf call after fopen:

const size_t BSZ = 1024*32;  // or more
...
FILE* fp = fopen("..\\TextFile1.txt", "r");
if (fp)
{
    setvbuf(fp, 0, _IOFBF, BSZ); // No need to free the buffer explicitly
    ...

The default stream buffer size is too small for huge files; with a bigger buffer you will get much faster file reading. As usual, in VC++ the C streams and data conversions are faster than the C++ ones.
It's possible to accelerate C++ streams with a proper streambuf declaration, but that's another story, and VC++'s slow getline absorbs the effect...
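
For comparison, the closest C++ stream equivalent is pubsetbuf on the underlying streambuf. A sketch only: it has to be called before open() to be portable, and implementations (VC++ among them) are free to ignore the hint:

#include <fstream>

const size_t BSZ = 1024*32;
static char buf[BSZ];

std::ifstream LargeFile;
LargeFile.rdbuf()->pubsetbuf(buf, BSZ);  // enlarge the stream buffer...
LargeFile.open("C:\\LargeFile.txt");     // ...before opening the file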


Hmmm, I don't think calling setvbuf has any effect on reading a file. If you loop fgets calls, the file is simply read line by line and the read data is placed directly in the buffer you provide as a parameter (I think).
As far as I know, setvbuf sets the output buffer when writing a file, and all file output operations are by default block-buffered with a buffer of the optimum size...
Your speed problem is that you read the file line by line. For optimum speed you should fread chunks of 512 or 1024 bytes (the optimum size would be your HDD cluster or sector size, or whatever) and do the processing in memory.

OK, I was actually curious to see the effect of setvbuf on fgets.

#include <stdio.h>
#include <iostream>
#include <assert.h>
using namespace std;

int main(int argc, char* argv[])
{
    const char* fname = "d:\\test162MB.txt";
    FILE* fp = fopen( fname, "rb" );
    assert( fp != NULL );

    int x = setvbuf(fp, (char *)NULL, _IOFBF, 512);
    assert( x == 0 );

    char mysmallbuf[20];
    while ( fgets( mysmallbuf, 20, fp ) )
    {
    }

/*
    char mybigbuff[1024];
    while ( fread( mybigbuff, 1024, 1, fp ) )
    {
    }
*/
    fclose( fp );
    return 0;
}

Running this code with fgets, the 162 MB file is read in 8 secs; running it with fread, it is read in 2 secs. So my conclusion is that no intermediate 512-byte buffer is used for reading large chunks of the file. (I thought that the first fgets call would read 512 bytes and store them in a buffer, and the next fgets calls would read from that intermediate buffer instead of directly from the file, giving the same speed as the fread version, but it seems not; setvbuf appears to have no effect on fgets.)

The fread sounds interesting. I am used to VC++, so some calls here are new to me. First I will show exactly what the lines in the file look like:

Monday,1.1,1.2,1.3,1.4,1.5,1.6,1.7
Tuesday,1.1,1.2,1.3,1.4,1.5,1.6,1.7
Wednesday,1.1,1.2,1.3,1.4,1.5,1.6,1.7

Some questions I wonder about:
In fread(), I understand how fp points to the file that will be read.
mybigbuff I am not really sure about, but it should be a buffer where the data is stored, I think?
The next argument, 1024, should be how many bytes will be read each time?
I have put the number 8 next, because I read 8 comma-delimited values, but I don't know what this number stands for.
I tried to put the number 1 there, as in the example in the previous post, but the program gave an error message that said: "Expression: nptr != NULL".
Also, I don't know what "rb" stands for in fopen(fname, "rb"); "r" stands for reading, I know, and the second argument should be the mode.


However, if I use the code below and read this huge file (130 MB), the message box shows after less than 0.5 sec, which is very fast.
I try to use ofstream OutPut to write some values to a file, but nothing is written.
I wonder if I am doing this correctly. I find this really interesting, as I will read thousands and thousands of these files.

ofstream OutPut;
OutPut.open("C:\\out.txt");

double n1, n2, n3, n4, n5, n6, n7;
	
const char* fname = "C:\\test130mb.txt";
FILE* fp = fopen(fname, "rb");


    char mybigbuff[1024];
    while( fread(mybigbuff, 1024, 8, fp) )
    {
        char* p = strtok(text,",");
        p = strtok(NULL, ",");
        n1 = atol(p);
        p = strtok(NULL, ",");
        n2 = atol(p);
        p = strtok(NULL, ",");
        n3 = atol(p);
        p = strtok(NULL, ",");
        n4 = atol(p);
        p = strtok(NULL, ",");
        n5 = atol(p);
        p = strtok(NULL, ",");
        n6 = atol(p);
        p = strtok(NULL, ",");
        n7 = atol(p);

OutPut << text << " " << n1 << "\n";  //This does not give any OutPut
    }

  fclose(fp);
  MessageBox::Show("File has Reached End");

If you want to understand fread and setvbuf, the best option is to check their man pages. You can do so by typing "man fread" into Google, or by typing man fread in a Linux terminal with the C standard library documentation installed. Anyway, to make it short: you can forget setvbuf, as it won't help you much. As for fread:

size_t fread(void *restrict ptr, size_t size, size_t nitems, FILE *restrict stream);

ptr is the buffer in which to store the read data, size is the number of octets per item, and nitems is the number of size-byte chunks to read, so you actually read size * nitems bytes. What you did above is wrong: you actually read 8 * 1024 bytes in one shot. You should just stick to fread(buffer, 1024, 1, fp).
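
One caveat worth adding: with size = 1024 and nitems = 1, a short final block at the end of the file makes fread return 0, so you cannot tell how many bytes of the tail you actually got. Swapping the two size arguments makes the return value a byte count; a small sketch (fp as above):

char buffer[1024];
size_t n;
// size = 1, nitems = 1024: the return value is the exact number of
// bytes read, so the short block at the end of the file is not lost
while ( (n = fread(buffer, 1, sizeof buffer, fp)) > 0 )
{
    // process exactly n bytes of buffer here
}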

OK, now, considering your file is ASCII encoded, each character is one byte wide. You read exactly 1024 bytes, so after one fread you get A LOT of lines in your buffer. The downside of this approach is that it's up to you to correctly interpret the content of that buffer by looping successive strtok calls to get the data into your desired format. Another downside is that the file position indicator after one read can point anywhere, most likely somewhere inside a line rather than at its end; so if you interpret the buffer content line by line and the last line is not "complete", you have to complete it with content from the beginning of the next buffer you read from the file.

good luck

> running the following code with fgets the 162 mb file is read in 8 secs
> running with fread it is read in 2 secs.
Careful, make sure the second one isn't benefitting from the first one causing information to be cached somewhere.

The problem with using fread() for line-oriented data is that you have a hell of a repair job with the last line.
E.g., suppose two consecutive fread() calls split the data like this:

Monday,1.1,1.2,1.3,1.4,1.5,1.6,1.7
Tuesday,1.1,1.2,1.3,1.4,   <-- first read ends here
1.5,1.6,1.7                <-- second read starts here
Wednesday,1.1,1.2,1.3,1.4,1.5,1.6,1.7


Nor (as your example so aptly shows the lack of) will fread() automatically append a '\0' to the buffer to stop things like strtok from wandering off into the weeds.
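
The fix is cheap: size the buffer one byte larger than the read and terminate it yourself. A minimal sketch (fp as before):

char buffer[1024 + 1];                              // one spare byte for the terminator
size_t n = fread(buffer, 1, sizeof buffer - 1, fp); // reads at most 1024 bytes
buffer[n] = '\0';                                   // now strtok() and friends stop here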

Yeah, Salem is right, but if you need speed...
Another way would be to use memory-mapped files.

>> that means after one fread you get A LOT of lines in your buffer

Now I understand what the 1 stands for, and 8 is of course wrong there. I also see how it processes 1024 bytes each time, and that it will often read into the middle of a line.
Leaving that for a moment: you have got me interested in what "memory-mapped files" are and how they work.
Could this be as fast as the fread() method?


Haven't tried it yet :P The idea is that you map the file into memory and no longer have to use file I/O operations on it. It is supposed to be very fast for small files, because you don't have to make system calls to read from the file, but for large files it can be slower than the fread version because of the numerous page faults that can occur.
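
For the curious, a bare-bones Win32 sketch of the idea (error handling omitted, the file name is a placeholder):

#include <windows.h>

HANDLE file = CreateFile(TEXT("C:\\test130mb.txt"), GENERIC_READ,
                         FILE_SHARE_READ, NULL, OPEN_EXISTING,
                         FILE_ATTRIBUTE_NORMAL, NULL);
HANDLE mapping = CreateFileMapping(file, NULL, PAGE_READONLY, 0, 0, NULL);
const char* data = (const char*)MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
DWORD size = GetFileSize(file, NULL);

// data[0] .. data[size-1] is now the whole file, readable like an array
// (note: the view is NOT '\0'-terminated, so don't run strtok on it as-is)

UnmapViewOfFile(data);
CloseHandle(mapping);
CloseHandle(file);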

Not being able to test things at the moment, I would have thought something like this might be useful:

FILE* fp = fopen(fname, "r");
char buff[BUFSIZ];  // for fgets
int x = setvbuf(fp, (char *)NULL, _IOFBF, BUFSIZ*10);  // underlying reader reads large amounts
while ( fgets( buff, sizeof buff, fp ) != NULL ) {
  // do something with a line
}

On a byte-for-byte basis, fgets() is more expensive than fread() because it parses the stream looking for newlines. But if you decide to fread(), then you need to replicate some of that behaviour yourself (for the split last line I mentioned previously).

>> But if you decide to fread(), then you need to replicate some of that behaviour yourself (for the split last line I mentioned previously).

This method (fgets) is a good one compared to ifstream. It is actually 10 times faster: it reads a file in 8 seconds instead of 80, and here it is searching for newlines, '\n'.
Just to be sure, I have also tested fread(), and it reads the same file in less than 1 second.
What I wonder is whether the "repair job" that has to be done for each 1024-byte chunk will make the process much slower than that sub-second time.
Perhaps this is difficult to say, though? (I am not sure how to do this repairing yet, to find '\n'.)



Whilst 80 to 8 seems like a good win with minimal effort, I'm not sure that 8 seconds down to 1 or 2 is of any further benefit, given the sudden jump in code complexity (time to write, time to debug, time to maintain).

If it takes you 8 hours to do that, that's 28800 seconds.
At a saving of say 6 seconds a run, that's 4800 runs before you break even.


Ohhh, come on, it can't take you EIGHT hours... :)

Including maintenance effort over however many years the code lives (say 20), that's exactly what I'm saying (and that's probably a light estimate).

The author leaves, some undiscovered bug lies hidden for a couple of further years, and then some poor maintenance guy is left with "WTF!!!???" and the hours tick by...

Sure, you could probably lash something together in about 10 minutes, but that's nowhere near the total cost.

Every time someone reads the code, rather than skipping over some intuitive and familiar fgets() call, they spend a few seconds or minutes wondering whether the tricky fread() and buffer manipulation is doing what it's supposed to (more time on the clock).

>> At a saving of say 6 seconds a run, that's 4800 runs before you break even

But I will read thousands and thousands of these files every day, over years of time, so every second is a huge win. The code that goes inside that loop is now over 300 pages, so some extra coding is okay for me :) The complexity of the code is perhaps more difficult to understand, but as a first step, I get this error when running the code:
Expression: nptr != NULL

double n1, n2, n3, n4, n5, n6, n7;
	
const char* fname = "C:\\Test130Mb.txt";
FILE* fp = fopen(fname, "rb");


    char mybigbuff[1024];
    while( fread(mybigbuff, 1024, 1, fp) )
    {
        char* p = strtok(text,",");
        p = strtok(NULL, ",");
        n1 = atol(p);
        p = strtok(NULL, ",");
        n2 = atol(p);
        p = strtok(NULL, ",");
        n3 = atol(p);
        p = strtok(NULL, ",");
        n4 = atol(p);
        p = strtok(NULL, ",");
        n5 = atol(p);
        p = strtok(NULL, ",");
        n6 = atol(p);
        p = strtok(NULL, ",");
        n7 = atol(p);
    }



  fclose(fp);
  MessageBox::Show("File has Reached End");


> while( fread(mybigbuff, 1024, 1, fp) )
> {
> char* p = strtok(text,",");

You're not tokenising what you read.
You're not using the fread result to work out where the end of the buffer is.
You're not appending a '\0' to stop strtok() from going into the weeds (see previous posts).

This is a new way for me to read a file, so I first have to understand the basic steps.
I will assume a few things here to see if I understand the basics of this reading.

What happens first is that 1024 bytes are read into the buffer like a string. My question here: is this the string text?
If it is, the first step is to tokenise this large string. I know that each line in the file contains 8 different values, since 7 commas delimit them. Since there are 1024 bytes, there must be a lot of lines here, so I don't understand how this will be tokenised.
It must be a technique I don't know about.
I think I would just start out like this:

while( fread(mybigbuff, 1024, 1, fp) )
{
      char* p = strtok(text,",");


Well, as Salem's post said, you don't tokenize what you read. The strtok should be
char* p = strtok(mybigbuff, ",");
because mybigbuff is where you read the file into.

Now, what I suggest you do:

First decompose the problem: my idea would be for you to first tokenize the buffer line by line, and then use one line however you want, thus simulating the fgets function.

I have a text file tokenizer class which processes ~10 million tokens per second on a low-to-medium CPU (that's the raw speed without file loading: ~0.2 sec for a 50 MB file). It loads the whole file into memory via fread, but it's also possible to map the file or use another method. The file loading time depends on the disk system configuration (and file fragmentation, of course). For my test file (50 MB, 2.5 million tokens), loading time was ~2-3 seconds (~0.2 sec once the OS had cached the file in memory).

If it seems interesting, I can present the sources...
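
For orientation, the whole-file load he describes can be as simple as the sketch below; this is illustrative only, not the actual class:

#include <cstdio>
#include <vector>

std::vector<char> load_file(const char* fname)
{
    std::vector<char> data;
    FILE* fp = fopen(fname, "rb");
    if ( fp )
    {
        fseek(fp, 0, SEEK_END);                    // find the file size...
        long size = ftell(fp);
        fseek(fp, 0, SEEK_SET);                    // ...and rewind
        data.resize(size + 1);
        size_t n = fread(&data[0], 1, size, fp);   // one big read
        data[n] = '\0';                            // terminate for the tokenizer
        fclose(fp);
    }
    return data;
}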

#include <stdio.h>
#include <stdlib.h>

#include <iostream>
#include <string>
#include <vector>

using namespace std;

void tokenizestring( const std::string& src, const std::string& delim, vector<string>& tokens )
{
    string::size_type start = 0;
    string::size_type end;

    tokens.clear();
    for ( ; ; )
    {
        end = src.find ( delim, start );
        if ( end == string::npos )
            break;
        tokens.push_back ( src.substr ( start, end - start ) );
        start = end + delim.size();
    }
}

int main ( int argc, char* argv[] )
{
    FILE *fp;
    if ( (fp = fopen("c:\\test.txt", "rb")) == NULL )
    {
        printf( "unable to open file " );
        exit( 2 );
    }

    const int READSIZE = 1024;
    const int BUFFSIZE = READSIZE + 1; // + 1 to have place for appending
                                       // a terminal character

    char buffer[BUFFSIZE];

    vector<string> tokenized_byline;

    int howmuchIread = 0;
    while ( ( howmuchIread = fread ( buffer, 1, READSIZE, fp ) ) != 0 )
    {
        buffer[ howmuchIread ] = 0; //place terminal character to avoid overruns
        string sBuffer = buffer;

        //the trick is I call an fgets to get the rest of the line and append
        //it to the end of sBuffer
        //if the file position indicator is at a newline beginning, no problem,
        //we just read another line
        if ( !feof(fp) && fgets(buffer, READSIZE, fp) )
        {
            sBuffer += buffer;
        }

        //tokenizestring tokenizes using "\r\n" as the delimiter, and stores
        //the tokenized strings in the vector tokenized_byline
        tokenizestring( sBuffer, "\r\n", tokenized_byline );

        //now you process the strings in tokenized_byline however you want
    }

    fclose( fp );

    return 0;
}

OK, you can get inspired by this.

On my Athlon 2 GHz it reads a 50 MB file in 3 secs.


The algorithm works like this:

1. You read 1024 octets from the file using fread.
2. You store the result in a std::string.
3. You then do an fgets to read the rest of the line (most likely you end up with the file position indicator inside a line; if not, there is no problem, you just read one extra line). This won't slow you down: compared to the "just fgets" versions you saw before, the number of fgets calls will be at least 50 times smaller, depending on the line sizes.
4. You concatenate the rest of the line you just read onto the end of the string.
5. You split the string by lines, getting a std::vector<std::string>.
6. You process the data in tokenized_byline (this is up to you: you will probably do another tokenization for each string in the vector, using "," as the separator, to get your required fields, the day and the other numbers you want; see the sketch below).
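
A sketch of step 6, reusing the same tokenizestring() helper with "," as the separator. Note that tokenizestring() as written drops whatever follows the last delimiter, so a trailing "," is appended to keep the final field:

vector<string> fields;
for ( size_t i = 0; i < tokenized_byline.size(); ++i )
{
    // append a trailing "," so the last field survives tokenizestring()
    tokenizestring( tokenized_byline[i] + ",", ",", fields );
    if ( fields.size() < 8 )
        continue;                            // skip short or empty lines
    // fields[0] is the day name, fields[1]..fields[7] are the numbers
    double n1 = atof( fields[1].c_str() );   // and so on for the other six
    cout << fields[0] << " " << n1 << "\n";
}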

>> tokenize the buffer line by line, and then use one line however you want

I must be honest and say that I only know in words how to do this. I have never tokenized anything; I have read comma-delimited files with ifstream before. After the code below, I am completely stuck on how the logic continues.
I do understand this code, and that during the first loop "mybigbuff" will hold the first 1024 characters of the file like a string/buffer that now needs to be tokenized (I don't know how to do this in code).
What should happen in the end for these comma-delimited values is that each ","-delimited value is push_backed into a vector. For example: "one", "two" and "three" are push_backed into the same vector, and so on for the other 4 values too, through the whole file.
When tokenizing the buffer, I should look for what comes after each ",".
Then I have to count by 5 to keep track of which value goes into which vector?

So I think my first question is how to tokenize a buffer like this, and how to keep track of which value goes into which vector.
(One problem that comes directly with this is the place where the first buffer just stops inside a value, so I will have a problem with the last value.)

one,1,2,3,4
two,5,6,7,8.4
three,9,10,11,12

std::vector<string> stringvalue;
std::vector<string> value1;
std::vector<string> value2;
std::vector<string> value3;
std::vector<string> value4;

char mybigbuff[1024];
while( fread(mybigbuff, 1024, 1, fp) )
{

}

:) Thank you a lot! I will check this code out carefully to see what I can do and understand!

int howmuchIread = 0;
while ( ( howmuchIread = fread ( buffer, 1, READSIZE, fp ) ) != 0 )
{
    buffer[ howmuchIread ] = 0; //place terminal character to avoid overruns

I hope you realize that the line above writes one byte beyond a buffer of READSIZE bytes whenever howmuchIread == READSIZE. It is only safe because the posted code declares the buffer with BUFFSIZE = READSIZE + 1; anyone who sizes the buffer as exactly READSIZE gets a buffer overflow.

Tokenizing means breaking your data up into tokens based on delimiters.

For instance, in your case:

one,1,2,3,4
two,5,6,7,8.4
three,9,10,11,12

If your data were tokenized based on the "," and "\n" delimiters, you would have the tokens "one", "1", "2", "3", "4", and so forth.

In your case, if you use fread to read a block of data, you basically have to process it until you reach the last valid token, and then tack the rest of the buffer onto the next buffer you read in.

For example, if your token were a whole line, and your buffer held
one,1,2,3,4
two,5,6

then you would process one,1,2,3,4 and prepend two,5,6 to the start of the next buffer. You should be careful about buffer overflows.
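
A sketch of that carry-over, assuming every line is shorter than the buffer (the path and buffer size are placeholders):

#include <cstdio>
#include <cstring>

int main()
{
    FILE* fp = fopen("C:\\test130mb.txt", "rb");
    if ( !fp )
        return 1;

    char buffer[4096 + 1];     // + 1 for the terminator
    size_t leftover = 0;       // bytes carried over from the previous read
    size_t n;
    while ( (n = fread(buffer + leftover, 1, sizeof buffer - 1 - leftover, fp)) > 0 )
    {
        buffer[leftover + n] = '\0';
        char* last = strrchr(buffer, '\n');        // end of the last complete line
        if ( last )
        {
            *last = '\0';
            // ...process the complete lines in buffer here...
            leftover = strlen(last + 1);           // length of the split line's head
            memmove(buffer, last + 1, leftover);   // carry it to the front
        }
        else
            leftover += n;   // no newline found: the line is longer than the buffer
    }
    // whatever is left in buffer[0..leftover-1] is the final line without a '\n'
    fclose(fp);
    return 0;
}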
