
>> At a saving of say 6 seconds a run, that's 4800 runs before you break even

But I will read thousands and thousands of these files every day for years, so every second is a huge win. The code that goes inside that loop is already over 300 pages, so some extra coding is okay for me :) The more complex code is perhaps harder to understand, though. As a first step, I get this error when running the code below:
Expression: nptr != NULL

double n1, n2, n3, n4, n5, n6, n7;
	
const char* fname = "C:\\Test130Mb.txt";
FILE* fp = fopen(fname, "rb");


    char mybigbuff[1024];
    while( fread(mybigbuff, 1024, 1, fp) )
    {
        char* p = strtok(text,",");
        p = strtok(NULL, ",");
        n1 = atol(p);
        p = strtok(NULL, ",");
        n2 = atol(p);
        p = strtok(NULL, ",");
        n3 = atol(p);
        p = strtok(NULL, ",");
        n4 = atol(p);
        p = strtok(NULL, ",");
        n5 = atol(p);
        p = strtok(NULL, ",");
        n6 = atol(p);
        p = strtok(NULL, ",");
        n7 = atol(p);
    }



  fclose(fp);
  MessageBox::Show("File has Reached End");

Whilst 80 to 8 seems like a good win with minimal effort, I'm not sure that 8 seconds down to 1 or 2 is of any further benefit, given the sudden jump in code complexity (time to write, time to debug, time to maintain).

If it takes you 8 hours to do that, that's 28800 seconds. At a saving of say 6 seconds a run, that's 4800 runs before you break even.

Jennifer84
 

> while( fread(mybigbuff, 1024, 1, fp) )
> {
> char* p = strtok(text,",");
You're not tokenising what you read.
You're not using the fread result to work out where the end of the buffer is
You're not appending a \0 to stop strtok() from going into the weeds (see previous posts)
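
A minimal sketch of those three points, not a drop-in fix for the code above. The helper name read_block_fields is made up for illustration, and it assumes every field is numeric. It asks fread for single bytes so the return value is a byte count, terminates the buffer at that count, and tokenises the buffer that was actually read. A block that ends in the middle of a value is not handled here.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical helper: reads one block of up to 1023 bytes from fp and
// returns every comma- or newline-separated field converted to double.
std::vector<double> read_block_fields(std::FILE* fp)
{
    std::vector<double> fields;
    char mybigbuff[1024];

    // Element size 1, so the return value is the number of bytes read.
    // One byte is kept free for the terminator.
    std::size_t got = std::fread(mybigbuff, 1, sizeof mybigbuff - 1, fp);
    mybigbuff[got] = '\0';   // stops strtok() at the end of the real data

    // Tokenise the buffer that was filled by fread, not some other string.
    for (char* p = std::strtok(mybigbuff, ",\r\n"); p != NULL;
         p = std::strtok(NULL, ",\r\n"))
    {
        fields.push_back(std::strtod(p, NULL));
    }
    return fields;
}
```

Because the loop stops as soon as strtok returns NULL, atol or strtod is never handed a null pointer, which is what the "nptr != NULL" assertion complains about.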

Salem
 

This is a new way for me to read a file so I have to first understand the basic steps.
I will assume a few things here to see if I understand the basics of this reading.

What happens first is that 1024 bytes are read into the buffer like a string. My question here: is this the string text?
If it is, the first step is to tokenise this large string. I know that each line in the file contains 8 different values, since 7 commas delimit them. Since there are 1024 bytes, there must be a lot of lines here, so I don't understand how this will be tokenised.
It must be a technique I don't know about.
I think I will just start out like this.

while( fread(mybigbuff, 1024, 1, fp) )
{
      char* p = strtok(text,",");
Jennifer84
 


Well, as Salem's post said, you don't tokenize what you read. The strtok call should be
char* p = strtok(mybigbuff,",");
because mybigbuff is where you read the file data into.

Now what I suggest you do:

First decompose the problem: my idea would be for you to first tokenize the buffer line by line, and then use each line however you want, thus simulating the fgets function.

kux
 

I have a text file tokenizer class which processes ~10 million tokens per second on a low-to-medium CPU (that's the raw speed without file loading, ~0.2 sec for a 50MB file). It loads the whole file into memory via fread, but it's possible to map the file or use another method. The file loading time depends on the disk system configuration (and file fragmentation, of course). For my test file (50MB, 2.5 million tokens) the loading time was ~2-3 seconds (~0.2 sec after the file was cached in memory by the OS).
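
The tokenizer class itself is not shown in this thread, so the following is only a rough sketch of the general idea (load everything with fread, then scan the data in memory), not the class described above. The names load_whole_file and count_tokens are invented for illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: reads everything from the current position of fp
// to the end of the file into one std::string, 64 KB at a time.
std::string load_whole_file(std::FILE* fp)
{
    std::string data;
    char chunk[65536];
    std::size_t got;
    while ((got = std::fread(chunk, 1, sizeof chunk, fp)) > 0)
        data.append(chunk, got);
    return data;
}

// Counts the tokens in data, where a token is a run of characters that
// contains none of the characters in delims. No token is copied.
std::size_t count_tokens(const std::string& data, const char* delims)
{
    std::size_t count = 0;
    std::string::size_type pos = data.find_first_not_of(delims);
    while (pos != std::string::npos)
    {
        ++count;
        std::string::size_type end = data.find_first_of(delims, pos);
        if (end == std::string::npos)
            break;                                   // last token reached
        pos = data.find_first_not_of(delims, end);   // skip the delimiters
    }
    return count;
}
```

With the file opened in "rb" mode, count_tokens(load_whole_file(fp), ",\r\n") gives the number of fields in a comma-delimited file. Whether this comes anywhere near the speeds quoted above depends on the machine and the data.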

If it seems interesting I can present sources...

ArkM
 
#include <stdio.h>
#include <stdlib.h>

#include <iostream>
#include <string>
#include <vector>

using namespace std;


void tokenizestring( const std::string& src, const std::string& delim, vector<string>&tokens )
{
	string::size_type start = 0;
	string::size_type end;
        
        tokens.clear();
	for ( ; ; )
	{

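		end = src.find ( delim, start );
		if ( end == string::npos )
		{
			//keep a final piece that has no trailing delimiter
			if ( start < src.size() )
				tokens.push_back ( src.substr ( start ) );
			break;
		}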
		tokens.push_back ( src.substr ( start, end - start ) );

		start = end + delim.size();
	}

}

int main ( int argc, char* argv[] )
{

	FILE *fp;
	if ( (fp = fopen("c:\\test.txt", "rb")) == NULL )
	{
		printf( "unable to open file " );
		exit( 2 );
	}

	const int READSIZE = 1024;
	const int BUFFSIZE = READSIZE + 1;

	char buffer[BUFFSIZE]; // + 1 to have place for appending
				           //a terminal character

	vector<string> tokenized_byline;

	int howmuchIread = 0;
	while ( ( howmuchIread = fread ( buffer, 1, READSIZE, fp ) )  != 0 )
	{
		buffer[ howmuchIread ] = 0; //place terminal character to avoid overruns
		string sBuffer = buffer;


		//the trick: call fgets to read the rest of the current line and
		//append it to sBuffer
		//if the file position indicator is already at the start of a line,
		//no problem, we just read one more whole line
		//(fgets returns NULL at end of file, so check it before appending)
		if ( !feof(fp) && fgets(buffer, READSIZE, fp) != NULL )
		{
			sBuffer += buffer ;
		}

		//tokenizestring splits on \r\n and stores the resulting
		//lines in the vector tokenized_byline
		tokenizestring( sBuffer, "\r\n", tokenized_byline);

		//now process the strings in tokenized_byline however you want

	}


	fclose( fp );


}

OK, you can take inspiration from this.

On my Athlon 2GHz it read a 50MB file in 3 secs.


The algorithm works like this:

1. You read 1024 octets from the file using fread.
2. You store the result in a std::string.
3. You then call fgets to read the rest of the line (most likely the file position indicator ends up inside a line; if not, there is no problem, you just read the next line too). This won't slow you down: compared to the "just fgets" versions you saw before, the number of fgets calls will be at least 50 times smaller, depending on the line sizes.
4. Concatenate the rest of the line you just read onto the end of the string.
5. You split the string by lines, getting a std::vector.
6. Process the data in tokenized_byline (this is up to you: you will probably do another tokenization for each string in the vector, using "," as separator, to get your required fields: the day and the other numbers you want, and so on).
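
For the last step, here is one possible sketch of the per-line tokenization. The Row struct and the function parse_line are hypothetical names, and the sketch assumes each line has a text field first followed by numeric fields, as in a line like "two,5,6,7,8.4".

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical record for one line such as "two,5,6,7,8.4".
struct Row
{
    std::string name;             // column 0, kept as text
    std::vector<double> values;   // columns 1..n, converted to double
};

// Splits one line on "," and uses the column index to decide where
// each field goes.
Row parse_line(const std::string& line)
{
    Row row;
    std::string::size_type start = 0;
    for (int column = 0; start <= line.size(); ++column)
    {
        std::string::size_type end = line.find(',', start);
        if (end == std::string::npos)
            end = line.size();                     // last field on the line
        std::string field = line.substr(start, end - start);
        if (column == 0)
            row.name = field;
        else
            row.values.push_back(std::strtod(field.c_str(), NULL));
        start = end + 1;                           // step over the comma
    }
    return row;
}
```

Calling parse_line on every string in tokenized_byline gives one Row per line. Keeping one vector per column instead would work the same way, using the column counter as the index.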

kux
 

>> tokenize the buffer line by line, and then use one line however u want

I must be honest and say that I only know in words how to do this. I have never tokenized anything. I have read comma-delimited files with ifstream before. After the code below, I am completely stuck on how the logic continues.
I do understand this code, and that during the first loop "mybigbuff" will hold the first 1024 characters of the file like a string/buffer that now needs to be tokenized (I don't know how to do this in code).
What should happen in the end for these comma-delimited values is that each ","-delimited value is pushed back into a vector. For example, one, two and three will be pushed back into the same vector, and so on for the other 4 values too, through the whole file.
When tokenizing the buffer I should look at what comes after each ",".
Then do I have to count up to 5 to keep track of which value goes into which vector?

So I think my first question is how to tokenize a buffer like this and how to keep track of which value goes into which vector.
(One problem that comes up immediately is the red part, where the first buffer stops in the middle of a value, so I will have a problem with the last value.)

one,1,2,3,4
two,5,6,7,8.4
three,9,10,11,12

std::vector<string> stringvalue;
std::vector<string> value1;
std::vector<string> value2;
std::vector<string> value3;
std::vector<string> value4;

char mybigbuff[1024];
while( fread(mybigbuff, 1024, 1, fp) )
{





}
Jennifer84
 

:) Thanks a lot! I will check this code out carefully to see what I can do and understand!

Jennifer84
 
> int howmuchIread = 0;
> while ( ( howmuchIread = fread ( buffer, 1, READSIZE, fp ) )  != 0 )
> {
> 	buffer[ howmuchIread ] = 0; //place terminal character to avoid overruns

I hope you realize that the last line of the code above will likely cause a buffer overflow. Let's say howmuchIread == READSIZE, which is the same as sizeof(buffer). Then buffer[howmuchIread] will be one byte beyond the end of the buffer.

Ancient Dragon
 

Tokenize means breaking up your data into tokens based on delimiters.

For instance in your case

one,1,2,3,4
two,5,6,7,8.4
three,9,10,11,12

If your data were to be tokenized based on the "," and "\n" delimiters, you would have the tokens "one" "1" "2" "3" "4" and so forth.

In your case if you use fread to read a block of data, you basically have to process it until you reach the last valid token, and then tag on the rest of the buffer to the next buffer you read in.

For example if your token were a whole line, and your buffer had
one,1,2,3,4
two,5,6

Then you would process one,1,2,3,4 and prepend two,5,6 to the start of the next buffer. You should be careful about buffer overflows.
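
One way to sketch this carry-over, assuming the token is a whole line. The class name LineAssembler is made up for illustration. It holds the unfinished tail in a std::string, which grows as needed, so the fixed-size buffer overflow concern does not arise for the carried-over part.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: collects blocks (for example from fread) and hands
// back only complete lines, keeping the unfinished tail for the next block.
class LineAssembler
{
public:
    // Appends one block and moves every complete line into 'lines'.
    void feed(const char* block, std::size_t length,
              std::vector<std::string>& lines)
    {
        pending_.append(block, length);   // the old tail ends up in front
        std::string::size_type start = 0;
        std::string::size_type nl;
        while ((nl = pending_.find('\n', start)) != std::string::npos)
        {
            std::string::size_type end = nl;
            if (end > start && pending_[end - 1] == '\r')
                --end;                    // drop the CR of a CRLF pair
            lines.push_back(pending_.substr(start, end - start));
            start = nl + 1;
        }
        pending_.erase(0, start);         // keep only the unfinished tail
    }

    // Whatever is left after the last block, e.g. a final line that has
    // no newline at the end.
    const std::string& remainder() const { return pending_; }

private:
    std::string pending_;
};
```

In a read loop you would call feed(buffer, howmuchIread, lines) after each fread, and after the loop treat a non-empty remainder() as the last line.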

stilllearning
 
> int howmuchIread = 0;
> while ( ( howmuchIread = fread ( buffer, 1, READSIZE, fp ) )  != 0 )
> {
> 	buffer[ howmuchIread ] = 0; //place terminal character to avoid overruns
>
> I hope you realize that the last line of the code above will likely cause a buffer overflow. Let's say howmuchIread == READSIZE, which is the same as sizeof(buffer). Then buffer[howmuchIread] will be one byte beyond the end of the buffer.

If you look closer at the code, you will see that buffer is declared as buffer[BUFFSIZE], and BUFFSIZE = READSIZE + 1, so... no overrun there :)

kux
 

If you are using MFC you can try something like:

CFile inFile("test.txt", CFile::modeRead);
CArchive archive(&inFile, CArchive::load, inFile.GetLength());
CString line;
while (archive.ReadString(line))
{
        // do something with <line> here
}
archive.Close();
inFile.Close();
jencas
 


Hmm, not sure; this looks like it would load the entire file into memory. For very large files it would be very memory-consuming, I think.

kux
 

Yes, but loading the whole file in a single step makes it very fast. I agree: for files of several hundred MB or more I wouldn't recommend this method either.

jencas
 
> I am reading Comma delimited Large .txt files (About 50 Mb).

And one would surmise that you are not simply copying the input to an output, that you are doing some sort of manipulations. This is probably very key to answering your question in full. Rather than micro-optimizing each particular function call, work on the overall algorithm. Or at least present an overview so that better answers for your overall effort may come as a result.

Dave Sinkula
 

You are completely right about that. I have a large amount of code that runs after I have read the values from the text file. I mainly use the std:: namespace to take substrings and to convert between text and numbers (stringstream, std::string::size_type, string1.length(), etc.).

I have experimented with this all day and found that the System:: namespace performs much better than the std:: namespace for the conversions and string searches that I do. So I have a big job ahead replacing all these operations, but it will improve performance and speed, which is very nice.

For example, the loop below runs in 0.9 sec, while the stringstream conversion that I use now takes 22 seconds.

double Number1;
	String^ Num = "4.34";

	for(int i = 0; i < 2000000; i++)
	{			
		Number1 = System::Convert::ToDouble(Num);
	}
	MessageBox::Show("Finish");
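
For anyone who wants to repeat this kind of comparison in standard C++ rather than C++/CLI, here is a rough timing sketch. It does not measure System::Convert. It compares a stringstream conversion with std::strtod, which is my own choice of candidate. It assumes a C++11 compiler for <chrono>, and all function names are invented. The timings will differ from machine to machine.

```cpp
#include <chrono>
#include <cstdlib>
#include <sstream>
#include <string>

// Conversion through a stream, built fresh for every call.
double convert_with_stringstream(const std::string& text)
{
    std::istringstream in(text);
    double value = 0.0;
    in >> value;
    return value;
}

// Conversion through the C library.
double convert_with_strtod(const std::string& text)
{
    return std::strtod(text.c_str(), NULL);
}

// Runs 'convert' on 'text' 'iterations' times and returns the elapsed
// time in milliseconds. The results are added to 'checksum' so that the
// compiler cannot drop the loop as unused work.
template <typename Convert>
double time_conversions(Convert convert, const std::string& text,
                        int iterations, double& checksum)
{
    std::chrono::steady_clock::time_point begin =
        std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        checksum += convert(text);
    std::chrono::steady_clock::time_point end =
        std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - begin).count();
}
```

Calling time_conversions(convert_with_stringstream, "4.34", 2000000, sum) and then the same with convert_with_strtod repeats the 2,000,000-iteration loop from the post for each candidate.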


Having discovered this better performance with System::, I am thinking of also testing reading a file with the System:: namespace. However, I have never done that before and am having trouble finding an example of how to do it.
I think I should use System::IO::StreamReader.

If I read the file the way I do now, with ifstream and getline, it looks like the example below. How would this example look with the System::IO namespace?
Still searching Google for this.

std::string Text;
double n1, n2;
char Comma;

ifstream ReadFile("C:\\File.txt");

while( getline(ReadFile, Text, ',') )
{
          ReadFile >> n1;
          ReadFile >> Comma;
          ReadFile >> n2;
          ReadFile.get();
}
Jennifer84
 
