>> At a saving of say 6 seconds a run, that's 4800 runs before you break even
But I will read thousands and thousands of these files every day for years, so every second is a huge win. The code that goes inside that loop is now over 300 pages, so some extra coding is okay for me :) The code is perhaps more difficult to understand, but as a first step, I get this error when running the code:
Expression: nptr != NULL
double n1, n2, n3, n4, n5, n6, n7;
const char* fname = "C:\\Test130Mb.txt";
FILE* fp = fopen(fname, "rb");
char mybigbuff[1024];
while( fread(mybigbuff, 1024, 1, fp) )
{
char* p = strtok(text,",");
p = strtok(NULL, ",");
n1 = atol(p);
p = strtok(NULL, ",");
n2 = atol(p);
p = strtok(NULL, ",");
n3 = atol(p);
p = strtok(NULL, ",");
n4 = atol(p);
p = strtok(NULL, ",");
n5 = atol(p);
p = strtok(NULL, ",");
n6 = atol(p);
p = strtok(NULL, ",");
n7 = atol(p);
}
fclose(fp);
MessageBox::Show("File has Reached End");
Whilst 80 to 8 seems like a good win with minimal effort, I'm not sure that 8 seconds down to 1 or 2 is of any further benefit, given the sudden jump in code complexity (time to write, time to debug, time to maintain).
If it takes you 8 hours to do that, that's 28800 seconds. At a saving of say 6 seconds a run, that's 4800 runs before you break even.
> while( fread(mybigbuff, 1024, 1, fp) )
> {
> char* p = strtok(text,",");
You're not tokenising what you read.
You're not using the fread result to work out where the end of the buffer is
You're not appending a \0 to stop strtok() from going into the weeds (see previous posts)
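A minimal sketch of those three points, not a definitive fix. The helper name read_chunk_values is hypothetical, and the sketch assumes every token is numeric (strtod returns 0.0 for a text field such as a name). It also checks each strtok result before converting, since handing NULL to atol is what produces the "nptr != NULL" assertion. It does not yet deal with a value that is cut in half at the end of a chunk.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical helper: reads ONE chunk from fp and converts every
// comma/newline-separated token in it to a double.
// Returns false when nothing more could be read.
bool read_chunk_values(std::FILE* fp, std::vector<double>& out)
{
    char mybigbuff[1024 + 1];                 // one spare byte for the '\0'
    // Element size 1, count 1024: fread then returns the number of BYTES read.
    std::size_t got = std::fread(mybigbuff, 1, 1024, fp);
    if (got == 0)
        return false;
    mybigbuff[got] = '\0';                    // stop strtok at the end of the data

    // Tokenise the buffer that was actually read, not some other variable.
    for (char* p = std::strtok(mybigbuff, ",\r\n");
         p != NULL;                           // never hand NULL to atol/strtod
         p = std::strtok(NULL, ",\r\n"))
    {
        out.push_back(std::strtod(p, NULL));
    }
    return true;
}
```

Calling it in a loop, while ( read_chunk_values(fp, values) ) { }, would walk the whole file, with the caveat about split values mentioned above.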
This is a new way for me to read a file, so I first have to understand the basic steps.
I will assume a few things here to see if I understand the basics of this reading.
What happens first is that 1024 bytes are read into the buffer like a string. My question here: is this the string: text ?
If it is, the first step is to tokenise this large string. I know that each line in the file contains 8 different values, since 7 commas delimit them. Since there are 1024 bytes, there must be a lot of lines here, so I don't understand how this will be tokenised.
It must be a technique I don't know about.
I think I'll just start out like this.
while( fread(mybigbuff, 1024, 1, fp) )
{
char* p = strtok(text,",");
Well, as Salem's post said, you don't tokenize what you read. The strtok should be
char* p = strtok(mybigbuff,",");
because mybigbuff is where you read the file data into.
Now what I suggest you do:
First decompose the problem: my idea would be for you to first tokenize the buffer line by line, and then use each line however you want, thus simulating the fgets function.
I have a text file tokenizer class which processes ~10 million tokens per second on a low-to-medium CPU (that's the raw speed without file loading, ~0.2 sec for a 50MB file). It loads the whole file into memory via fread, but it's possible to map the file or use another method. The file loading time depends on the disk system configuration (and file fragmentation, of course). For my test file (50MB, 2.5 million tokens) loading time was ~2-3 seconds (~0.2 sec after the file was cached in memory by the OS).
If it seems interesting I can present sources...
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
void tokenizestring( const std::string& src, const std::string& delim, vector<string>&tokens )
{
string::size_type start = 0;
string::size_type end;
tokens.clear();
for ( ; ; )
{
end = src.find ( delim, start );
if ( end == string::npos )
break;
tokens.push_back ( src.substr ( start, end - start ) );
start = end + delim.size();
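}
//keep a final token that is not followed by a delimiter
if ( start < src.size() )
tokens.push_back ( src.substr ( start ) );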
}
int main ( int argc, char* argv[] )
{
FILE *fp;
if ( (fp = fopen("c:\\test.txt", "rb")) == NULL )
{
printf( "unable to open file " );
exit( 2 );
}
const int READSIZE = 1024;
const int BUFFSIZE = READSIZE + 1;
char buffer[BUFFSIZE]; // + 1 to have place for appending
//a terminal character
vector<string> tokenized_byline;
int howmuchIread = 0;
while ( ( howmuchIread = fread ( buffer, 1, READSIZE, fp ) ) != 0 )
{
buffer[ howmuchIread ] = 0; //place terminal character to avoid overruns
string sBuffer = buffer;
//the trick is I call a fgets to get the rest of the line and append
//it to the rest of sBuffer
//if the file position indicator is at a newline beginning, no problem,
//we just read another line
if ( !feof(fp) && fgets(buffer, READSIZE, fp) != NULL )
{
sBuffer += buffer ; //append only if fgets actually read something
}
//tokenizestring tokenizes using /r/n as delimiter, and stores the tokenized
//strings to the vector tokenized_byline
tokenizestring( sBuffer, "\r\n", tokenized_byline);
//now u process the strings in tokenized_byline however u want
}
fclose( fp );
}
OK, you can get inspired by this.
On my Athlon 2GHz it read a 50MB file in 3 secs.
The algorithm works like this:
1. You read 1024 octets from the file using fread.
2. You store the result in a std::string.
3. You then do an fgets to read the rest of the line (most likely you end up with the file position indicator inside a line; if not, there is no problem, you just read the next line too). This won't slow you down: compared to the "just fgets" versions you saw before, the number of fgets calls will be at least 50 times smaller, depending on the line sizes.
4. Concatenate the rest of the line you just read with the string.
5. You split the string by lines, getting a std::vector.
6. Process the data in tokenized_byline (this is up to you: you will probably do another tokenization for each string in the vector using "," as separator to get your required fields: the day and the other numbers you want, and so on).
>> tokenize the buffer line by line, and then use one line however u want
I must be honest and say that I only know in words how to do this. I have never tokenized anything. I have read comma-delimited ifstream files before. After the code below, I am completely stuck on how the logic continues.
I do understand this code, and that "mybigbuff" during the first loop will hold the first 1024 characters of the file like a string/buffer that now needs to be tokenized (I don't know how to do this in code).
What will happen in the end for these comma-delimited values is that each ","-delimited value will be push_backed into a vector. For example, one, two and three will be push_backed into the same vector, and so on for the other 4 values too, through the whole file.
When tokenizing the buffer I should look for what comes after each ",".
Then I have to keep count by 5 to keep track of which value goes into which vector?
So I think my first question is how to tokenize a buffer like this and how to keep track of which value goes into which vector.
(One problem that comes directly with this is the red part, where the first buffer just stops inside a value like this, so I will have a problem with the last value)
one,1,2,3,4
two,5,6,7,8.4
three,9,10,11,12
std::vector<string> stringvalue;
std::vector<string> value1;
std::vector<string> value2;
std::vector<string> value3;
std::vector<string> value4;
char mybigbuff[1024];
while( fread(mybigbuff, 1024, 1, fp) )
{
}
:) Thank you a lot! I will check this code out carefully to see what I can do and understand!
int howmuchIread = 0;
while ( ( howmuchIread = fread ( buffer, 1, READSIZE, fp ) ) != 0 )
{
buffer[ howmuchIread ] = 0; //place terminal character to avoid overruns
I hope you realize that the code in the last line above will likely cause a buffer overflow. Let's say howmuchIread == READSIZE, which is the same as sizeof(buffer). Then buffer[howmuchIread] will be one byte beyond the end of the buffer.
Tokenize means breaking up your data into tokens based on delimiters.
For instance in your case
one,1,2,3,4
two,5,6,7,8.4
three,9,10,11,12
If your data were to be tokenized based on the "," and "\n" delimiters, you would have the tokens "one" "1" "2" "3" "4" and so forth.
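As a rough illustration of how the position of a token within a line can decide which vector it lands in, here is a sketch for the five-field layout of the sample data above. The Columns struct and the add_line function are hypothetical names made up for this example, and the fixed count of 5 fields is an assumption taken from the sample lines.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical container: one vector per column of "name,v1,v2,v3,v4".
struct Columns
{
    std::vector<std::string> name;
    std::vector<double> value[4];
};

// Splits one line on ',' and appends each field to its column.
// Returns false (and stores nothing) if the line does not have 5 fields.
bool add_line(const std::string& line, Columns& cols)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;)
    {
        std::string::size_type end = line.find(',', start);
        if (end == std::string::npos)
        {
            fields.push_back(line.substr(start));   // last field has no comma after it
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    if (fields.size() != 5)
        return false;

    cols.name.push_back(fields[0]);
    for (int i = 0; i < 4; ++i)                     // field i+1 goes to column i
        cols.value[i].push_back(std::strtod(fields[i + 1].c_str(), NULL));
    return true;
}
```

Rejecting lines with the wrong field count is one way to notice a line that was cut off at the end of a buffer instead of silently storing half of it.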
In your case if you use fread to read a block of data, you basically have to process it until you reach the last valid token, and then tag on the rest of the buffer to the next buffer you read in.
For example if your token were a whole line, and your buffer had
one,1,2,3,4
two,5,6
Then you would process one,1,2,3,4 and prepend two,5,6 to the start of the next buffer. You should be careful about buffer overflows.
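The carry-over idea could be sketched roughly like this. The function name read_lines_chunked and the pending string are hypothetical, chunk_size is assumed to be greater than zero, and collecting all lines in one vector is only for illustration (for a large file you would probably process each line as soon as it is complete).

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: reads fp in fixed-size chunks and returns complete lines.
// An incomplete line at the end of a chunk is kept in `pending` and
// ends up in front of the data from the next chunk.
std::vector<std::string> read_lines_chunked(std::FILE* fp, std::size_t chunk_size)
{
    std::vector<std::string> lines;
    std::vector<char> buffer(chunk_size);       // assumes chunk_size > 0
    std::string pending;
    std::size_t got;

    while ((got = std::fread(&buffer[0], 1, buffer.size(), fp)) != 0)
    {
        pending.append(&buffer[0], got);        // append exactly the bytes read

        std::string::size_type start = 0;
        std::string::size_type nl;
        while ((nl = pending.find('\n', start)) != std::string::npos)
        {
            std::string::size_type len = nl - start;
            if (len > 0 && pending[nl - 1] == '\r')
                --len;                          // drop the '\r' of a "\r\n" ending
            lines.push_back(pending.substr(start, len));
            start = nl + 1;
        }
        pending.erase(0, start);                // keep only the unfinished tail
    }
    if (!pending.empty())
        lines.push_back(pending);               // last line had no trailing newline
    return lines;
}
```

Because the data is appended with an explicit length, no terminating '\0' is needed and there is no fixed-size array to overrun.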
If you look closer at the code, you will see that buffer is buffer[BUFFSIZE], and BUFFSIZE = READSIZE + 1, so... no overrun there :)
If you are using MFC you can try something like:
CFile inFile("test.txt", CFile::modeRead);
CArchive archive(&inFile, CArchive::load, inFile.GetLength());
CString line;
while (archive.ReadString(line))
{
// do something with <line> here
}
archive.Close();
inFile.Close();
Hmm, not sure, this looks like it would load the entire file into memory. For very large files I think it would consume a lot of memory.
Yes, but loading the whole file in a single step makes it very fast. I agree, though: for files of several hundred MB or more I wouldn't recommend this method either.
I am reading Comma delimited Large .txt files(About 50 Mb).
And one would surmise that you are not simply copying the input to an output, that you are doing some sort of manipulations. This is probably very key to answering your question in full. Rather than micro-optimizing each particular function call, work on the overall algorithm. Or at least present an overview so that better answers for your overall effort may come as a result.
You are completely right about that. I have a large amount of code that I use after I have read the values from the text file. I mainly use the std:: namespace to substring and convert text-number-text, like stringstream does, std::string::size_type, string1.length() etc...
I have experimented with this all day and found out that the System:: namespace has much better performance than the std:: namespace for the conversions and string searches that I do. So I will have a big job exchanging all these operations, and this will also improve performance and speed. Very nice, however.
For example, this loop runs in 0.9 sec, while the stringstream conversion that I use now takes 22 seconds.
double Number1;
String^ Num = "4.34";
for(int i = 0; i < 2000000; i++)
{
Number1 = System::Convert::ToDouble(Num);
}
MessageBox::Show("Finish");
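For comparison, a conversion that stays in standard C++ but avoids constructing a stringstream for every value could look like the sketch below. The helper name to_double is hypothetical, and no timing is claimed for it here; it would have to be measured against the loop above.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: converts text to double with strtod instead of
// building a stringstream per call.
// Returns false (leaving result untouched) if no number could be parsed.
bool to_double(const std::string& text, double& result)
{
    const char* begin = text.c_str();
    char* end = NULL;
    double value = std::strtod(begin, &end);
    if (end == begin)          // strtod consumed nothing: not a number
        return false;
    result = value;
    return true;
}
```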
As I have discovered this better performance with System::, I am thinking of also testing reading a file with the System:: namespace. However, I have never done that before and am having a bit of trouble finding an example of how to do it.
I think I should use: System::IO::StreamReader
If I read the file that I have now with ifstream and getline, it would look like the example below. How would this example look with the System::IO namespace?
Still searching Google for this.
std::string Text;
double n1, n2;
char Comma;
ifstream ReadFile("C:\\File.txt");
while( getline(ReadFile, Text, ',') )
{
ReadFile >> n1;
ReadFile >> Comma;
ReadFile >> n2;
ReadFile.get();
}