So here's what I'm trying to do. Basically, I have a huge text file of strings which are delimited by a delimiter. I would like to read each string separately based on the delimiter. Note that each chunk can span several lines, so I don't want to read it line by line. See this example:

THIS IS ONE STRING~DELIM~THIS IS ANOTHER~DELIM~THIS IS ANOTHER STRING THAT SPANS TWO LINES~DELIM~THIS IS A THIRD ONE THAT SPANS SEVERAL LINESAND THEREFORE MAKES IT HARDER TO SEPARATE FROM OTHER STRINGS ETC.~DELIM

Given that the file is 5GB, I can't read the entire contents of the file into memory. Is there a way to do C (or C++) file handling with fseek or equivalent that will put out each chunk? I would like to do this as efficiently as possible, so I don't want to read one line at a time and concatenate the strings whenever I don't see the delimiter, etc. Is there an easy way?

(edit - unfortunately, this text editor won't properly show the fact that some of the text is on multiple lines, etc - hopefully the point was conveyed based on the wording)

I'm not really a c++ programmer but is there any reason why you aren't using a database? Perhaps a 5 gig file isn't the most efficient means of storing what you're trying to store - especially if you need the ability to do a linear search through it.

A valid question. The reason is this is data that USED to be in the database and was 'archived' off.

Is one-character-at-a-time possible? What is the intended output?

Dave,

In the end that's what I pretty much ended up doing. Just thought there might be a faster/more efficient approach.

Use fopen and code a simple while loop that doesn't end until the end of the file. Get a pointer to the beginning of the file, read the first chunk into a buffer, and write that to a text file. After that, free the pointer and tell the program to skip one delimiter, then the second time to skip two, and so on.

Another option is to read a good sized chunk of the file (5K, 1M, 10M) and operate on the data while it's in memory. When you have 10% of the chunk left unprocessed, move it to the beginning of the buffer and read the next chunk.

Reading a buffer at a time is much faster than a character at a time.
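Here is a rough sketch of that buffer-at-a-time idea in C, in case it helps. The names `split_stream`, `find_bytes`, `record_fn`, and `tally` are made up for illustration, and it assumes any single record plus its delimiter fits in the buffer you pass in. Instead of waiting for "10% left", it simply moves whatever incomplete tail remains to the front of the buffer before each refill.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type: receives one record (not NUL-terminated). */
typedef void (*record_fn)(const char *rec, size_t len, void *ctx);

/* Example callback: adds each record's length to a size_t total. */
void tally(const char *rec, size_t len, void *ctx)
{
    (void)rec;
    *(size_t *)ctx += len;
}

/* Finds needle in the first hay_len bytes of hay; NULL if absent.
   (memmem is not standard C, so a small portable search is used.) */
static const char *find_bytes(const char *hay, size_t hay_len,
                              const char *needle, size_t n_len)
{
    for (size_t i = 0; i + n_len <= hay_len; i++)
        if (memcmp(hay + i, needle, n_len) == 0)
            return hay + i;
    return NULL;
}

/* Splits fp on delim using a caller-supplied buffer of buf_size bytes.
   Returns the number of records emitted, or -1 if one record (plus its
   delimiter) does not fit in the buffer. */
long split_stream(FILE *fp, const char *delim, char *buf, size_t buf_size,
                  record_fn emit, void *ctx)
{
    size_t dlen = strlen(delim);
    size_t have = 0;              /* bytes currently held in buf */
    long count = 0;

    assert(dlen > 0 && buf_size > dlen);
    for (;;) {
        size_t got = fread(buf + have, 1, buf_size - have, fp);
        have += got;

        /* Emit every complete record currently in the buffer. */
        size_t start = 0;
        const char *hit;
        while ((hit = find_bytes(buf + start, have - start, delim, dlen)) != NULL) {
            size_t len = (size_t)(hit - (buf + start));
            emit(buf + start, len, ctx);
            count++;
            start += len + dlen;
        }

        if (got == 0) {           /* EOF or read error: flush any tail */
            if (have > start) {
                emit(buf + start, have - start, ctx);
                count++;
            }
            return count;
        }
        if (start == 0 && have == buf_size)
            return -1;            /* buffer full, no delimiter found */

        /* Move the unprocessed tail to the front, then read more. */
        have -= start;
        memmove(buf, buf + start, have);
    }
}
```

For a 5GB file you would pass a buffer of a megabyte or more; the buffer only has to be larger than the longest record.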

You can read the file after you open it using this method:

char stringg[1000]; // as large as you need
int c;              // int rather than char, so EOF can be detected
size_t j = 0;
// check the bound first so no character is read and then dropped
while (j < sizeof stringg - 1 && (c = getc(input_file)) != EOF)
{
    stringg[j++] = (char)c;
    if (c == DELIM_CHAR) break; // DELIM_CHAR: the character value of your delimiter
}
stringg[j] = 0; // terminate the string

Your file pointer is now advanced to the beginning of the next string.
Repeat the same procedure for the next string.

You can put this in a subroutine, or move the string contents somewhere else and reuse the same string again.
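Since the delimiter in the original question looks like a multi-character one (~DELIM~), here is a hedged variation on the same getc loop that compares the tail of the buffer against the whole delimiter. `read_record` is a hypothetical name, and the sketch assumes the output buffer has room for one record plus the delimiter.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reads one record from fp into out (capacity cap), stopping at delim
   or at end of file. The delimiter is consumed but not stored.
   Returns the record length, -1 if nothing was left to read, or -2 if
   the record (plus delimiter) did not fit in out. */
long read_record(FILE *fp, const char *delim, char *out, size_t cap)
{
    size_t dlen = strlen(delim);
    size_t n = 0;
    int c;   /* int, so EOF is distinguishable from data bytes */

    assert(dlen > 0 && cap > dlen);
    for (;;) {
        if (n == cap - 1) {
            out[n] = '\0';
            return -2;                     /* out is full */
        }
        c = getc(fp);
        if (c == EOF)
            break;
        out[n++] = (char)c;
        /* Does what we have so far end with the delimiter? */
        if (n >= dlen && memcmp(out + n - dlen, delim, dlen) == 0) {
            n -= dlen;                     /* drop the delimiter */
            out[n] = '\0';
            return (long)n;
        }
    }
    out[n] = '\0';
    return n > 0 ? (long)n : -1;           /* last record, or nothing left */
}
```

Calling it in a loop until it returns a negative value walks the file one record at a time. getc is buffered by the C library, so this is usually not as slow as it looks.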

If the file is pretty much static (doesn't change very often), then you can create an index file that contains just the offsets into the master data file of the beginning of each record. Then when you want to read the 50th string, just read the offset in the 50th record of the index file and seek to that position in the master data file. Each record in the index file is a 64-bit integer, so it is easy to seek to the desired record in the index file when you already know the record number.
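A minimal sketch of that index-file approach, under some assumptions: `build_index` and `seek_to_record` are hypothetical names, record numbers are zero-based, offsets are written in the machine's native byte order, and the POSIX `fseeko` is used because plain `fseek` takes a `long`, which may be too small for a 5GB file on some platforms (on Windows you would need something like `_fseeki64` instead).

```c
#define _FILE_OFFSET_BITS 64      /* ask for a 64-bit off_t where needed */
#define _POSIX_C_SOURCE 200809L   /* for fseeko */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Scans data once and writes the starting offset of every record to idx
   as a 64-bit integer. Returns the record count, or -1 on write error. */
long build_index(FILE *data, FILE *idx, const char *delim)
{
    char window[64] = {0};        /* the last dlen bytes read */
    size_t dlen = strlen(delim), seen = 0;
    int64_t pos = 0, rec_start = 0;
    int start_pending = 1;        /* rec_start not yet written to idx */
    long count = 0;
    int c;

    assert(dlen > 0 && dlen <= sizeof window);
    while ((c = getc(data)) != EOF) {
        if (start_pending) {      /* a byte exists at rec_start */
            if (fwrite(&rec_start, sizeof rec_start, 1, idx) != 1)
                return -1;
            count++;
            start_pending = 0;
        }
        pos++;
        memmove(window, window + 1, dlen - 1);
        window[dlen - 1] = (char)c;
        if (++seen >= dlen && memcmp(window, delim, dlen) == 0) {
            rec_start = pos;      /* next record starts after the delimiter */
            start_pending = 1;
            seen = 0;             /* delimiters must not overlap */
        }
    }
    return count;
}

/* Positions data at the start of record n (zero-based) using idx.
   Returns 0 on success, -1 if n is out of range or a seek fails. */
int seek_to_record(FILE *data, FILE *idx, long n)
{
    int64_t off;

    if (n < 0 || fseeko(idx, (off_t)n * (off_t)sizeof off, SEEK_SET) != 0)
        return -1;
    if (fread(&off, sizeof off, 1, idx) != 1)
        return -1;
    return fseeko(data, (off_t)off, SEEK_SET) == 0 ? 0 : -1;
}
```

After `seek_to_record` succeeds, you read from the data file until the next delimiter as usual. Building the index still costs one full pass over the file, so this only pays off if you look records up more than once.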
