
Searching for text in huge (> 5GB) text files.

So here's what I'm trying to do. I have a huge text file of strings separated by a delimiter, and I would like to read each string separately. Note that each string can span several lines, so I don't want to read the file line by line. See this example:

THIS IS ONE STRING~DELIM~THIS IS ANOTHER~DELIM~THIS IS ANOTHER STRING THAT SPANS TWO LINES~DELIM~THIS IS A THIRD ONE THAT SPANS SEVERAL LINES AND THEREFORE MAKES IT HARDER TO SEPARATE FROM OTHER STRINGS ETC.~DELIM


Given that the file is 5GB, I can't read the entire contents of the file into memory. Is there a way to do C (or C++) file handling with fseek or equivalent that will put out each chunk? I would like to do this as efficiently as possible, so I don't want to read one line at a time and concatenate the strings whenever I don't see the delimiter. Is there an easy way?

(edit - unfortunately, this text editor won't properly show the fact that some of the text is on multiple lines, etc - hopefully the point was conveyed based on the wording)

winbatch
Posting Pro in Training
466 posts since Feb 2005
Reputation Points: 68
Solved Threads: 18
 

I'm not really a C++ programmer, but is there any reason why you aren't using a database? Perhaps a 5 GB file isn't the most efficient means of storing what you're trying to store, especially if you need the ability to do a linear search through it.

cscgal
The Queen of DaniWeb
Administrator
19,421 posts since Feb 2002
Reputation Points: 1,474
Solved Threads: 229
 

A valid question. The reason is this is data that USED to be in the database and was 'archived' off.

winbatch
Posting Pro in Training
466 posts since Feb 2005
Reputation Points: 68
Solved Threads: 18
 

Is one-character-at-a-time possible? What is the intended output?

Dave Sinkula
long time no c
Team Colleague
5,058 posts since Apr 2004
Reputation Points: 2,780
Solved Threads: 314
 

Dave,

In the end that's what I pretty much ended up doing. Just thought there might be a faster/more efficient approach.

winbatch
Posting Pro in Training
466 posts since Feb 2005
Reputation Points: 68
Solved Threads: 18
 

Use fopen and simply code a while loop that does not end until the end of the file. Next, get a pointer to the beginning of the file and read the first chunk. Now write that to a text file. After that, free the pointer and tell the program to skip one delimiter, the second time to skip two, and so on.

nytrokiss
Light Poster
47 posts since Jun 2006
Reputation Points: 10
Solved Threads: 5
 

Another option is to read a good sized chunk of the file (5K, 1M, 10M) and operate on the data while it's in memory. When you have 10% of the chunk left unprocessed, move it to the beginning of the buffer and read the next chunk.

Reading a buffer at a time is much faster than a character at a time.

WaltP
Posting Sage w/ dash of thyme
Moderator
10,492 posts since May 2006
Reputation Points: 3,348
Solved Threads: 943
 

http://www.cplusplus.com/ref/iostream/istream/getline.html
getline allows you to specify a delimiter of your choice, if it's a single char.

Salem
Posting Sage
Team Colleague
11,531 posts since Dec 2005
Reputation Points: 5,862
Solved Threads: 953
 

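You can read the file after you open it using this method:

#include <stdio.h>

char stringg[1000]; // as large as you need
size_t j = 0;
int c;              // int rather than char, so that EOF can be detected
while (j < sizeof stringg - 1 && (c = getc(input_file)) != EOF)
{
    stringg[j++] = (char)c;
    if (c == delim) break; // delim: the character value of your delimiter
}
stringg[j] = 0; // terminate the string (the delimiter, if found, is its last character)


Here input_file is the FILE* you opened and delim holds your delimiter character. Your file pointer is advanced to the beginning of the next string.
Repeat the same procedure for the next string.

You can put this in a subroutine, or move the string contents somewhere else and reuse the same string again.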

huffstat
Newbie Poster
19 posts since May 2006
Reputation Points: 10
Solved Threads: 0
 

If the file is pretty much static (doesn't change very often) then you can create an index file that contains just the offsets into the master data file of the beginning of each record. Then when you want to read the 50th string, just read the offset in the 50th record of the index file and seek to that position in the master data file. Each record in the index file is a 64-bit integer, so it is easy to seek to the desired record in the index file when you already know the record number.

Ancient Dragon
Retired & Loving It
Team Colleague
30,040 posts since Aug 2005
Reputation Points: 5,662
Solved Threads: 2,341
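
A rough sketch of the index-file idea, assuming the index stores raw 64-bit offsets in host byte order and that records are numbered from 0. The function names build_index and read_record and the file layout are assumptions made for illustration. The scan goes byte by byte through the stream's own buffer to keep the code short; it is not tuned for speed. Text after the last delimiter is not indexed.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Writes the byte offset of the start of every delimiter-terminated record in
// `data_path` to `index_path` as raw 64-bit integers. Returns the record count.
std::uint64_t build_index(const std::string& data_path,
                          const std::string& index_path,
                          const std::string& delim)
{
    std::ifstream data(data_path, std::ios::binary);
    std::ofstream index(index_path, std::ios::binary | std::ios::trunc);
    if (!data || !index || delim.empty()) return 0;

    std::uint64_t pos = 0;           // offset of the next byte to be read
    std::uint64_t record_start = 0;  // offset where the current record begins
    std::uint64_t count = 0;
    std::string window;              // the last delim.size() bytes seen
    char ch;
    while (data.get(ch)) {
        ++pos;
        window.push_back(ch);
        if (window.size() > delim.size()) window.erase(0, 1);
        if (window == delim) {
            index.write(reinterpret_cast<const char*>(&record_start),
                        sizeof record_start);
            ++count;
            record_start = pos;      // next record starts right after the delimiter
            window.clear();
        }
    }
    return count;
}

// Reads record number `n` (0-based) by looking up its offset in the index and
// seeking straight to it. Returns false if `n` is out of range.
bool read_record(const std::string& data_path,
                 const std::string& index_path,
                 std::uint64_t n,
                 const std::string& delim,
                 std::string& out)
{
    std::ifstream index(index_path, std::ios::binary);
    std::ifstream data(data_path, std::ios::binary);
    if (!index || !data || delim.empty()) return false;

    std::uint64_t offset = 0;
    index.seekg(static_cast<std::streamoff>(n * sizeof offset));
    if (!index.read(reinterpret_cast<char*>(&offset), sizeof offset)) return false;

    data.seekg(static_cast<std::streamoff>(offset));
    out.clear();
    char ch;
    while (data.get(ch)) {
        out.push_back(ch);
        if (out.size() >= delim.size() &&
            out.compare(out.size() - delim.size(), delim.size(), delim) == 0) {
            out.erase(out.size() - delim.size());  // drop the trailing delimiter
            return true;
        }
    }
    return false;  // reached end of file before finding a delimiter
}
```

Building the index still costs one full pass over the data file, so this pays off only when the same file is queried many times. Whether offsets beyond 2GB work depends on the platform's stream implementation using 64-bit file offsets, which is an assumption here.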
 
