Jun 27th, 2006

Searching for text in huge (> 5GB) text files.

So here's what I'm trying to do. Basically, I have a huge text file of strings which are delimited by a delimiter. I would like to read each string separately based on the delimiter. Note that each chunk can span several lines, so I don't want to read it line by line. See this example:
  THIS IS ONE STRING~DELIM~THIS IS ANOTHER~DELIM~THIS IS ANOTHER STRING THAT SPANS TWO LINES~DELIM~THIS IS A THIRD ONE THAT SPANS SEVERAL LINES
  AND THEREFORE MAKES IT HARDER TO SEPARATE FROM OTHER STRINGS ETC.~DELIM

Given that the file is 5GB, I can't read the entire contents of the file into memory. Is there a way to do C (or C++) file handling with fseek or equivalent that will put out each chunk? I would like to do this as efficiently as possible, so I don't want to read one line at a time and concatenate the strings if I don't see the delimiter, etc. Is there an easy way?

(edit - unfortunately, this text editor won't properly show the fact that some of the text is on multiple lines, etc - hopefully the point was conveyed based on the wording)
Last edited by winbatch; Jun 27th, 2006 at 6:42 pm.
Posted by winbatch

Jun 27th, 2006

Re: Searching for text in huge (> 5GB) text files.

I'm not really a c++ programmer but is there any reason why you aren't using a database? Perhaps a 5 gig file isn't the most efficient means of storing what you're trying to store - especially if you need the ability to do a linear search through it.
Posted by cscgal

Jun 27th, 2006

Re: Searching for text in huge (> 5GB) text files.

Quote originally posted by cscgal ...
I'm not really a c++ programmer but is there any reason why you aren't using a database? Perhaps a 5 gig file isn't the most efficient means of storing what you're trying to store - especially if you need the ability to do a linear search through it.
A valid question. The reason is this is data that USED to be in the database and was 'archived' off.
Posted by winbatch

Jun 27th, 2006

Re: Searching for text in huge (> 5GB) text files.

Is one-character-at-a-time possible? What is the intended output?
Posted by Dave Sinkula

Jun 27th, 2006

Re: Searching for text in huge (> 5GB) text files.

Quote originally posted by Dave Sinkula ...
Is one-character-at-a-time possible? What is the intended output?
Dave,

In the end that's what I pretty much ended up doing. Just thought there might be a faster/more efficient approach.
Posted by winbatch

Jun 28th, 2006

Re: Searching for text in huge (> 5GB) text files.

Use fopen and simply code a while loop that does not end until the end of the file. Next, get a pointer to the beginning of the file, read the first contents in, and write that to a text file. After that, free the pointer and tell the program to skip one delimiter, then the second time to skip two, and so on.
Posted by nytrokiss

Jun 28th, 2006

Re: Searching for text in huge (> 5GB) text files.

Another option is to read a good sized chunk of the file (5K, 1M, 10M) and operate on the data while it's in memory. When you have 10% of the chunk left unprocessed, move it to the beginning of the buffer and read the next chunk.

Reading a buffer at a time is much faster than a character at a time.
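To make the chunked approach concrete, here is a minimal sketch, not a definitive implementation. It assumes the delimiter is the literal text `~DELIM~` from the example in the first post and that any single record fits in memory. The names `for_each_record`, `on_record` and `demo_records` are made up for illustration. Instead of moving the leftover bytes inside a fixed buffer, it keeps the unfinished tail in a `std::string` and appends the next chunk to it, which has the same effect.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: calls on_record for every delimiter-terminated record
// in `in`, reading at most `chunk_size` bytes per read.
void for_each_record(std::istream& in,
                     const std::string& delim,
                     std::size_t chunk_size,
                     const std::function<void(const std::string&)>& on_record)
{
    assert(!delim.empty() && chunk_size > 0);
    std::string pending;                  // unfinished record carried between reads
    std::vector<char> chunk(chunk_size);

    while (in) {
        in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0) break;

        // Resume searching slightly before the old tail's end, so a delimiter
        // split across two chunks is still found.
        std::size_t search_from =
            pending.size() > delim.size() ? pending.size() - delim.size() : 0;
        pending.append(chunk.data(), static_cast<std::size_t>(got));

        std::size_t start = 0;
        std::size_t pos;
        while ((pos = pending.find(delim, search_from)) != std::string::npos) {
            on_record(pending.substr(start, pos - start));
            start = pos + delim.size();
            search_from = start;
        }
        pending.erase(0, start);          // keep only the unfinished record
    }
    if (!pending.empty()) on_record(pending);  // last record without a trailing delimiter
}

// Usage sketch on an in-memory stream with a deliberately tiny chunk size.
std::vector<std::string> demo_records()
{
    std::istringstream in("A~DELIM~B\nC~DELIM~D");
    std::vector<std::string> out;
    for_each_record(in, "~DELIM~", 4,
                    [&out](const std::string& r) { out.push_back(r); });
    return out;
}
```

For the real file you would pass a `std::ifstream` opened with `std::ios::binary` and a chunk size in the range suggested above (for example 1M), and do the actual work inside the callback rather than storing every record.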
Posted by WaltP

Jun 28th, 2006

Re: Searching for text in huge (> 5GB) text files.

http://www.cplusplus.com/ref/iostrea...m/getline.html
getline allows you to specify a delimiter of your choice, if it's a single char.
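A short sketch of what that could look like, assuming the records are separated by a single `~` character (an assumption; the example in the first post uses a longer delimiter, which `getline` cannot match directly). The function name `demo_getline` and the sample text are made up for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Reads '~'-separated records one at a time; embedded newlines stay inside
// the record because '~', not '\n', ends each read.
std::vector<std::string> demo_getline()
{
    std::istringstream in("first~second\nstill second~third");
    std::vector<std::string> seen;
    std::string record;
    while (std::getline(in, record, '~')) {
        // On a 5GB file you would process the record here instead of storing it.
        seen.push_back(record);
    }
    assert(in.eof());  // the loop stopped because the input ran out
    return seen;
}
```

With a real file the `std::istringstream` would be replaced by a `std::ifstream`; only one record is held in memory at a time.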
Posted by Salem

Jun 30th, 2006

Re: Searching for text in huge (> 5GB) text files.

You can read the file after you open it using this method

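  // requires <stdio.h>; input_file is a FILE* already opened for reading
  char stringg[1000];   // as large as you need
  int c;                // int rather than char, so EOF can be told apart from data
  size_t j = 0;
  while (j < sizeof stringg - 1 && (c = getc(input_file)) != EOF)
  {
      stringg[j++] = (char)c;
      if (c == delimiter) break;   // delimiter: the character value of your delimiter
  }
  stringg[j] = 0; // terminate the string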

Your file pointer is now advanced to the beginning of the next string, so you can repeat the same procedure for the next string.

You can put this in a subroutine, or move the string contents somewhere else and reuse the same string again.
Posted by huffstat

Jun 30th, 2006

Re: Searching for text in huge (> 5GB) text files.

If the file is pretty much static (doesn't change very often) then you can create an index file that contains just the offsets into the master data file of the beginning of each record. Then when you want to read the 50th string, just read the offset in the 50th record of the index file, then seek to that position in the master data file. Each record in the index file is a 64-bit integer, so it is easy to seek to the desired record in the index file when you already know the record number.
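As a rough sketch of the index idea, under some assumptions: a single-character delimiter, offsets stored as raw `std::uint64_t` values in the machine's own byte order, and 0-based record numbers. The names `build_index`, `read_nth` and `demo_index_lookup` are hypothetical, and the demo uses in-memory streams in place of the master data file and the index file.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Writes the starting offset of every record in `data` to `index`
// as a raw 64-bit integer. Returns the number of records indexed.
std::uint64_t build_index(std::istream& data, std::ostream& index, char delim)
{
    std::uint64_t offset = 0;
    std::uint64_t count = 0;
    std::string record;
    while (std::getline(data, record, delim)) {
        index.write(reinterpret_cast<const char*>(&offset), sizeof offset);
        offset += record.size() + 1;   // +1 steps over the delimiter itself
        ++count;
    }
    return count;
}

// Reads record number `n` (0-based) by looking up its offset in the index.
bool read_nth(std::istream& data, std::istream& index,
              std::uint64_t n, char delim, std::string& out)
{
    std::uint64_t offset = 0;
    index.clear();                     // drop any EOF/fail state from earlier use
    index.seekg(static_cast<std::streamoff>(n * sizeof offset));
    if (!index.read(reinterpret_cast<char*>(&offset), sizeof offset))
        return false;                  // no such record in the index
    data.clear();
    data.seekg(static_cast<std::streamoff>(offset));
    return static_cast<bool>(std::getline(data, out, delim));
}

// Usage sketch: index three records, then fetch the second one directly.
std::string demo_index_lookup()
{
    std::istringstream data("alpha~beta\ngamma~delta");
    std::stringstream index;
    const std::uint64_t count = build_index(data, index, '~');
    assert(count == 3);
    std::string rec;
    read_nth(data, index, 1, '~', rec);
    return rec;
}
```

With real files, both streams should be opened with `std::ios::binary` so the stored offsets match the byte positions on disk. Building the index still takes one full pass over the data, which is why this only pays off when the file rarely changes.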
Posted by Ancient Dragon
