Searching for text in huge (> 5GB) text files.


winbatch

#1
Jun 27th, 2006
So here's what I'm trying to do. I have a huge text file of strings separated by a delimiter, and I would like to read each string separately based on that delimiter. Note that each chunk can span several lines, so I don't want to read the file line by line. See this example:
THIS IS ONE STRING~DELIM~THIS IS ANOTHER~DELIM~THIS IS ANOTHER STRING THAT SPANS TWO LINES~DELIM~THIS IS A THIRD ONE THAT SPANS SEVERAL LINESAND THEREFORE MAKES IT HARDER TO SEPARATE FROM OTHER STRINGS ETC.~DELIM

Given that the file is 5GB, I can't read its entire contents into memory. Is there a way to do C (or C++) file handling with fseek or an equivalent that will pull out each chunk? I would like to do this as efficiently as possible, so I don't want to read one line at a time and concatenate the strings whenever I don't see the delimiter. Is there an easy way?

(edit - unfortunately, this text editor won't properly show the fact that some of the text is on multiple lines, etc - hopefully the point was conveyed based on the wording)
Last edited by winbatch; Jun 27th, 2006 at 6:42 pm.

cscgal

#2
Jun 27th, 2006
I'm not really a c++ programmer but is there any reason why you aren't using a database? Perhaps a 5 gig file isn't the most efficient means of storing what you're trying to store - especially if you need the ability to do a linear search through it.

winbatch

#3
Jun 27th, 2006
Originally Posted by cscgal
I'm not really a c++ programmer but is there any reason why you aren't using a database? Perhaps a 5 gig file isn't the most efficient means of storing what you're trying to store - especially if you need the ability to do a linear search through it.
A valid question. The reason is this is data that USED to be in the database and was 'archived' off.

Dave Sinkula

#4
Jun 27th, 2006
Is one-character-at-a-time possible? What is the intended output?

winbatch

#5
Jun 27th, 2006
Originally Posted by Dave Sinkula
Is one-character-at-a-time possible? What is the intended output?
Dave,

In the end that's what I pretty much ended up doing. Just thought there might be a faster/more efficient approach.

nytrokiss

#6
Jun 28th, 2006
Use fopen and simply code a while loop that does not end until the end of the file. Get a pointer to the beginning of the file, read the first chunk into a buffer, and write it to a text file. After that, free the pointer and tell the program to skip one delimiter on the next pass, two on the pass after that, and so on.

WaltP

#7
Jun 28th, 2006
Another option is to read a good-sized chunk of the file (5K, 1M, 10M) and operate on the data while it's in memory. When about 10% of the chunk is left unprocessed, move that remainder to the beginning of the buffer and read the next chunk in after it.

Reading a buffer at a time is much faster than a character at a time.

Salem

#8
Jun 28th, 2006
http://www.cplusplus.com/ref/iostrea...m/getline.html
getline allows you to specify a delimiter of your choice, if it's a single char.

huffstat

#9
Jun 30th, 2006
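You can read the file after you open it using this method:

  /* needs <stdio.h>; input_file is a FILE* you have already opened */
  char stringg[1000];  /* as large as you need */
  size_t j = 0;
  int c;               /* int rather than char, so EOF can be detected */
  while ((c = getc(input_file)) != EOF)  /* parentheses needed: != binds tighter than = */
  {
      if (j < sizeof stringg - 1)  /* leave room for the terminator */
          stringg[j++] = (char)c;
      if (c == DELIMITER)  /* DELIMITER: the character value of your delimiter */
          break;
  }
  stringg[j] = '\0';  /* terminate the string */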

Your file pointer is now advanced to the beginning of the next string, so repeat the same procedure to read it.

You can put this in a subroutine, or move the string contents somewhere else and reuse the same buffer.

Ancient Dragon

#10
Jun 30th, 2006
If the file is pretty much static (doesn't change very often), then you can create an index file that contains just the offsets into the master data file of the beginning of each record. Then, when you want to read the 50th string, just read the offset in the 50th record of the index file and seek to that position in the master data file. Each record in the index file is a 64-bit integer, so it is easy to seek to the desired record in the index file when you already know the record number.
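
The index-file idea above could be sketched in C++ roughly as follows. The names `build_index` and `read_record` are hypothetical, record numbers are zero-based, the delimiter is assumed to be non-empty, and offsets are written in the machine's native byte order, so the index is only meant to be read back on the same kind of machine. Seeking beyond 2GB also assumes a platform where `std::streamoff` is 64 bits wide.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Scans the data file once and writes the byte offset of the start of every
// record to the index file as a raw 64-bit integer. Returns the record count.
std::uint64_t build_index(const std::string& data_path,
                          const std::string& index_path,
                          const std::string& delim)
{
    if (delim.empty()) return 0;  // assumption: a delimiter is required
    std::ifstream data(data_path, std::ios::binary);
    std::ofstream index(index_path, std::ios::binary);

    std::string window;           // the last delim.size() bytes of this record
    std::uint64_t pos = 0;        // offset of the byte about to be read
    std::uint64_t count = 0;
    bool at_record_start = true;
    char c;
    while (data.get(c)) {
        if (at_record_start) {
            index.write(reinterpret_cast<const char*>(&pos), sizeof pos);
            ++count;
            at_record_start = false;
        }
        ++pos;
        window.push_back(c);
        if (window.size() > delim.size()) window.erase(0, 1);
        if (window == delim) {    // the record just ended
            at_record_start = true;
            window.clear();
        }
    }
    return count;
}

// Looks up record `n` in the index, seeks to it in the data file, and reads
// up to (not including) the next delimiter. Returns false if `n` is out of range.
bool read_record(const std::string& data_path, const std::string& index_path,
                 std::uint64_t n, const std::string& delim, std::string& out)
{
    std::ifstream index(index_path, std::ios::binary);
    index.seekg(static_cast<std::streamoff>(n * sizeof(std::uint64_t)));
    std::uint64_t offset = 0;
    if (!index.read(reinterpret_cast<char*>(&offset), sizeof offset)) return false;

    std::ifstream data(data_path, std::ios::binary);
    data.seekg(static_cast<std::streamoff>(offset));
    out.clear();
    char c;
    while (data.get(c)) {
        out.push_back(c);
        if (out.size() >= delim.size() &&
            out.compare(out.size() - delim.size(), delim.size(), delim) == 0) {
            out.erase(out.size() - delim.size());  // drop the trailing delimiter
            break;
        }
    }
    return true;
}
```

The scan here reads one character at a time to keep the sketch short. Since the index is built once for a mostly static file, that cost is paid rarely, and each later lookup is two seeks plus one record read.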
©2003 - 2009 DaniWeb® LLC