Searching for text in huge (> 5GB) text files.
So here's what I'm trying to do. Basically, I have a huge text file of strings which are delimited by a delimiter. I would like to read each string separately based on the delimiter. Note that each chunk can span several lines, so I don't want to read it line by line. See this example:
THIS IS ONE STRING~DELIM~THIS IS ANOTHER~DELIM~THIS IS ANOTHER STRING THAT SPANS TWO LINES~DELIM~THIS IS A THIRD ONE THAT SPANS SEVERAL LINES AND THEREFORE MAKES IT HARDER TO SEPARATE FROM OTHER STRINGS ETC.~DELIM
Given that the file is 5GB, I can't read the entire contents of the file into memory. Is there a way to do C (or C++) file handling with fseek or equivalent that will put out each chunk? I would like to do this as efficiently as possible, so I don't want to read one line at a time and concatenate the strings whenever I don't see the delimiter, etc. Is there an easy way?
(edit - unfortunately, this text editor won't properly show the fact that some of the text is on multiple lines, etc - hopefully the point was conveyed based on the wording)
Last edited by winbatch; Jun 27th, 2006 at 6:42 pm.
I'm not really a c++ programmer but is there any reason why you aren't using a database? Perhaps a 5 gig file isn't the most efficient means of storing what you're trying to store - especially if you need the ability to do a linear search through it.
Originally Posted by cscgal
I'm not really a c++ programmer but is there any reason why you aren't using a database? Perhaps a 5 gig file isn't the most efficient means of storing what you're trying to store - especially if you need the ability to do a linear search through it.
Is one-character-at-a-time possible? What is the intended output?
Use fopen and code a simple while loop that does not end until the end of the file. Get a pointer to the beginning of the file, read the first chunk, and write it to a text file. After that, free the pointer and tell the program to skip one delimiter, the second time to skip two, and so on.
http://www.cplusplus.com/ref/iostrea...m/getline.html
getline allows you to specify a delimiter of your choice, if it's a single char.
After you open the file, you can read it using this method:
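    /* Needs <stdio.h>; input_file is the FILE* you opened earlier. */
    char stringg[1000];   /* as large as you need */
    int c;                /* int, not char, so EOF can be told apart from data */
    size_t j = 0;
    /* Stop at end of file, or when only room for the terminator is left. */
    while (j < sizeof stringg - 1 && (c = getc(input_file)) != EOF) {
        stringg[j++] = (char)c;   /* the delimiter is stored too */
        if (c == '~')             /* '~' stands in for your delimiter character */
            break;
    }
    stringg[j] = 0;       /* terminate the string */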
Your file pointer is now advanced to the beginning of the next string.
Repeat the same procedure for the next string.
You can put this in a subroutine, or move the string contents somewhere else and reuse the same string again.
If the file is pretty much static (doesn't change very often), you can create an index file that contains just the offsets into the master data file of the beginning of each record. Then, when you want to read the 50th string, read the offset in the 50th record of the index file and seek to that position in the master data file. Each record in the index file is a 64-bit integer, so it is easy to seek to the desired record in the index file when you already know the record number.