C File handling - search within file without reading content?
Is there a way using file handling functions to search within a text file to find the position of certain text WITHOUT having to read each character/line and evaluating it?
For example, let's say I have a file of 'messages', where each message has distinct start and end characters. I would like to go through the file and locate the position of the start of each message. I would then index this (let's say with an STL map from each message # to its starting position in the file). Later on, I could ask for a particular message # by jumping to that position in the file (let's say with fseek) and reading just that portion of the file.
The reason I ask is that I am dealing with huge files (let's say over 5GB), and I don't want to read the entire file into a string to search for things. I just want to 'index' it, then read the portions of the file I need. What I'm seeing in all the C and C++ examples, though, is that if I want to go to a particular position in a file, I need to know that position in advance; nothing lets me search the way I can in memory (i.e. std::string::find("whatever")).
Is there a way using file handling functions to search within a text file to find the position of certain text WITHOUT having to read each character/line and evaluating it?
No, the only way to find something is to look at the contents.
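Looking at the contents does not mean holding the whole file in memory, though. Here is a minimal sketch, assuming a single-character start marker, that reads the file in fixed-size chunks and records the offset of every marker. The name `index_marker` and the 1 MB default chunk size are made up for illustration. For a file over 5GB, `std::streamoff` must be a 64-bit type, which depends on the platform and library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the byte offset of every occurrence of
// `marker` in the file. Only `chunk_size` bytes are in memory at a time.
std::vector<std::streamoff> index_marker(const std::string& path, char marker,
                                         std::size_t chunk_size = 1 << 20)
{
    assert(chunk_size > 0);
    std::vector<std::streamoff> offsets;
    // binary mode, so that offsets are raw byte positions on every platform
    std::ifstream file(path.c_str(), std::ios::binary);
    std::vector<char> chunk(chunk_size);
    std::streamoff base = 0;  // file offset of chunk[0]

    while (file) {
        file.read(&chunk[0], static_cast<std::streamsize>(chunk.size()));
        const std::size_t got = static_cast<std::size_t>(file.gcount());
        const char* p = &chunk[0];
        const char* const end = p + got;
        while (p != end) {
            p = static_cast<const char*>(
                std::memchr(p, marker, static_cast<std::size_t>(end - p)));
            if (!p) break;  // no more markers in this chunk
            offsets.push_back(base + (p - &chunk[0]));
            ++p;
        }
        base += static_cast<std::streamoff>(got);
    }
    return offsets;
}
```

Every byte is still examined once, but memory use stays at one chunk no matter how large the file is. A multi-character marker would need extra care where it straddles two chunks.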
The 3 Laws of the Procrastination Society:
1) Never do today that which can be put off until tomorrow
2) Tomorrow never comes
Join Date: Dec 2006
Posts: 1,089
Reputation:
Solved Threads: 164
For example, let's say I have a file of 'messages', where each message has distinct start and end characters. I would like to go through the file and locate the position of the start of each message. I would then index this (let's say with an STL map from each message # to its starting position in the file). Later on, I could ask for a particular message # by jumping to that position in the file (let's say with fseek) and reading just that portion of the file.
    //23456<890<2345678<0123<567890
    //45<789<1234567<90123456<890
    #include <iostream>
    #include <fstream>
    #include <vector>
    #include <iterator>
    #include <algorithm>
    using namespace std;

    int main()
    {
      vector<streampos> index ; // file positions at which a '<' was found

      // create the index (this program scans its own source file)
      {
        // a larger buffer can improve performance for very large files
        enum { BUFFER_SIZE = 1024*1024*256 } ;
        vector<char> large_buffer(BUFFER_SIZE) ;

        // the effect of pubsetbuf is implementation-defined; setting the
        // buffer before opening the file is the most portable order
        ifstream file ;
        file.rdbuf()->pubsetbuf( &large_buffer.front(), large_buffer.size() ) ;
        file.open(__FILE__) ;
        file >> noskipws ; // whitespace must be counted too

        istream_iterator<char> begin(file), end ;
        begin = find( begin, end, '<' ) ;
        while( begin != end )
        {
          // the stream is now one past the '<' that was just extracted
          index.push_back( file.tellg() + streamoff(-1) ) ;
          begin = find( ++begin, end, '<' ) ;
        }
      }

      // verify that index contains the right offsets
      copy( index.begin(), index.end(),
            ostream_iterator<streampos>(cout," ") ) ;
      cout << '\n' ;

      // seek to each indexed position and print the character found there
      ifstream file(__FILE__) ;
      for( vector<streampos>::size_type i = 0U ; i<index.size() ; ++i )
      {
        file.seekg( index[i] ) ;
        char c ;
        file.get(c) ;
        cout << c << ' ' ;
      }
      cout << '\n' ;
    }
g++ -std=c++98 -Wall ./create_index.cpp ; ./a.out
7 11 19 24 36 40 48 57 71 91 110 128 148 204 252 411 590 647 777 903 931 932 985 1018 1098 1099 1103 1104 1122 1123
< < < < < < < < < < < < < < < < < < < < < < < < < < < < < <
note: this output is from a freebsd system, with the source laid out as originally posted. on windows there would be two characters (not one) at the end of each line, so the offsets would differ.
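once the index exists, fetching one message is a seek followed by a bounded read. a minimal sketch, assuming the index holds byte offsets of the start markers and that every message ends with a known end character; `read_message` is a hypothetical helper, not part of the program above.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: returns message number `n`, from its start marker up
// to (but not including) `end_marker`. Returns an empty string if `n` is out
// of range or the file cannot be read.
std::string read_message(const std::string& path,
                         const std::vector<std::streamoff>& index,
                         std::size_t n, char end_marker)
{
    std::string message;
    if (n >= index.size()) return message;

    std::ifstream file(path.c_str(), std::ios::binary);
    file.seekg(index[n], std::ios::beg);      // jump straight to the message
    std::getline(file, message, end_marker);  // read only this message
    return message;
}
```

only the bytes of the requested message are read, so the cost of a lookup does not depend on the size of the file.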
Last edited by vijayan121; Apr 14th, 2007 at 4:03 am.
If this is on an MS-Windows machine, then you may have to use the win32 API file I/O functions directly on a file that large. See CreateFile() to open the file and ReadFile() to read its contents. Those functions can work with huge files.
Don't PM me with questions -- you might get a nasty PM in response. If you have a question then post it in one of the forums.
true, unless you are using a standard library implementation like the one from dinkumware, and even then only on a 64-bit architecture.
here is something you could try.
a. map chunks of the file (say 256 MB each) into memory.
   how you would do this depends on the platform:
   unix: use mmap (compile with -D_FILE_OFFSET_BITS=64 to make sure that off_t is a 64-bit value).
   linux: same as unix, but i think kernels prior to something like 2.6.10 are buggy with large files which are memory mapped.
   windows: the CreateFile/CreateFileMapping/MapViewOfFile triplet
b. wrap an stlsoft::basic_string_view<char> around the chunk that is mapped.
   eg. stlsoft::basic_string_view<char> str( static_cast<const char*>(address), nchars ) ;
   download stlsoft from http://www.synesis.com.au/software/stlsoft.
   for basic_string_view<> documentation, see:
   http://www.synesis.com.au/software/s...ing__view.html
   the stlsoft library is header-only; you need only #include the requisite files to access the functionality.
c. stlsoft::basic_string_view<> does not have the find family of member functions that std::string has, but it does provide polymorphic iterators. so, functions like find in the <algorithm> header can be used.
   eg. std::find( str.begin(), str.end(), '*' ) ;
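steps a and c can be sketched for the unix case as follows. this is a minimal sketch under assumptions, not a tuned implementation: it uses plain pointers as the iterators for std::find instead of stlsoft::basic_string_view, the name `index_marker_mmap` is made up, and the chunk size must be a multiple of the page size because mmap offsets have to be page-aligned. on 32-bit systems compile with -D_FILE_OFFSET_BITS=64 as noted above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: maps the file one chunk at a time and returns the
// offset of every occurrence of `marker`.
std::vector<off_t> index_marker_mmap(const char* path, char marker,
                                     std::size_t chunk_size)
{
    std::vector<off_t> offsets;
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    assert(chunk_size > 0 && chunk_size % page == 0);  // page-aligned chunks

    const int fd = open(path, O_RDONLY);
    if (fd == -1) return offsets;
    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return offsets; }

    for (off_t base = 0; base < st.st_size;
         base += static_cast<off_t>(chunk_size)) {
        // the last chunk may be shorter than chunk_size
        const std::size_t len = static_cast<std::size_t>(
            std::min<off_t>(static_cast<off_t>(chunk_size), st.st_size - base));
        void* address = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, base);
        if (address == MAP_FAILED) break;

        const char* first = static_cast<const char*>(address);
        const char* last = first + len;
        for (const char* p = std::find(first, last, marker); p != last;
             p = std::find(p + 1, last, marker))
            offsets.push_back(base + (p - first));

        munmap(address, len);  // release this chunk before mapping the next
    }
    close(fd);
    return offsets;
}
```

a single-character marker never straddles two chunks. searching for a longer string this way would need the chunks to overlap by the length of the string minus one.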
Last edited by vijayan121; Apr 14th, 2007 at 1:29 pm.






