Is there a way using file handling functions to search within a text file to find the position of certain text WITHOUT having to read each character/line and evaluating it?

For example, let's say I have a file of 'messages', where each message has distinct start and end characters. I would like to go through the file and locate the position of each message start. I would then index this (let's say with an STL map from each message # to its starting position in the file). Later on, I could ask for a particular message # by jumping to that position in the file (let's say with fseek) and reading just that portion of the file.

The reason I ask is that I am dealing with huge files (let's say over 5GB), and I don't want to read the entire file into a string to search for things. I just want to 'index' it, then read the portions of the file I need. What I'm seeing in all the C and C++ examples, though, is that if I want to go to a particular position in a file, I need to know that position in advance; I can't search for it the way I can in memory (i.e. std::string::find("whatever")).

I think you'll just have to search through it one bufferSize at a time... you'll certainly have to read it to see where the data is...
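
As a rough illustration of the "one buffer at a time" idea, here is a minimal sketch. The function name find_offsets, the 1 MB default chunk size, and the single-character marker are assumptions made for the example, not something from the original post. Only one chunk is in memory at any time, and the running byte offset is kept in a 64-bit counter rather than taken from tellg().

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Scan a file in fixed-size chunks and return the byte offset of every
// occurrence of `marker`. (Hypothetical helper, for illustration only.)
std::vector<unsigned long long> find_offsets(const std::string& path, char marker,
                                             std::size_t chunk_size = 1 << 20)
{
    assert(chunk_size > 0);
    std::vector<unsigned long long> offsets;
    std::ifstream file(path.c_str(), std::ios::binary);  // binary: offsets are raw bytes
    std::vector<char> buffer(chunk_size);
    unsigned long long base = 0;  // file offset of buffer[0]

    while (file) {
        file.read(&buffer[0], static_cast<std::streamsize>(buffer.size()));
        const std::size_t got = static_cast<std::size_t>(file.gcount());
        if (got == 0) break;

        const char* p = &buffer[0];
        const char* const end = p + got;
        // memchr finds the next marker in the remaining part of the chunk
        while ((p = static_cast<const char*>(
                    std::memchr(p, marker, static_cast<std::size_t>(end - p)))) != nullptr) {
            offsets.push_back(base + static_cast<unsigned long long>(p - &buffer[0]));
            ++p;
        }
        base += got;
    }
    return offsets;
}
```

A single-character marker can never straddle two chunks. For a multi-character marker, the tail of each chunk would have to be carried over into the next one.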

Is there a way using file handling functions to search within a text file to find the position of certain text WITHOUT having to read each character/line and evaluating it?

Is it possible to find a file in a file cabinet without opening the cabinet?

No, the only way to find something is to look at the contents.

Is there a way using file handling functions to search within a text file to find the position of certain text WITHOUT having to read each character/line and evaluating it?

the short answer is no, the file has to be read at least once. but you may not have to write the scanning loop yourself; let the standard library do the work.

For example, let's say I have a file of 'messages', where each message has distinct start and end characters. I would like to go through the file and locate the position of each message start. I would then index this (let's say with an STL map from each message # to its starting position in the file). Later on, I could ask for a particular message # by jumping to that position in the file (let's say with fseek) and reading just that portion of the file.

here is an example of how to do this:

//23456<890<2345678<0123<567890
//45<789<1234567<90123456<890
#include <iostream>
#include <fstream>
#include <vector>
#include <iterator>
#include <algorithm>
using namespace std;

int main()
{
   vector<streampos> index ; // filepos where we find a '<' 
   
   // create index
   {
    ifstream file(__FILE__) ; 
    enum { BUFFER_SIZE = 1024*1024*256 } ; // a larger buffer can improve
    vector<char> large_buffer(BUFFER_SIZE) ; //  performance for very large files
    file.rdbuf()->pubsetbuf( &large_buffer.front(), large_buffer.size() ); 
    file >> noskipws ; 
    istream_iterator<char> begin(file), end ;
    begin = find( begin, end, '<' ) ;
    while( begin != end )
    {
      index.push_back( file.tellg() + streamoff(-1) ) ;
      begin = find( ++begin, end, '<' ) ;
    }
   }
   
   // verify that index contains the right offsets
   copy( index.begin(), index.end(), ostream_iterator<streampos>(cout," ") ) ;
   cout << '\n' ;
   ifstream file(__FILE__) ; 
   for( vector<streampos>::size_type i = 0U ; i<index.size() ; ++i )
   { file.seekg( index[i] ) ; char c ; file.get(c) ; cout << c << ' ' ; }
   cout << '\n' ;
}

here is the output:
g++ -std=c++98 -Wall ./create_index.cpp ; ./a.out
7 11 19 24 36 40 48 57 71 91 110 128 148 204 252 411 590 647 777 903 931 932 985 1018 1098 1099 1103 1104 1122 1123
< < < < < < < < < < < < < < < < < < < < < < < < < < < < < <

note: this is on a freebsd system; on windows each line ends with two characters (CR LF) rather than one, so the offsets would be different.
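
To round out the second half of the question (jumping to a stored position and reading just one message), here is a small sketch. The function name read_message and the idea of a single end character are assumptions for illustration. It opens the file in binary mode so that stored offsets mean the same thing on every platform. Whether offsets beyond 2 GB work this way depends on how wide std::streamoff is in the standard library being used.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Read one message starting at byte offset `start`, up to and including the
// first `end_marker`. Assumes `start` came from an index built over the same
// file. (Hypothetical helper, for illustration only.)
std::string read_message(const std::string& path, std::streamoff start, char end_marker)
{
    assert(start >= 0);
    std::string message;
    std::ifstream file(path.c_str(), std::ios::binary);
    if (!file) return message;  // could not open: return an empty string

    file.seekg(start, std::ios::beg);
    std::getline(file, message, end_marker);  // extracts and discards end_marker
    if (file.good()) message += end_marker;   // marker was found, so put it back
    return message;
}
```

If the end character is never found, the function returns everything from the offset to the end of the file, without the marker appended.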

vijayan121, I don't think ifstream works for files over 2GB, which mine is. (7GB)

If this is on an MS-Windows machine, then you may have to use the Win32 API file I/O functions directly on a file that large. See CreateFile() to open the file, and ReadFile() to read its contents. Those functions can work with huge files.

vijayan121, I don't think ifstream works for files over 2GB, which mine is. (7GB)

true, unless you are using a standard library implementation like the one from dinkumware, and even then only on a 64-bit architecture.

here is something you could try.

a. map chunks of the file (say 256 MB each) into memory.
how you would do this depends on the platform:
unix: use mmap (compile with -D_FILE_OFFSET_BITS=64 to make sure that off_t is a 64-bit value).
linux: same as unix, but i think kernels prior to something like 2.6.10 are buggy with large files which are memory mapped.
windows: the CreateFile/CreateFileMapping/MapViewOfFile triplet

b. wrap an stlsoft::basic_string_view<char> around the chunk that is mapped.
eg. stlsoft::basic_string_view<char> str( static_cast<const char*>(address), nchars ) ;

download stlsoft from http://www.synesis.com.au/software/stlsoft.
for basic_string_view<> documentation, see:
http://www.synesis.com.au/software/stlsoft/doc-1.9/classstlsoft_1_1basic__string__view.html
stlsoft library is header-only; you need only #include the requisite files to access the functionality.

c. stlsoft::basic_string_view<> does not have the find family of member functions that std::string has, but it does provide polymorphic iterators. so, functions like find in the <algorithm> header can be used.
eg. std::find( str.begin(), str.end(), '*' ) ;
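
Steps a. to c. can be sketched roughly as follows for the unix case. This is only an illustration under some assumptions: the function name mmap_find_offsets and its parameters are made up, and plain const char* pointers are used as the iterators in place of stlsoft::basic_string_view so the example has no external dependency. The window size has to be a multiple of the page size, because mmap requires a page-aligned file offset.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the file one window at a time and record the offset of every `marker`.
// POSIX only. (Hypothetical helper, for illustration only.)
std::vector<off_t> mmap_find_offsets(const std::string& path, char marker, off_t window)
{
    std::vector<off_t> offsets;
    const long page_size = sysconf(_SC_PAGESIZE);
    assert(window > 0 && window % page_size == 0);  // keeps every mmap offset aligned

    const int fd = open(path.c_str(), O_RDONLY);
    if (fd == -1) return offsets;

    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return offsets; }

    for (off_t base = 0; base < st.st_size; base += window) {
        // the last window may be shorter than the others
        const size_t length = static_cast<size_t>(std::min<off_t>(window, st.st_size - base));
        void* address = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, base);
        if (address == MAP_FAILED) break;

        const char* const first = static_cast<const char*>(address);
        const char* const last = first + length;
        for (const char* p = std::find(first, last, marker); p != last;
             p = std::find(p + 1, last, marker)) {
            offsets.push_back(base + (p - first));
        }
        munmap(address, length);
    }
    close(fd);
    return offsets;
}
```

On a 32-bit build, the -D_FILE_OFFSET_BITS=64 flag mentioned in step a. is what makes off_t wide enough for offsets past 2 GB.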
