Apr 13th, 2007

C File handling - search within file without reading content?

Is there a way using file handling functions to search within a text file to find the position of certain text WITHOUT having to read each character/line and evaluating it?

For example, let's say I have a file of 'messages', where each message has distinct start and end characters. I would like to go through the file and locate the position of each message start. I would then index this (let's say with an STL map from each message # to its starting position in the file). Later on, I could ask for a particular message # by jumping to that position in the file (let's say with fseek) and read just that portion of the file.

The reason I ask is that I am dealing with huge files (let's say over 5GB), and I don't want to read the entire file into a string to search for things. I just want to 'index' it, then read the portions of the file I need. What I'm seeing in all the C and C++ examples, though, is that to go to a particular position in a file I need to know that position in advance; there is nothing that lets me search the way I can in memory (e.g. std::string::find("whatever")).
Posted by winbatch

Apr 13th, 2007

Re: C File handling - search within file without reading content?

I think you'll just have to search through it one bufferSize at a time... you'll certainly have to read it to see where the data is...
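
As a rough sketch of the "one buffer at a time" idea above: the function below reads the file in fixed-size chunks and records the byte offset of every marker character, so only one chunk is in memory at any time. The name index_markers, the 1 MB default chunk size, and the single-character marker are assumptions made for illustration; they are not from the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical helper: return the byte offset of every `marker` in `path`.
// An unreadable file yields an empty index (no separate error reporting).
std::vector<std::uint64_t> index_markers(const char* path, char marker,
                                         std::size_t chunk_size = 1 << 20)
{
    assert(chunk_size > 0);
    std::vector<std::uint64_t> offsets;

    std::FILE* f = std::fopen(path, "rb");  // binary mode: offsets are raw byte positions
    if (!f) return offsets;

    std::vector<char> buf(chunk_size);
    std::uint64_t base = 0;                 // file offset of buf[0], kept in 64 bits
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0) {
        const char* p = buf.data();
        const char* end = p + n;
        // memchr scans the chunk for the next marker
        while ((p = static_cast<const char*>(std::memchr(p, marker, end - p))) != nullptr) {
            offsets.push_back(base + static_cast<std::uint64_t>(p - buf.data()));
            ++p;                            // continue after this marker
        }
        base += n;
    }
    std::fclose(f);
    return offsets;
}
```

Because the file is only read sequentially and the running offset is tracked by hand, no 64-bit seek is needed while building the index. A multi-character start sequence would need extra care for matches that straddle two chunks; this sketch does not handle that.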
Posted by Infarction

Apr 14th, 2007

Re: C File handling - search within file without reading content?

Quote originally posted by winbatch ...
Is there a way using file handling functions to search within a text file to find the position of certain text WITHOUT having to read each character/line and evaluating it?

Is it possible to find a file in a file cabinet without opening the cabinet?

No, the only way to find something is to look at the contents.
Posted by WaltP

Apr 14th, 2007

Re: C File handling - search within file without reading content?

Quote originally posted by winbatch ...
Is there a way using file handling functions to search within a text file to find the position of certain text WITHOUT having to read each character/line and evaluating it?

the short answer is no. but you may not have to do this yourself; let the standard library do the work.

Quote ...
For example, let's say I have a file of 'messages', where a message has a distinct start and end characters. I would like to go through the file and locate the positions of each start of a message. I would then index this (let's say with an stl map of each message # to its starting position in the file). Later on, I could then ask for a particular message # by jumping to that position in the file (let's say with fseek) and read just that portion of the file.

here is an example of how to do this:
//23456<890<2345678<0123<567890
//45<789<1234567<90123456<890
#include <iostream>
#include <fstream>
#include <vector>
#include <iterator>
#include <algorithm>
using namespace std;

int main()
{
    vector<streampos> index ; // filepos where we find a '<'

    // create index
    {
        enum { BUFFER_SIZE = 1024*1024*256 } ; // a larger buffer can improve
        vector<char> large_buffer(BUFFER_SIZE) ; // performance for very large files
        ifstream file ;
        // set the buffer before opening the file; some implementations
        // ignore pubsetbuf on a stream that is already open
        file.rdbuf()->pubsetbuf( &large_buffer.front(), large_buffer.size() );
        file.open(__FILE__) ;
        file >> noskipws ;
        istream_iterator<char> begin(file), end ;
        begin = find( begin, end, '<' ) ;
        while( begin != end )
        {
            // the iterator has already consumed the match, so step back one
            index.push_back( file.tellg() + streamoff(-1) ) ;
            begin = find( ++begin, end, '<' ) ;
        }
    }

    // verify that index contains the right offsets
    copy( index.begin(), index.end(), ostream_iterator<streampos>(cout," ") ) ;
    cout << '\n' ;
    ifstream file(__FILE__) ;
    for( vector<streampos>::size_type i = 0U ; i<index.size() ; ++i )
    { file.seekg( index[i] ) ; char c ; file.get(c) ; cout << c << ' ' ; }
    cout << '\n' ;
}
here is the output (the program indexes its own source file, so the exact offsets depend on that file's whitespace):
g++ -std=c++98 -Wall ./create_index.cpp ; ./a.out
7 11 19 24 36 40 48 57 71 91 110 128 148 204 252 411 590 647 777 903 931 932 985 1018 1098 1099 1103 1104 1122 1123
< < < < < < < < < < < < < < < < < < < < < < < < < < < < < <

note: this is on a freebsd system; on windows there would be two (not one) characters at end of line.
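
The end-of-line note matters once the index is used for seeking. A minimal sketch of that retrieval step follows, assuming the stream is opened with std::ios::binary so that recorded offsets are plain byte positions on every platform. The helper name read_message and the '>' end marker used in the usage note are hypothetical, not from the post.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <string>

// Hypothetical helper: seek to a previously indexed start offset and return
// everything up to, but not including, `end_marker`. The start marker itself
// is part of the result because the index points at it.
std::string read_message(std::istream& file, std::streampos start, char end_marker)
{
    assert(std::streamoff(start) >= 0);     // tellg() reports failure as -1
    std::string body;
    file.clear();                           // an earlier read may have left eofbit/failbit set
    file.seekg(start);
    std::getline(file, body, end_marker);   // stops at end_marker or at end of file
    return body;
}
```

With a file containing xx<hello>yy<world> and an index holding 2 and 11, read_message(in, 2, '>') would return "<hello". Only the requested message is read; the rest of the file is never loaded.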
Posted by vijayan121

Apr 14th, 2007

Re: C File handling - search within file without reading content?

vijayan121, I don't think ifstream works for files over 2GB, which mine is. (7GB)
Posted by winbatch

Apr 14th, 2007

Re: C File handling - search within file without reading content?

Quote originally posted by winbatch ...
vijayan121, I don't think ifstream works for files over 2GB, which mine is. (7GB)

If this is on an MS-Windows machine, then you may have to use the win32 api file i/o functions directly on a file that large. See CreateFile() to open the file, and ReadFile() to read its contents. All those functions can work with huge files.
Posted by Ancient Dragon

Apr 14th, 2007

Re: C File handling - search within file without reading content?

Quote originally posted by winbatch ...
vijayan121, I don't think ifstream works for files over 2GB, which mine is. (7GB)

true, unless you are using a standard library implementation like the one from dinkumware, and even then only on a 64-bit architecture.

here is something you could try.

a. map chunks of the file (say 256 MB each) into memory.
   how you would do this depends on the platform:
   unix: use mmap (compile with -D_FILE_OFFSET_BITS=64 to make sure that off_t is a 64-bit value).
   linux: same as unix, but i think kernels prior to something like 2.6.10 are buggy with large files which are memory mapped.
   windows: the CreateFile/CreateFileMapping/MapViewOfFile triplet

b. wrap an stlsoft::basic_string_view<char> around the chunk that is mapped.
eg. stlsoft::basic_string_view<char> str( static_cast<const char*>(address), nchars ) ;

download stlsoft from http://www.synesis.com.au/software/stlsoft.
for basic_string_view<> documentation, see:
http://www.synesis.com.au/software/s...ing__view.html
stlsoft library is header-only; you need only #include the requisite files to access the functionality.

c. stlsoft::basic_string_view<> does not have the find family of member functions that std::string has, but it does provide polymorphic iterators, so functions like find in the <algorithm> header can be used.
eg. std::find( str.begin(), str.end(), '*' ) ;
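
Steps a to c can be sketched for a POSIX system roughly as follows. To keep the example free of third-party headers, it runs std::find directly over the mapped bytes through plain const char* pointers instead of wrapping them in stlsoft::basic_string_view; that substitution, the function name index_markers_mmap, and the requirement that the chunk size be a multiple of the page size are assumptions of this sketch.

```cpp
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: map the file one chunk at a time and record the byte
// offset of every `marker`. On 32-bit systems, build with
// -D_FILE_OFFSET_BITS=64 (as noted above) so that off_t is 64 bits wide.
std::vector<std::uint64_t> index_markers_mmap(const char* path, char marker,
                                              std::uint64_t chunk_size)
{
    std::vector<std::uint64_t> offsets;
    const std::uint64_t page = static_cast<std::uint64_t>(sysconf(_SC_PAGESIZE));
    assert(chunk_size > 0 && chunk_size % page == 0);  // mmap offsets must be page-aligned

    int fd = open(path, O_RDONLY);
    if (fd < 0) return offsets;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return offsets; }
    const std::uint64_t size = static_cast<std::uint64_t>(st.st_size);

    for (std::uint64_t base = 0; base < size; base += chunk_size) {
        const std::uint64_t len = std::min(chunk_size, size - base);
        void* addr = mmap(nullptr, static_cast<std::size_t>(len), PROT_READ,
                          MAP_PRIVATE, fd, static_cast<off_t>(base));
        if (addr == MAP_FAILED) break;      // stop; offsets found so far are still valid

        const char* first = static_cast<const char*>(addr);
        const char* last = first + len;
        for (const char* p = std::find(first, last, marker); p != last;
             p = std::find(p + 1, last, marker))
            offsets.push_back(base + static_cast<std::uint64_t>(p - first));

        munmap(addr, static_cast<std::size_t>(len));   // release before mapping the next chunk
    }
    close(fd);
    return offsets;
}
```

A call such as index_markers_mmap("messages.dat", '<', 256u * 1024 * 1024) would follow the 256 MB suggestion; the file name here is made up. Only one chunk is mapped at a time, so address space use stays bounded regardless of file size.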
Posted by vijayan121
