C File handling - search within file without reading content?

winbatch
Apr 13th, 2007
Is there a way using file handling functions to search within a text file to find the position of certain text WITHOUT having to read each character/line and evaluating it?

For example, let's say I have a file of 'messages', where each message has distinct start and end characters. I would like to go through the file and locate the position of each message start. I would then index this (let's say with an STL map of each message # to its starting position in the file). Later on, I could ask for a particular message # by jumping to that position in the file (let's say with fseek) and reading just that portion of the file.

The reason I ask is that I am dealing with huge files (let's say over 5GB), and I don't want to attempt to read the entire file into a string to search for things - I just want to 'index' it, then read the portions of the file I need. What I'm seeing in all the C and C++ examples, though, is that if I want to go to a particular position in a file, I need to know in advance what position to go to, rather than being able to search like I can in memory (i.e. std::string::find("whatever")).
Infarction
Apr 13th, 2007
I think you'll just have to search through it one buffer at a time. You'll certainly have to read it to see where the data is.
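A minimal sketch of that buffer-at-a-time scan, assuming the start of a message is a single marker character. The name find_marker_offsets and the 1 MB default buffer are made up for illustration; a multi-character marker would also need handling for matches that straddle two buffers, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <vector>

// Hypothetical helper: returns the byte offset of every `marker` in the file,
// reading `buffer_size` bytes at a time so the whole file is never in memory.
std::vector<long long> find_marker_offsets(const char* path, char marker,
                                           std::size_t buffer_size = 1 << 20)
{
    assert(buffer_size > 0);
    std::vector<long long> offsets;
    std::ifstream file(path, std::ios::binary);
    std::vector<char> buffer(buffer_size);
    long long base = 0; // file offset of buffer[0]

    while (file) {
        file.read(&buffer[0], static_cast<std::streamsize>(buffer.size()));
        const std::size_t got = static_cast<std::size_t>(file.gcount());
        if (got == 0) break;

        const char* p = &buffer[0];
        const char* const last = p + got;
        // memchr finds the next marker in the part of the chunk not yet scanned
        while ((p = static_cast<const char*>(
                    std::memchr(p, marker, static_cast<std::size_t>(last - p)))) != nullptr) {
            offsets.push_back(base + (p - &buffer[0]));
            ++p;
        }
        base += static_cast<long long>(got);
    }
    return offsets;
}
```

Calling find_marker_offsets("messages.txt", '<') would return every position of '<'; the file still gets read once from start to end, just never all at once.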
WaltP
Apr 14th, 2007
Originally posted by winbatch:
Is there a way using file handling functions to search within a text file to find the position of certain text WITHOUT having to read each character/line and evaluating it?
Is it possible to find a file in a file cabinet without opening the cabinet?

No, the only way to find something is to look at the contents.
vijayan121
Apr 14th, 2007
Originally posted by winbatch:
Is there a way using file handling functions to search within a text file to find the position of certain text WITHOUT having to read each character/line and evaluating it?
the short answer is no. but you may not have to do this by hand; let the standard library do the work.
For example, let's say I have a file of 'messages', where each message has distinct start and end characters. I would like to go through the file and locate the position of each message start. I would then index this (let's say with an STL map of each message # to its starting position in the file). Later on, I could ask for a particular message # by jumping to that position in the file (let's say with fseek) and reading just that portion of the file.
here is an example of how to do this:
//23456<890<2345678<0123<567890
//45<789<1234567<90123456<890
#include <iostream>
#include <fstream>
#include <vector>
#include <iterator>
#include <algorithm>
using namespace std;

int main()
{
    vector<streampos> index ; // file positions where we find a '<'

    // create index
    {
        enum { BUFFER_SIZE = 1024*1024*256 } ;   // a larger buffer can improve
        vector<char> large_buffer(BUFFER_SIZE) ; // performance for very large files
        ifstream file ;
        // set the buffer before opening the file; some implementations
        // ignore pubsetbuf on a stream that is already open
        file.rdbuf()->pubsetbuf( &large_buffer.front(), large_buffer.size() ) ;
        file.open( __FILE__, ios::binary ) ; // binary: positions are raw byte offsets
        file >> noskipws ; // whitespace must be read too, not skipped
        istream_iterator<char> begin(file), end ;
        begin = find( begin, end, '<' ) ;
        while( begin != end )
        {
            // the iterator has already consumed the '<', so step back one char
            index.push_back( file.tellg() + streamoff(-1) ) ;
            begin = find( ++begin, end, '<' ) ;
        }
    }

    // verify that index contains the right offsets
    copy( index.begin(), index.end(), ostream_iterator<streampos>(cout," ") ) ;
    cout << '\n' ;
    ifstream file( __FILE__, ios::binary ) ;
    for( vector<streampos>::size_type i = 0U ; i<index.size() ; ++i )
    {
        // jump straight to each indexed position and print the char found there
        file.seekg( index[i] ) ;
        char c ;
        file.get(c) ;
        cout << c << ' ' ;
    }
    cout << '\n' ;
}
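here is the output for the source file as originally posted (the offsets depend on its exact bytes, whitespace included):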
g++ -std=c++98 -Wall ./create_index.cpp ; ./a.out
7 11 19 24 36 40 48 57 71 91 110 128 148 204 252 411 590 647 777 903 931 932 985 1018 1098 1099 1103 1104 1122 1123
< < < < < < < < < < < < < < < < < < < < < < < < < < < < < <

note: this is on a freebsd system; on windows there would be two (not one) characters at end of line.
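Once an index like this exists, pulling out one whole message is a seek plus a bounded read. A rough sketch, assuming each message runs from its own start position up to the start of the next one; read_message is a made-up name, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads message number `n`, i.e. the bytes from starts[n]
// up to (not including) starts[n + 1], or up to end of file for the last one.
std::string read_message(std::ifstream& file,
                         const std::vector<std::streampos>& starts,
                         std::size_t n)
{
    assert(n < starts.size());
    file.clear();          // drop any eof/fail state left by an earlier read
    file.seekg(starts[n]); // jump straight to the indexed position

    std::string message;
    if (n + 1 < starts.size()) {
        // the distance to the next start gives the length of this message
        const std::streamoff length = starts[n + 1] - starts[n];
        message.resize(static_cast<std::size_t>(length));
        file.read(&message[0], length);
        message.resize(static_cast<std::size_t>(file.gcount()));
    } else {
        // last message: no next start, so read until end of file
        char c;
        while (file.get(c)) message.push_back(c);
    }
    return message;
}
```

Only the requested message is read from disk; the rest of the file is never touched. The stream should be opened with ios::binary so the stored positions are plain byte offsets.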
Last edited by vijayan121; Apr 14th, 2007 at 4:03 am.
winbatch
Apr 14th, 2007
vijayan121, I don't think ifstream works for files over 2GB, which mine is. (7GB)
Ancient Dragon
Apr 14th, 2007
Originally posted by winbatch:
vijayan121, I don't think ifstream works for files over 2GB, which mine is. (7GB)
If this is on an MS-Windows machine, then you may have to use Win32 API file I/O directly on a file that large. See CreateFile() to open the file, and ReadFile() to read its contents. Those functions can work with huge files.
vijayan121
Apr 14th, 2007
Originally posted by winbatch:
vijayan121, I don't think ifstream works for files over 2GB, which mine is. (7GB)
true, unless you are using a standard library implementation like the one from dinkumware, and even then only on a 64-bit architecture.

here is something you could try.

a. map chunks of the file (say 256 MB each) into memory.
how you would do this depends on the platform:
unix: use mmap (compile with -D_FILE_OFFSET_BITS=64 to make sure that off_t is a 64-bit value).
linux: same as unix, but i think kernels prior to something like 2.6.10 are buggy with large files which are memory mapped.
windows: the CreateFile/CreateFileMapping/MapViewOfFile triplet

b. wrap an stlsoft::basic_string_view<char> around the chunk that is mapped.
eg. stlsoft::basic_string_view<char> str( static_cast<const char*>(address), nchars ) ;

download stlsoft from http://www.synesis.com.au/software/stlsoft.
for basic_string_view<> documentation, see:
http://www.synesis.com.au/software/s...ing__view.html
stlsoft library is header-only; you need only #include the requisite files to access the functionality.

c. stlsoft::basic_string_view<> does not have the find family of member functions that std::string has, but it does provide polymorphic iterators, so functions like find in the <algorithm> header can be used.
eg. std::find( str.begin(), str.end(), '*' ) ;
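Steps a to c can be sketched roughly as follows for the unix/mmap case. To keep the sketch free of external dependencies it runs std::find directly over the mapped bytes instead of wrapping them in stlsoft::basic_string_view, which is a substitution and not what step b describes. The name index_with_mmap is hypothetical, and error handling is reduced to returning whatever was found so far.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper (POSIX only): maps the file one chunk at a time and
// records the offset of every `marker`. Build with -D_FILE_OFFSET_BITS=64 on
// 32-bit systems so that off_t is 64 bits wide.
std::vector<off_t> index_with_mmap(const char* path, char marker,
                                   std::size_t chunk_size = 256u * 1024 * 1024)
{
    assert(chunk_size > 0);
    std::vector<off_t> offsets;
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return offsets;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return offsets; }

    // mmap offsets must be multiples of the page size, so round the chunk up
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    chunk_size = (chunk_size + page - 1) / page * page;

    for (off_t base = 0; base < st.st_size; base += static_cast<off_t>(chunk_size)) {
        const off_t remaining = st.st_size - base;
        const std::size_t length = remaining < static_cast<off_t>(chunk_size)
                                       ? static_cast<std::size_t>(remaining)
                                       : chunk_size;
        void* address = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, base);
        if (address == MAP_FAILED) break;

        const char* const first = static_cast<const char*>(address);
        const char* const last = first + length;
        // search the mapped chunk as if it were an in-memory string
        for (const char* p = std::find(first, last, marker); p != last;
             p = std::find(p + 1, last, marker))
            offsets.push_back(base + (p - first));

        munmap(address, length);
    }
    close(fd);
    return offsets;
}
```

Only one chunk is mapped at any moment, so the address space needed stays at the chunk size no matter how large the file is. A single-character marker can never straddle two chunks; a longer search string would need extra handling at chunk boundaries.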
Last edited by vijayan121; Apr 14th, 2007 at 1:29 pm.

©2003 - 2009 DaniWeb® LLC