Jul 27th, 2004

Searching in documents...

Hi,
I want to implement a search program that can search a given set of documents for a given string. I've heard that indexed searches are the best bet in these cases (if anyone knows of something better, I would appreciate hearing about it). If I have a set of index keywords for each document, plus all the keywords for the whole set of documents, then the search itself is quite simple. The problem is creating this index: how do I scan a document and list its keywords?

For example, in the sentence below:

The Velociraptor had a height of 6 feet.

Here, words like 'velociraptor', 'height' and '6 feet' would be keywords.
So far I have thought of simply removing filler words like 'the', 'a', etc.
Is there a better method? Otherwise, does anyone have any ideas on how I can refine the above method?
Thanks in advance.
Piyush
crazybitez
Jul 27th, 2004

Re: Searching in documents...

You could create a dictionary of ALL words. That would include 'the' 'had' 'a' and the like. If someone types in a search for "velociraptor run" that would exclude your sentence above because 'run' is not there, even though you might not consider 'run' a keyword.
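A dictionary of all words is usually stored as an inverted index: a map from each word to the set of documents containing it. The following is only a rough sketch of that idea, assuming C++11 and documents identified by integer ids; the names tokenize, add_document and search_all are made up for illustration and are not from any library.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

// Split text into lowercase words; any non-alphanumeric character is a separator.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current += static_cast<char>(std::tolower(ch));
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}

// Maps each word to the ids of the documents that contain it.
using InvertedIndex = std::map<std::string, std::set<int>>;

// Records every word of the document under the given (hypothetical) id.
void add_document(InvertedIndex& index, int doc_id, const std::string& text) {
    for (const std::string& word : tokenize(text)) {
        index[word].insert(doc_id);
    }
}

// Returns the ids of documents that contain every word of the query.
std::set<int> search_all(const InvertedIndex& index, const std::string& query) {
    std::set<int> result;
    bool first = true;
    for (const std::string& word : tokenize(query)) {
        auto it = index.find(word);
        if (it == index.end()) return std::set<int>();  // word appears nowhere
        if (first) {
            result = it->second;
            first = false;
        } else {
            std::set<int> kept;  // intersection of result and this word's documents
            for (int id : result) {
                if (it->second.count(id)) kept.insert(id);
            }
            result = kept;
        }
    }
    return result;
}
```

With this sketch, a query for "velociraptor run" returns only documents that contain both words, which is the exclusion behaviour described above.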

Alternatively you could make a list of common words you want to exclude, like "the", "a", "and", "had", "have", etc.
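The exclusion-list approach could look roughly like the sketch below. It assumes C++11; the list kStopWords is a tiny illustrative sample rather than a real stop-word list, and extract_keywords is a hypothetical helper name.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Illustrative sample only; a usable list would be much longer.
const std::set<std::string> kStopWords = {"the", "a", "an", "and", "had", "have", "of"};

// Returns the lowercase words of text, in order, skipping anything in kStopWords.
std::vector<std::string> extract_keywords(const std::string& text) {
    std::vector<std::string> keywords;
    std::string current;
    // The appended space flushes the final word through the same branch as the others.
    for (unsigned char ch : text + " ") {
        if (std::isalnum(ch)) {
            current += static_cast<char>(std::tolower(ch));
        } else if (!current.empty()) {
            if (kStopWords.count(current) == 0) keywords.push_back(current);
            current.clear();
        }
    }
    return keywords;
}
```

For the sentence in the first post this yields velociraptor, height, 6 and feet. Note that it treats "6" and "feet" as separate words; keeping phrases such as "6 feet" together would need extra logic that this sketch does not attempt.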
Chainsaw
Jul 27th, 2004

Re: Searching in documents...

See "Advanced string technique" in the C++ tutorial. By the way:
The C++ standard library provides a class called string. It's been used in several programs so far. To use it you need

#include <string>
using namespace std;

Note the (tiny) difference from the corresponding statement for C-style strings, where you write #include <string.h>. There is also the using namespace std; line: the components of the C++ standard library belong to a namespace called std, and the using statement lets code refer to them without writing std:: in front, as in std::string x.

You can create a string with

string s = "this is what I want";
You can navigate around a string with the array-indexing notation s[i], with the first character being s[0]. Despite appearances, C++ strings are neither arrays nor pointers! The fact that you can do this is due to a C++ feature known as operator overloading. Suffice it to say that the class is defined in such a way as to support the array-indexing operation. There is also a navigation device called an iterator, functionally like a pointer, which we won't be going into.
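As a small illustration of s[i], here is a sketch that reads and writes characters by index. The helper names count_char and replace_char are invented for this example, and it uses the size() method to bound the loop.

```cpp
#include <cstddef>
#include <string>

// Counts how often c occurs by reading each character through s[i].
std::size_t count_char(const std::string& s, char c) {
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (s[i] == c) ++count;
    }
    return count;
}

// Overwrites characters in place; s[i] yields a reference that can be assigned to.
void replace_char(std::string& s, char from, char to) {
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (s[i] == from) s[i] = to;
    }
}
```

Keep in mind that s[i] does no bounds checking, so the index must stay below s.size().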

string is a class, and a function or method m of the class is applied to an object x of the class with the x.m(...) syntax of object-oriented programming. Bear this in mind when reading the following tables of functions.

For finding the size of a string x and searching in it, you need:
  size_t x.size()                      size of string
  size_t x.find(char c)                search for the character c, start at beginning
  size_t x.find(string s)              search for string s, start at beginning
  size_t x.find(char c, size_t pos)    search for the character c, start at pos
  size_t x.find(string s, size_t pos)  search for s, start at pos
  string::npos                         value returned when search fails
The search functions return an unsigned integer of type string::size_type, often called an offset. If x.find(...) returns offset i, then x[i] is the character where the match begins. You can also pass an offset to specify where the search starts. Where the C-style search functions rely on the NULL pointer to signal failure, the C++ search functions return the cryptically named (presumably "no position"):

string::npos.
Here is an example of searching:
#include <iostream>
#include <string>
using namespace std;

int main() {
    string s = "aabbaa";

    /* to count how often a char occurs */
    int char_count = 0;
    string::size_type offset = 0;   /* unsigned, so it compares safely with npos */
    while ((offset = s.find('a', offset)) != string::npos) {
        char_count++;
        offset++;                   /* resume just past this match */
    }

    /* to count how often a string occurs (non-overlapping matches) */
    int string_count = 0;
    offset = 0;
    while ((offset = s.find("aa", offset)) != string::npos) {
        string_count++;
        offset = offset + 2;        /* skip the whole match */
    }

    cout << char_count << " " << string_count << endl;   /* prints "4 2" */
    return 0;
}
There are lots of string-creating and string-updating methods; the most frequently used are given in the table below. They can be used without any of the qualms or headaches of the corresponding operations on C-style strings. These methods generally take at least one and sometimes two integer arguments. The first is a start position o for the operation, and the second specifies the size l of the operation to be carried out.
  string(const char *s)                create string from char pointer
  const char * x.c_str()               make (read-only) char pointer from string
  string x.substr(size_t o)            return substring from o onwards
  string x.substr(size_t o, size_t l)  return substring starting at o, of length l
  x.erase(size_t o, size_t l)          remove l chars from o onwards
  x.insert(size_t o, string s)         insert s at o
  s1 + s2                              give string representing concatenation of s1 and s2
Note that in the above only erase and insert change the object that they are invoked on. substr returns a string object generated out of the unchanged string object it was invoked on.
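To make the difference concrete, here is a short sketch that combines find with erase, insert and substr. The helpers replace_first and first_word are invented for this example and are not part of the standard library.

```cpp
#include <string>

// Returns a copy of text with the first occurrence of `from` replaced by `to`.
// text is taken by value, so erase and insert modify only the local copy.
std::string replace_first(std::string text, const std::string& from, const std::string& to) {
    std::string::size_type pos = text.find(from);
    if (pos != std::string::npos) {
        text.erase(pos, from.size());  // remove the matched characters
        text.insert(pos, to);          // put the replacement where they were
    }
    return text;
}

// Returns the text before the first space, or the whole string if there is none.
std::string first_word(const std::string& text) {
    std::string::size_type space = text.find(' ');
    // If no space was found, space is npos and substr returns everything from 0 onwards.
    return text.substr(0, space);
}
```

For instance, replace_first("this is what I want", "what", "all") gives "this is all I want", while first_word leaves its argument untouched because substr builds a new string.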
I will give a substantial example of string processing in a later tutorial, after vector has been introduced.

A final important facility provided by string is a function to obtain a whole line of input from a stream. Where file is an fstream and line a string, you can say:


getline(file,line)
Recall that the >> mechanism stops at white space. This getline function reads everything up to a new-line character. It does not put the new-line character into the line but it does take it out of the stream. It can be treated as returning false when the end of the file is reached.
meabed
© 2011 DaniWeb® LLC