string search

Question

D33wakar 36 Posting Whiz in Training

12 Years Ago

I am trying to write a program that searches for strings (one at a time) in a text file.
To start things off, I wrote this one.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

int main()
   {
	   FILE *txt_file;
	   int chr;/*char read by fgetc*/
	   int word_match=0;
	   const char* substring ="window";/*search word*/
	   int i=0;
	   char word[15];/*to save the word read from the file)*/


	   if((txt_file=fopen("text.txt","r"))==NULL)
		   {
			   printf("Can't open the file\n");
			   exit(1);
			}

	   while((chr=fgetc(txt_file))!=EOF)
		   {
			   word[i]= chr;
			   i++;
			   if(isspace(chr)||ispunct(chr))
				   {
					   word[i-1]='\0';
					  /* printf("%s\n",word);/*testing! testing!*/
					   if(strcmp(word,substring)==0 )
						   {
							   word_match++;
							}
					   i=0;
					}

					if(isspace(word[i]))
						i=0;


			}


		fclose(txt_file);
		printf("Matched words:%d\n",word_match);
		return EXIT_SUCCESS;
	}

This program has got some limitations and is very inefficient too.
I'm looking for suggestions to make it a real world string searching program.

Note:Toggle plain text and copy it somewhere else If it's too hard to read due to indention.

c string-search

Adak 419 Nearly a Posting Virtuoso

12 Years Ago

Use fgets() to take in at least one line of text, at a time, into your buffer. Then use strstr() to look for the target word, in the buffer. You don't want to do this, char by char.

To make it more efficient still, search for string search in Wikipedia, and read up. The best way, depends on how long your target word is.
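As one illustration of the algorithms that reading leads to, here is a minimal sketch of Boyer-Moore-Horspool, which tends to pay off as the target word gets longer because it can skip ahead by up to the word's length. The function name `horspool_find` is made up for this example, and it is a sketch under the assumption of plain NUL-terminated byte strings, not a tuned implementation.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool substring search (hypothetical helper name).
 * Returns a pointer to the first occurrence of `pattern` in `text`,
 * or NULL if there is none. Like strstr(), it matches substrings,
 * not whole words. */
const char *horspool_find(const char *text, const char *pattern)
{
    size_t n = strlen(text);
    size_t m = strlen(pattern);
    size_t shift[UCHAR_MAX + 1];
    size_t i;

    if (m == 0)
        return text;
    if (m > n)
        return NULL;

    /* Default: a character not in the pattern lets us skip m positions. */
    for (i = 0; i <= UCHAR_MAX; ++i)
        shift[i] = m;

    /* Characters in the pattern (except its last one) allow a smaller skip. */
    for (i = 0; i + 1 < m; ++i)
        shift[(unsigned char)pattern[i]] = m - 1 - i;

    i = 0;
    while (i <= n - m) {
        if (memcmp(text + i, pattern, m) == 0)
            return text + i;
        /* Skip according to the text character under the pattern's last position. */
        i += shift[(unsigned char)text[i + m - 1]];
    }

    return NULL;
}
```

For short target words the skips are small, so the gain over strstr() may be negligible.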

WaltP 2,905 Posting Sage w/ dash of thyme

12 Years Ago

Note:Toggle plain text and copy it somewhere else If it's too hard to read due to indention.

Note2: Then don't post poorly indented code. Use 4 SPACEs instead of TABs and indent correctly.

Adak 419 Nearly a Posting Virtuoso

12 Years Ago

If the target word is found, you can do the rest of your code to handle the target word being found, and then simply break out of the searching loop.

If you want to find the target word 5 times, just break out when the counter == 5.

Seems quite straight forward.

Narue 5,707 Bad Cop

12 Years Ago

Well, obviously strstr() alone won't do the job since it performs an unconditional exact match. You'll need to customize the behavior a little bit to suit your definition of a word:

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

bool is_word(int ch)
{
    return !isspace((unsigned char)ch) && ch != '\0';
}

size_t next_word(const char *src, const char *match, size_t pos)
{
    size_t len = strlen(match);
    
    while (src[pos] != '\0') {
        // Find the next word boundary (stop at the terminator)
        while (src[pos] != '\0' && !is_word(src[pos]))
            ++pos;
        
        // Check for a complete word match: the word must also end here
        if (strncmp(&src[pos], match, len) == 0 && !is_word(src[pos + len]))
            return pos;
            
        // No match: skip the word
        while (is_word(src[pos]))
            ++pos;
    }
    
    return (size_t)-1;
}

int main(void)
{
    const char *s = "I ask for multitasks";
    const char *match = "ask";
    size_t pos = next_word(s, match, 0);
    
    while (pos != (size_t)-1) {
        printf("[%zu]: '%s'\n", pos, &s[pos]);
        pos = next_word(s, match, pos + strlen(match));
    }

    return 0;
}

Adak 419 Nearly a Posting Virtuoso

12 Years Ago

I believe Narue has graciously answered your questions/concerns. Your point is well taken, that a literal searcher won't give you everything you need, without more coding work.

It seems that to get just the "ask" (whole word), you need logic to handle " ask", except in the case where the "a" in "ask", is the first letter in the line of text. Then it needs to match "ask ", "ask, ", or "ask. ".

Note that:
draw
ing

will certainly require extra code, because it simply isn't a word, until it's re-assembled by your code.

Imo, when using fgets(), make it 1/3rd larger than your biggest line of text, unless you know just what the correct size should be. Memory isn't that precious, and truncating data is usually the last thing you want to do.

D33wakar 36 Posting Whiz in Training · Answer 1 · 2011-08-24T14:07:07+00:00

Using fgets.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>


int main()
   {
	   FILE *txt_file;
	   const char* substring ="in";/*search word*/
	   int word_match=0,l=strlen(substring);


	   char sentence[25];/*to save the line read from the file)*/


	   if((txt_file=fopen("text.txt","r"))==NULL)
		   {
			   printf("Can't open the file\n");
			   exit(1);
			}

	   while(fgets(sentence,sizeof(sentence),txt_file) )
		   {
			   /*printf("%s\n",sentence);/*testing! testing!*/
			   char* r_string=strstr(sentence,substring);


			   if(r_string!=NULL)
                              {
                                 word_match++;
                              }
                    }


	   fclose(txt_file);
           printf("Matched words:%d\n",word_match);
           return EXIT_SUCCESS;
	}

strstr matches any occurrence of the substring, even inside a longer word.
For example, here I searched for the word 'in' and it matched every line containing the characters 'in'.
Though this may be necessary in some cases (searching DNA sequences, for example),
it is a little bit annoying when searching for a "word" in a text.

D33wakar 36 Posting Whiz in Training · Answer 2 · 2011-08-24T18:29:39+00:00

If the target word is found, you can do the rest of your code to handle the target word being found, and then simply break out of the searching loop.
If you want to find the target word 5 times, just break out when the counter == 5.
Seems quite straight forward.

No, that's not what I meant.
I need the program to find a "word", that is, a string between spaces or punctuation marks. For example (matches marked with brackets):
"I [ask] for multitasks"--> what I want.
"I [ask] for multit[ask]s"--> what the program does.

There's another problem using fgets.

while(fgets(sentence,25,txt_file) )

The above code reads at most 24 chars into the array, and "sentence" is not guaranteed to end with a space or newline as its last character.
For example:

writing the code for dra
wing shapes in the
window.

If the search word was "drawing", it would return "Not Found"

Narue 5,707 Bad Cop Team Colleague · Answer 3 · 2011-08-24T22:58:02+00:00

I believe Narue has graciously answered your questions/concerns.

Not completely, as my example doesn't exactly match the OP's definition of a word or handle words split across two strings. But it's a very strong base to start from.

Imo, when using fgets(), make it 1/3rd larger than your biggest line of text

Or put in a little more effort to accept arbitrary length strings.

Memory isn't that precious

Memory is rarely so precious that you can sacrifice clarity for memory savings. But wasting memory for a false sense of correctness is generally a bad idea. If you're not working with a fixed format, assumptions about line length are a point of failure.
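A minimal sketch of what accepting arbitrary length lines could look like, using only fgets(), malloc() and realloc(). The name `read_line` and the starting capacity of 64 are choices made for this example, not anything from the posts above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one line of any length from `fp` (hypothetical helper name).
 * Returns a malloc'd string without the trailing newline, or NULL at
 * end of file or on allocation failure. The caller must free() it. */
char *read_line(FILE *fp)
{
    size_t cap = 64;
    size_t len = 0;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;

    while (fgets(buf + len, (int)(cap - len), fp) != NULL) {
        len += strlen(buf + len);

        if (len > 0 && buf[len - 1] == '\n') {
            buf[len - 1] = '\0';    /* complete line: drop the newline */
            return buf;
        }

        if (cap - len < 2) {        /* buffer full: double it and keep reading */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }

    if (len == 0) {                 /* nothing read: end of file or error */
        free(buf);
        return NULL;
    }

    return buf;                     /* last line had no trailing newline */
}
```

With this, a loop such as `while ((line = read_line(txt_file)) != NULL) { ...; free(line); }` never sees a line cut in the middle of a word, whatever the line length.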

Adak 419 Nearly a Posting Virtuoso · Answer 4 · 2011-08-24T23:14:46+00:00

Very true, Narue.

The OP said he was searching in a "text file", which I took to mean standard lines and having complete words on each line.

If he has hyphenated words, or words that are arbitrarily split across two lines of text, that would require more effort.

D33wakar 36 Posting Whiz in Training · Answer 5 · 2011-08-25T17:50:59+00:00

Then it's better not to use fgets at all.
I think it's better to read char by char than to go through all the hurdles
of using fgets.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

int main()
{
	FILE *txt_file;
	int chr;/*char read by fgetc*/
	int word_match=0;
	const char* substring ="window";/*search word*/
	int i=0;
	char word[15];/*to save the word read from the file)*/

	if((txt_file=fopen("text.txt","r"))==NULL)
		{
		printf("Can't open the file\n");
		exit(1);
		}

	while((chr=fgetc(txt_file))!=EOF )
		{
		word[i]= chr;
		i++;

		if(isspace(chr)||ispunct(chr))
			{
			word[i-1]='\0';
			 printf("%s\n",word);/*testing! testing!*/
			if(strcmp(word,substring)==0 )
				{
				word_match++;
				}
			i=0;
			}
		if(isspace(word[i]))
			i=0;
		}

	fclose(txt_file);
	printf("Matched words:%d\n",word_match);
	return EXIT_SUCCESS;
}

It works fine for my requirements (up until now).

WaltP 2,905 Posting Sage w/ dash of thyme Team Colleague · Answer 6 · 2011-08-26T03:18:13+00:00

Then it's better not to use fgets at all.

Why?

I think it's better to read char by char than to go through all the hurdles
of using fgets.

What hurdles?

What is the difference between reading character by character

while((chr=fgetc(txt_file))!=EOF )
    {
        word[i]= chr;
        i++;
        ...
    }

and reading a line at a time

fgets(buf, len, txt_file);
idx = 0;
while(buf[idx])
{
    word[i]= buf[idx];
    i++;
    ...
    idx++;
}

The fgets() code should be faster and conceptually no different.

D33wakar 36 Posting Whiz in Training · Answer 7 · 2011-08-26T10:46:37+00:00

Why?

Yeah, for that see previous posts in this thread.

What hurdles?

uh.. same answer.

maninaction -9 Junior Poster in Training · Answer 8 · 2011-08-26T10:58:32+00:00

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

int main()
   {
	   FILE *txt_file;
	   int chr;/*char read by fgetc*/
	   int word_match=0;
	   const char* substring ="window";/*search word*/
	   int i=0;
	   char word[15];/*to save the word read from the file)*/


	   if((txt_file=fopen("text.txt","r"))==NULL)
		   {
			   printf("Can't open the file\n");
			   exit(1);
			}

	   while((chr=fgetc(txt_file))!=EOF)
		   {
			   word[i]= chr;
			   i++;
			   if(isspace(chr)||ispunct(chr))
				   {
					   word[i-1]='\0';
					  /* printf("%s\n",word);/*testing! testing!*/
					   if(strcmp(word,substring)==0 )
						   {
							   word_match++;
							}
					   i=0;
					}

					if(isspace(word[i]))
						i=0;


			}


		fclose(txt_file);
		printf("Matched words:%d\n",word_match);
		return EXIT_SUCCESS;
	}

Use this code with more string functions, or combine it with well-known techniques such as string-matching algorithms.

Adak 419 Nearly a Posting Virtuoso · Answer 9 · 2011-08-26T13:08:29+00:00

I don't mean to complain, but after 41 posts, maninaction, you should be putting your code between code tags on the forum, every time. Very few of us are going to spend time looking over code that looks like html.

Use fgets(), or don't, it's your choice. There's more than one way to do this, clearly.

Did you try Narue's code? Did you try maninaction's code?

I just thought of a simple way that might appeal to you. Say the word being searched for is "draw".

open the file, and set int match = 0.

Until the end of the file is reached, loop looking for a 'd' char. When you find one, match++, and check the next char. If it's an 'r', then match++, and continue looping. Break out of the loop when you reach a space or a punctuation character (use ispunct()), or exceed the strlen() of the target word.

If you reach a newline, and the match variable equals the strlen() of the target word, then you've found an instance of the word in the text. If the match variable is less than the strlen() of the target word, then go to the next line of text, and continue checking whether the chars match the next letter in the target word.

The first letter in the text that doesn't match the target word (except the above case), causes the program to break out of the loop.

In any case, if the program has finished the loop, and the match equals the strlen() of the target word, then you have a match. If match doesn't equal the target word string length, then you have no match.

It's not the fastest string searcher, but it is perhaps the easiest, and with today's systems, should be plenty fast enough.

If the files aren't HUGE, I would use fseek() and ftell() to get the correct length of the file, malloc the memory, and work with it all from memory - especially if you want to search for more than one target word.

So far, I don't see any "hurdles" for this method. What do you see?
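The fseek()/ftell() idea for files that aren't huge could be sketched as follows. The name `load_file` is invented for this example. It assumes the stream is seekable and that ftell() reports a usable byte count, which is common but not something the C standard promises for every stream.

```c
#include <stdio.h>
#include <stdlib.h>

/* Load the whole of an open file into a NUL-terminated buffer
 * (hypothetical helper name). Returns a malloc'd buffer the caller
 * must free(), or NULL on failure. If `out_len` is not NULL it
 * receives the number of bytes actually read. */
char *load_file(FILE *fp, size_t *out_len)
{
    long size;
    size_t got;
    char *buf;

    if (fseek(fp, 0L, SEEK_END) != 0)
        return NULL;
    size = ftell(fp);               /* file length, measured from the start */
    if (size < 0)
        return NULL;
    rewind(fp);

    buf = malloc((size_t)size + 1); /* +1 for the terminating '\0' */
    if (buf == NULL)
        return NULL;

    /* In text mode fread() may return fewer bytes than ftell() reported. */
    got = fread(buf, 1, (size_t)size, fp);
    buf[got] = '\0';

    if (out_len != NULL)
        *out_len = got;
    return buf;
}
```

Once the text is in memory, searching for several target words only costs one pass over the file on disk.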

WaltP 2,905 Posting Sage w/ dash of thyme Team Colleague · Answer 10 · 2011-08-26T13:20:58+00:00

I see no hurdles mentioned and no reason not to use fgets() in any posts. Care to elaborate?

No one has offered this suggestion:
1) read a line with fgets()
2) use strstr() to find your word
3) check the character before the word and if it's an alpha character, it's not your word.
4) check the character after the word and if it's an alpha character, it's not your word.
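
Steps 2 to 4 above could be sketched like this for a line already read with fgets(). The function name `count_whole_word` is made up for the example, and it counts every whole-word hit in the line instead of stopping at the first one.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Count whole-word occurrences of `word` in `line` (hypothetical helper).
 * A hit from strstr() is rejected when the character just before or
 * just after it is alphabetic. */
size_t count_whole_word(const char *line, const char *word)
{
    size_t len = strlen(word);
    size_t count = 0;
    const char *hit = line;

    if (len == 0)
        return 0;   /* an empty search word never matches */

    while ((hit = strstr(hit, word)) != NULL) {
        /* At the very start of the line there is no previous character. */
        int alpha_before = hit > line && isalpha((unsigned char)hit[-1]);
        /* hit[len] is at worst the terminating '\0', which is not alphabetic. */
        int alpha_after = isalpha((unsigned char)hit[len]);

        if (!alpha_before && !alpha_after)
            ++count;

        ++hit;      /* resume the search one character past this hit */
    }

    return count;
}
```

For example, `count_whole_word("I ask for multitasks", "ask")` counts the standalone "ask" and skips the one inside "multitasks".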

Adak 419 Nearly a Posting Virtuoso · Answer 11 · 2011-08-26T14:35:43+00:00

He gives the example of a line of text, not having the entire word, like draw
ing.

Using fgets() and strstr(), for "drawing", you wouldn't find the word if it was split like this.

I agree that's a simple and sweet algorithm Walt, but it can't handle split words without "hurdles". ;)

WaltP 2,905 Posting Sage w/ dash of thyme Team Colleague · Answer 12 · 2011-08-27T02:33:48+00:00

Then read the entire document ( read() ), remove all \n's and search.

You can't find words that are split on a line without some kind of hurdle.
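
A minimal sketch of the "remove all \n's" step, assuming the document is already in a writable, NUL-terminated buffer. The name `strip_newlines` is invented for the example, and it also drops '\r' so that files with CRLF line endings behave the same way.

```c
#include <stddef.h>
#include <string.h>

/* Remove every '\n' and '\r' from `text` in place (hypothetical helper).
 * Returns the new length of the string. */
size_t strip_newlines(char *text)
{
    char *dst = text;

    for (const char *src = text; *src != '\0'; ++src) {
        if (*src != '\n' && *src != '\r')
            *dst++ = *src;      /* keep everything except line breaks */
    }
    *dst = '\0';

    return (size_t)(dst - text);
}
```

The hurdle is still there in a different form: deleting line breaks rejoins "dra" and "wing", but it also glues together two separate words that had only a newline between them.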

D33wakar 36 Posting Whiz in Training · Answer 13 · 2011-08-27T13:17:55+00:00

Alright, here's the code(incomplete) that I managed to write after seeing Adak’s algorithm.

...
while((chr=fgetc(txt_file))!=EOF)
    {
        if(chr=='\n')/* NOT REQUIRED*/
            continue;
        else if(chr==*(substring+i))
            {   
                i++;
                chr_match++;
                if(chr_match==strlen(substring))
                word_match++;
            }
        else
            i=chr_match=0;
    }
...

I still need to check if the words found are valid. I'm thinking of using fseek() to check if the char before and after the word is a space or punctuation. What's your take on that?

Anyway, I was amazed to see that the code doesn't have the "split word" problem (it works fine without line 4). How come?

Adak 419 Nearly a Posting Virtuoso · Answer 14 · 2011-08-27T14:50:22+00:00

Because '\n' is not part of your *substring

I wouldn't use fseek(), unless it was the only way possible to make it work. Any time you can avoid making the disk head move around, I'd avoid it. On big files or lots of runs, it introduces relatively long delays (shorter delays with SSD's, but still a delay), and extra wear and tear on disk heads, if you have the data on HD's.

I'd go with a "flag" (a boolean type variable), that you set when the match == strlen(substring). Then, when it loops and a char has been stored in ch, add an if statement, if ch is a space, or a punctuation, then the match is confirmed, so handle it then, and of course, reset your variables, including the flag.
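
One way the flag idea could look when written out in full, reading with fgetc() and never seeking. The names `count_word_in_stream` and `is_delim` are made up for this sketch, and it assumes the target word itself contains no spaces or punctuation.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* A word ends at whitespace or punctuation (assumed definition of a word). */
static int is_delim(int ch)
{
    return isspace(ch) || ispunct(ch);
}

/* Count whole-word occurrences of `word` in `fp`, one character at a
 * time (hypothetical helper name). */
long count_word_in_stream(FILE *fp, const char *word)
{
    size_t len = strlen(word);
    size_t matched = 0;     /* characters of `word` matched so far */
    int candidate = 1;      /* flag: the current word can still match */
    long count = 0;
    int ch;

    if (len == 0)
        return 0;

    while ((ch = fgetc(fp)) != EOF) {
        if (is_delim(ch)) {
            if (candidate && matched == len)
                ++count;    /* full match confirmed by the delimiter */
            candidate = 1;  /* a new word starts after a delimiter */
            matched = 0;
        } else if (candidate && matched < len &&
                   ch == (unsigned char)word[matched]) {
            ++matched;
        } else {
            candidate = 0;  /* mismatch, or the word is longer than the target */
        }
    }

    if (candidate && matched == len)
        ++count;            /* the last word ended at end of file */

    return count;
}
```

Because the delimiter that ends a word is the same character that confirms the match, there is no need to read ahead or to move the file position backwards.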

D33wakar 36 Posting Whiz in Training · Answer 15 · 2011-08-27T19:04:20+00:00

while((chr=fgetc(txt_file))!=EOF)
{
    if(chr==*(substring+i) || isspace(chr)||ispunct(chr))
    {
        if(isspace(chr)||ispunct(chr))
        {
            flag=1;
            continue;
        }
        
        if(flag)
        {
            i++;
            chr_match++;
        }
        
        if(chr_match==strlen(substring) && flag)
        {
            flag=0;
            if(isspace(ch=fgetc(txt_file)) || ispunct(ch=fgetc(txt_file)))/* there must be something wrong here*/
                word_match++;
        }

    }
  
    else
        i=chr_match=flag=0;

}

I'm almost there, but there's a small problem in line 20 of the above code. The second part of the condition, ispunct(ch=fgetc(txt_file)), is not effective at all.
It was kind of intuitive, you know (it may look ridiculous).

Narue 5,707 Bad Cop Team Colleague · Answer 16 · 2011-08-27T19:14:19+00:00

You're reading two characters, not testing the first one twice. Consider this instead:

if (isspace(ch = fgetc(txt_file)) || ispunct(ch))
    ++word_match;

Also keep in mind that while C allows you a great deal of freedom in structuring elements, taking advantage of that freedom is not always a good idea. Your code would be better written like so for readability purposes:

ch = fgetc(txt_file);

if (isspace(ch) || ispunct(ch))
    ++word_match;