Hi Guys,

I'm new to writing C programs, and I don't expect to get the entire code here. I was given an assignment to create a C program that emulates a search engine. I have a bunch of text files containing HTML data, and I need to scan those files, set aside the HTML tags, and return/display all the valid words from each file. Can someone give me an idea of the best way to put together a function for this task? In particular, which C function would be the most appropriate for handling the strings and extracting the words without touching the tags? Thanks, guys. I really appreciate any help in advance.

harry

Can someone give me an idea of the best way to put together a function for this task? In particular, which C function would be the most appropriate for handling the strings and extracting the words without touching the tags?

So you want to know which one single function you should use? That's a tough one :eek:

Seriously, the way to handle this is to
- read a line at a time
- remove all the html tags leaving only the 'real information' from the line
- then process what's left to get the information you want

Yes, what Mr. WaltP has said pretty much clarifies the situation. If you still need to know some functions that will help you with your task, then:

1. Read the current line into a string "currentLine".

2. Let the contents be e.g. <p>Here is my content</p>.

3. Find the occurrence of ">", which marks the end of the starting tag, and the position of the "<", which marks the start of the ending tag.

4. Extract the data between the indexes found using the above procedure.

You can try out the following code snippet:

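#include <stdio.h>
#include <string.h>

int main (void)
{
    char temp[] = "<p>hello to all</p>" ;
    printf ("%s", temp) ;
    /* find the '>' that ends the opening tag */
    char* my_string = strpbrk( temp, ">" ) ;
    if (my_string == NULL)
        return 1 ;   /* no tag found */
    my_string ++ ;   /* step past the '>' */
    /* length of the text up to the '<' of the closing tag */
    size_t index = strcspn (my_string, "<") ;
    my_string [index] = '\0' ;   /* cut the string there */
    printf ("\n%s", my_string) ;
    return 0 ;
}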

My output:

<p>hello to all</p>
hello to all
Press ENTER to continue.

Though not a very good representation (I just churned that out in some spare time), it will help you understand the basics.

Hope it helped, bye.

Guys,

Thanks a lot for your replies! Much appreciated. I have enough info to begin this program. Thanks again, guys.

harry

Hi Guys,

Almost forgot: is there any C function I can use to extract valid words from a string, or do I have to do it using something like strtok() to remove spaces and special characters? Thanks again.

harry

If you are talking about removing field separators and replacing them with spaces, then you can do something like this:

#include <stdio.h>
#include <string.h>
#define LINE_BUF  100

int main(void)
{
  char  line[LINE_BUF];
  char *sep;

  while (fgets(line, LINE_BUF, stdin))
  {
    sep = line;

    // search for the given chars and, if one is found, store
    // its ptr in sep. It can then be changed by
    // dereferencing sep. Resuming from sep avoids
    // rescanning the start of the line each time.
    while ((sep = strpbrk(sep, ";.&:,")) != NULL)
      *sep = ' ';

    fputs(line, stdout);
  }
  return 0;
}

Hope it helped, bye.

Thanks very much! I'll see what I can do now. Thanks.

Hi ~S.O.S.~

Would you be so kind as to change the code below to use a loop, so I can extract the words between > and < regardless of how many times they appear in the line/string? Thanks.

#include <stdio.h>
#include <string.h>

int main (void)
{
    char temp[] = "<p>hello to all</p><p>hello to all</p><p>hello to all</p>" ;
    printf ("%s", temp) ;
    char* my_string = temp ;
    my_string = strpbrk( my_string, ">" ) ;
    my_string ++ ;
    size_t index = strcspn (my_string, "<") ;
    my_string [index] = '\0' ;
    printf ("\n%s", my_string) ;
    return 0 ;
}
