Hi,
I am a little confused about how to use strtok(). I extracted the source of a web page using Java, and now I have to remove all the tags from the HTML so that only the plain text is left. For example, a simple HTML page looks like this:

<html>
<Title>
My Page
</title>
</html>

Now, in the above example, I want all the tags removed and the text of the page, i.e. "My Page", written to a second file. I have written the code below, but for some reason it is removing the text instead of the tags. Also, I am very confused about the use of strtok(), so I would appreciate some guidance on this.
Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* needed for strtok() */

FILE *f1, *f2;

int main(void)
{
    char x[0xFFF];    /* 0xFFF = 4095 bytes for one line of input */
    char *ptr1 = NULL;

    f1 = fopen("result.dat", "r");
    f2 = fopen("result1.dat", "w");

    if (f1 == NULL || f2 == NULL)
    {
        printf("\nSorry but a file could not be opened");
        exit(1);
    }

    while (fgets(x, sizeof x, f1) != NULL)
    {
        ptr1 = strtok(x, "<");
        while (ptr1 != NULL)
        {
            fprintf(f2, "\n%s", x);
            ptr1 = strtok(NULL, ">");
        }
    }

    fclose(f1);
    fclose(f2);

    return 0;
}

All 5 Replies

Well you need to find matching pairs of < >

It seems to me that it's
Find <
Skip ahead until you find >
Print what you find until the next <

Perhaps write the first program as something which just prints out
Found "<head>"
Found "<body>"
Found "</body>"
Found "</head>"
When you can do that reliably, then expand the scope of the program towards your final goal.

OK, that makes sense, but I would like some information on strtok(). I searched the web and found some material, but I can't figure out whether it applies to my case. The link is:

http://www.cplusplus.com/reference/clibrary/cstring/strtok.html

So please, can anyone explain what this function does, so I can figure out whether I can use it to remove HTML tags?

What I think this function does is let you print a certain field that has text in it. For " *hi*bye ", only hi would be printed (because of the asterisks). So, on that basis, I could replace <html> with "" and thereby remove the tag. I hope you understand what I am trying to do.
Thanks in advance :)

Hi,
I have worked out the code and it works quite well, but it does not use strtok(), so I am still not satisfied. Here is the code:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char x1[0xff];      /* input buffer, 255 bytes */
    size_t length;
    size_t i;

    printf("\nEnter the string with html tags :");

    /* %254s keeps the input inside the buffer; like %s, it stops at whitespace */
    if (scanf("%254s", x1) != 1)
        return 1;

    length = strlen(x1);

    for (i = 0; i < length; i++)
    {
        if (x1[i] == '<')
        {
            /* inside a tag: skip ahead to the closing '>' (or to the end) */
            while (i < length && x1[i] != '>')
                i++;
        }
        else
        {
            /* outside a tag: print the character */
            printf("%c", x1[i]);
        }
    }
    printf("\n");

    return 0;
}

Output:
Enter the string with html tags :<html>hi</html>
hi

I'd suggest that strtok() might not be the best choice here, because its delimiters are all treated equally. strtok() takes a string and divides it into any number of equivalent "tokens", based on the blind application of one or more "delimiters" that mark the boundaries of the tokens.

For your problem, you want to differentiate between the opening bracket < and the closing bracket >, so that you can treat the text found inside the brackets differently from the text found outside them.

So I would instead use a couple of char * pointers to keep track of positions within the search string, in conjunction with strstr() to find the next bracket location, along with a variable that keeps track of whether you are "inside" a tag (because the last bracket found was '<') or "outside" a tag (because the last bracket found was '>').



OK, I understand now what strtok() is actually doing. Your method of using two pointers is right. Thanks for the help :)
