954,499 Members — Technology Publication meets Social Media
Username:
Password:
Lost login information?
Have something to say? Contribute New Article Reply to this Article

c++ count html tags

Hi guys.

Im doing an assignment, I am getting on ok with it but I am sadly really stuck at the moment.

The part I am stuck on involves counting html tags in a text file. I have thought of a method of doing this but unfortunately I have no idea how to implement it into the code. My idea is to look for the start symbol of the tag( <) then the contents, then the end symbol (>) Does anyone know how I could do this in c++?

Many thanks

Mat

mat1989
Newbie Poster
8 posts since Apr 2009
Reputation Points: 10
Solved Threads: 0
 

You could just go through the file and when your program comes across a '<' it should just ignore everything after it until it comes across a '>', at that point you've to count a tag ...

tux4life
Nearly a Posting Maven
2,350 posts since Feb 2009
Reputation Points: 2,134
Solved Threads: 243
 

probably something like this:

std::string str = "<html>";
if( str.find("<") != string::npos && str.find(">") != string::npos)
{
    // most likely an html tag
}
Ancient Dragon
Retired & Loving It
Team Colleague
30,049 posts since Aug 2005
Reputation Points: 5,662
Solved Threads: 2,343
 

Thanks alot guys :) i'll try these ideas out.

mat1989
Newbie Poster
8 posts since Apr 2009
Reputation Points: 10
Solved Threads: 0
 

Dragon I had a stab at using your technique but it returns a random number of tags in the output. :(

mat1989
Newbie Poster
8 posts since Apr 2009
Reputation Points: 10
Solved Threads: 0
 

Show us the code, with CODE TAGS.

MosaicFuneral
Posting Virtuoso
1,691 posts since Nov 2008
Reputation Points: 888
Solved Threads: 116
 
// assignment program 
// read file and copy to another
// count amount of charecters,lines, comments and tags
// change Xhtml tags from upper case to lower case
// place in new file 

#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
#include <iomanip> 
#include <cctype>
using namespace std;

int main()
{
string file1,file2;
string str = "<>"; 
ifstream ipfile;
ofstream opfile;
char c;



int amountline = 0;
int amountcha = 0;
int amounttag = 0;
int amountcomment = 0;
cout <<  "Please enter the name of the file you wish to check" << endl;
cin >> file1;


ipfile.open(file1.c_str());


if (!ipfile.is_open())

{
	cout << "Oops! Couldn't open " << file1 << "!\n"<<endl;

	return 1;

}
 
{

cout << " Please enter the file you wish the edited contents to be copied to" << endl;
cout << " This will be created if it does not already exist"<< endl; 
cin >> file2;
}

opfile.open(file2.c_str());


while (!ipfile.eof())

{	

ipfile.get(c);
opfile << c;


		

		if(c!='\n' && !ipfile.eof() && c!=' ')
		{
			amountcha++;
		}
		if(c=='\n')
		{
			amountline++;
		}

	 if( str.find("<") != string::npos && str.find(">") != string::npos)
{
    amounttag++;
}

   
      if ( c == '!')
      {
      amountcomment++;
      }
  

  









cout << " This file contains :" << amountline << " lines" << endl;
cout << " This file contains :" << amountcha << " charecters" << endl;
cout << " this file has : " << amountcomment<<  " comments " << endl;
cout << " This file contains  : " << amounttag << "tags"   << endl; 

cout << " Copy complete, edited code located in " << file2 << endl;



return 0;
}


thats all my code so far.

mat1989
Newbie Poster
8 posts since Apr 2009
Reputation Points: 10
Solved Threads: 0
 

why not just use getline() to read an entire line at one time?

std::string line;
while( getline(ipfile, line) )
{
   // blabla
}


I don't do html coding, but I think any given tag must be on one line, such as "" can not be split between lines, so it doesn't make any sense to read the html file one character at a time.

Ancient Dragon
Retired & Loving It
Team Colleague
30,049 posts since Aug 2005
Reputation Points: 5,662
Solved Threads: 2,343
 

I can use getline to search for html tags in the line?

mat1989
Newbie Poster
8 posts since Apr 2009
Reputation Points: 10
Solved Threads: 0
 

you use getline() to read an entire line that is terminated with '\n'. Then use string::find() to look for < and > characters as shown in previous example code.

Also note that ipfile.eof() is not needed in my loop because the loop stops on error or end-of-file.

Ancient Dragon
Retired & Loving It
Team Colleague
30,049 posts since Aug 2005
Reputation Points: 5,662
Solved Threads: 2,343
 

You could tokenize (strtok) by "<" and stick each piece into a vector or so... then get it's size().

Comatose
Taboo Programmer
Team Colleague
2,910 posts since Dec 2004
Reputation Points: 361
Solved Threads: 215
 
You could tokenize (strtok) by "<" and stick each piece into a vector or so... then get it's size().

That sounds like the hard way. strtok() embeds NULLs in the string, which might corrupt the std::string class.

Ancient Dragon
Retired & Loving It
Team Colleague
30,049 posts since Aug 2005
Reputation Points: 5,662
Solved Threads: 2,343
 

>You could tokenize (strtok) by "<" and stick each piece into a vector or so... then get it's size().

Best not to mix c with c++. Naughty naughty :-0

In any case all these suggestions are too niave. There are a few examples which will slip through the loop, (html attributes)

http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx

So using regular expressions, would be the best way to deal with this however, if this is homework your professor is unlikely to care.

iamthwee
Posting Expert
5,950 posts since Aug 2005
Reputation Points: 1,543
Solved Threads: 439
 

I got it working guys :) Code I used below :

if (c=='<' )

	 {
		 
		istag = true;

	 }

	 if (c== '>' && istag == true)

	 {

amounttag++;
istag = false; 
	 }




if (c== '!')

{

	iscomment = true; 

}

if ( c== '-' && iscomment == true)
{
amountcomment++;
amounttag --;
iscomment = false; 

}


works a treat :) , Thanks for all your help.

mat1989
Newbie Poster
8 posts since Apr 2009
Reputation Points: 10
Solved Threads: 0
 

I think I am doing the same assignment as you. Dont forget there maybe an exclamation mark in normal text. "Hello!" you would be better saying if '<' precedes '!' then add 1 to comment count.

Xlphos
Veteran Poster
1,108 posts since Apr 2008
Reputation Points: 32
Solved Threads: 118
 

This article has been dead for over three months

Post: Markdown Syntax: Formatting Help
You