Analysing HTML documents in C++

I started with curl and can now post form data, fetch HTML pages and do all the basic stuff. But I'm unable to analyse individual pieces of data in the page I get back, such as filtering out all the images, videos and things like that. I can do that in JavaScript, but that only runs in a browser, so I want a way to analyse a web page without a browser.
I searched the web and found that this is called parsing. Can someone explain how it works, and whether individual elements in a web page can be obtained using a parser, or have I misunderstood it? Is there any other way to do this using C++ alone?

Thanks in advance.

IndianaRonaldo
Light Poster
39 posts since Jan 2011
Reputation Points: 7
Solved Threads: 1
 
Ancient Dragon
Retired & Loving It
Team Colleague
30,049 posts since Aug 2005
Reputation Points: 5,662
Solved Threads: 2,343
 

At last I found a well-established library that has good forum support and suits my job. libxml2 seems to be the best option for this. With libcurl coupled with libxml2, I was able to write a program that downloads wallpapers from a site. It has good functionality and documentation.

IndianaRonaldo
Light Poster
39 posts since Jan 2011
Reputation Points: 7
Solved Threads: 1
 

I searched around the web and I found it was called parsing. Can someone explain how it works, and if individual elements in a webpage can be obtained using a parser or

Yes, this can easily be done, and there are a number of techniques for it. At the moment I am developing a search engine in C/C++ that includes a spider, which fetches the HTML from a server and parses it for all the links.

The technique I follow is what I learned in compiler construction: design a transition diagram and then code it as a state machine that reads the HTML character by character. Each character moves the machine from one state to the next, so at any point you know which character and which word you have just read.
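As a rough illustration of that idea (not the actual spider code), here is a minimal sketch of a character-by-character scanner that collects the href values of anchor tags. The state names and the function `extract_links` are made up for this example. It assumes simple markup: no whitespace around `=`, and it makes no attempt to skip comments or script blocks.

```cpp
#include <cctype>
#include <string>
#include <vector>

// States of a small hand-written scanner (hypothetical names, for illustration).
enum class State { Text, TagName, InTag, AttrName, BeforeValue, QuotedValue, BareValue };

// Collects the href values of <a> tags by reading the input one character at a time.
std::vector<std::string> extract_links(const std::string& html) {
    std::vector<std::string> links;
    State state = State::Text;
    std::string tag, attr, value;
    char quote = '\0';

    auto lower = [](char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    };
    auto is_space = [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    };
    // Called when an attribute value is complete: keep it only for <a href=...>.
    auto finish_value = [&]() {
        if (tag == "a" && attr == "href") links.push_back(value);
        attr.clear();
        value.clear();
    };

    for (char c : html) {
        switch (state) {
        case State::Text:          // outside any tag
            if (c == '<') { tag.clear(); state = State::TagName; }
            break;
        case State::TagName:       // reading the tag name after '<'
            if (c == '>') state = State::Text;
            else if (is_space(c)) state = State::InTag;
            else tag += lower(c);
            break;
        case State::InTag:         // between attributes
            if (c == '>') state = State::Text;
            else if (!is_space(c) && c != '/') { attr.assign(1, lower(c)); state = State::AttrName; }
            break;
        case State::AttrName:      // reading an attribute name
            if (c == '=') state = State::BeforeValue;
            else if (c == '>') { attr.clear(); state = State::Text; }
            else if (is_space(c)) { attr.clear(); state = State::InTag; }
            else attr += lower(c);
            break;
        case State::BeforeValue:   // just after '='
            if (c == '"' || c == '\'') { quote = c; state = State::QuotedValue; }
            else if (c == '>') { finish_value(); state = State::Text; }
            else if (!is_space(c)) { value.assign(1, c); state = State::BareValue; }
            break;
        case State::QuotedValue:   // inside "..." or '...'
            if (c == quote) { finish_value(); state = State::InTag; }
            else value += c;
            break;
        case State::BareValue:     // unquoted value, ends at space or '>'
            if (c == '>') { finish_value(); state = State::Text; }
            else if (is_space(c)) { finish_value(); state = State::InTag; }
            else value += c;
            break;
        }
    }
    return links;
}
```

Calling `extract_links(page)` on a downloaded page returns the links in the order they appear, which is the main thing this approach gives you over a plain search.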

However, where I don't need such precision I use the searching technique; where I need precision and need to know the order of the elements, I use the character-by-character scan.
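For comparison, here is a minimal sketch of the searching approach, assuming it means a plain substring search with `std::string::find`. The function name `find_attribute_values` is hypothetical. It only handles double-quoted values and knows nothing about tags, so a search for `src` would also match something like `data-src="..."`.

```cpp
#include <string>
#include <vector>

// Returns every value that follows key=" in the text, e.g. key = "src".
// Plain substring search: no notion of tags, comments, or nesting.
std::vector<std::string> find_attribute_values(const std::string& html,
                                               const std::string& key) {
    std::vector<std::string> values;
    const std::string needle = key + "=\"";
    std::string::size_type pos = 0;

    while ((pos = html.find(needle, pos)) != std::string::npos) {
        const std::string::size_type start = pos + needle.size();
        const std::string::size_type end = html.find('"', start);
        if (end == std::string::npos) break;  // unterminated value: stop searching
        values.push_back(html.substr(start, end - start));
        pos = end + 1;                        // continue after the closing quote
    }
    return values;
}
```

For example, `find_attribute_values(page, "src")` gives a quick list of image and script sources when the exact structure of the page does not matter.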

mustafaneguib
Junior Poster
102 posts since Apr 2008
Reputation Points: 14
Solved Threads: 4
 

This question has already been solved
