Hi, I am new to Python. Can anyone please help me parse data from an HTML file?
I want to display the content that lies under a particular tag. Can you also tell me where I can find tutorials on this topic with sample examples?
Python has two very good third-party parsers, BeautifulSoup and lxml.
These parsers can handle badly formed HTML, which can be important.
Here is an example with BeautifulSoup.
We want the price of beans from this site: http://beans.itcarlow.ie/prices.html
import urllib2
from BeautifulSoup import BeautifulSoup

# Read in the website (Python 2 with BeautifulSoup 3)
url = urllib2.urlopen('http://beans.itcarlow.ie/prices.html')
soup = BeautifulSoup(url)
print soup  # website contents
tags = soup.findAll('strong')  # findAll returns a list of all <strong> tags
print tags  # [<strong>$6.36</strong>]
print tags[0].string  # text of the first match, "$6.36"
, and after saving it as an HTML file, I should write a program that displays the content under "What do I need to create HTML?". If you look closely at that site you will find this heading.
Requirement: I have to display only:
You don't need any special equipment or software to create HTML. In fact, you probably already have everything you need. Here is what you need:
•Text or HTML editor. Most computers already have a text editor and you can easily create HTML files using a text editor. Having said that, there are definite benefits to be gained in downloading an HTML editor.
If you want the best HTML editor, and you don't mind paying money for it, you can't go past Adobe Dreamweaver. Dreamweaver is probably the best HTML editor available, and you can download a trial version for starters.
If you don't have the cash to purchase an editor, you can always download a free one. Examples include SeaMonkey, Coffee Cup (Windows) and TextPad (Windows).
If you don't have an HTML editor, and you don't want to download one just now, a text editor is fine. Most computers already have a text editor. Examples of text editors include Notepad (for Windows), Pico (for Linux), or Simpletext/Text Edit/Text Wrangler (Mac).
•Web Browser. For example, Internet Explorer or Firefox.
So for this kind of task, I think I have to write a regular expression to find that particular tag and display that particular data. But I don't know how to do this. Please help me!
If you insist, you can look for just the particular tag using regex, but that is the hard way.
If you really have a one-shot parsing need that looks for a particular "<h2>", then reads until the next one, you can just look in each line for "<h2>" and "</h2>" and use a simple state machine to copy all the lines between the particular "</h2>" and the next "<h2>". This is easier than regex, and may even be faster.
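The state-machine idea above could be sketched roughly as follows. This is only an illustration written for Python 3: the function name section_under_heading and the sample lines are made up for the example, and it assumes each "<h2>" heading sits on a single line together with its text.

```python
def section_under_heading(lines, heading_text):
    """Return the lines between the <h2> containing heading_text and the next <h2>."""
    collecting = False  # state: False = still searching, True = copying lines
    collected = []
    for line in lines:
        if "<h2" in line.lower():
            if collecting:
                break  # reached the next heading, so the section is finished
            if heading_text in line:
                collecting = True  # found our heading; copy the lines that follow
            continue  # heading lines themselves are never copied
        if collecting:
            collected.append(line.rstrip("\n"))
    return collected


# Hypothetical sample data standing in for the saved HTML file
sample = """<h2>Intro</h2>
<p>Welcome.</p>
<h2>What do I need to create HTML?</h2>
<p>You don't need any special equipment.</p>
<p>A text editor and a web browser.</p>
<h2>Next</h2>
<p>Other text.</p>
""".splitlines()

result = section_under_heading(sample, "What do I need to create HTML?")
print("\n".join(result))
```

For a real file you would pass the open file object instead of the sample list, since iterating over a file also yields one line at a time.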
If you need to be able to parse HTML in general, with this case as a particular example, then look here (again) for the easy way that does not use BeautifulSoup: http://docs.python.org/library/htmlparser.html This really is (one of) the right way(s) to do what you want.
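As a rough sketch of the HTMLParser route: the example below is written for Python 3, where the module lives at html.parser rather than the Python 2 HTMLParser module the link describes. The class name SectionExtractor and the sample page are invented for illustration, and it assumes the wanted text sits between one "<h2>" heading and the next.

```python
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Collect the text that follows a given <h2> heading, up to the next <h2>."""

    def __init__(self, heading_text):
        super().__init__()
        self.heading_text = heading_text
        self.in_h2 = False       # True while we are inside an <h2> element
        self.h2_buffer = []      # text pieces of the heading being read
        self.collecting = False  # True while we are inside the wanted section
        self.chunks = []         # collected text of the wanted section

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.h2_buffer = []
            self.collecting = False  # any new heading ends the previous section

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
            heading = "".join(self.h2_buffer).strip()
            self.collecting = (heading == self.heading_text)

    def handle_data(self, data):
        if self.in_h2:
            self.h2_buffer.append(data)
        elif self.collecting and data.strip():
            self.chunks.append(data.strip())


# Hypothetical page standing in for the saved HTML file
page = """<html><body>
<h2>Intro</h2><p>Welcome.</p>
<h2>What do I need to create HTML?</h2>
<p>You don't need any special equipment.</p>
<ul><li>Text or HTML editor.</li><li>Web Browser.</li></ul>
<h2>Next</h2><p>Other text.</p>
</body></html>"""

parser = SectionExtractor("What do I need to create HTML?")
parser.feed(page)
parser.close()
print(parser.chunks)
```

Because the parser reports tags and text as events, it does not matter here whether an opening tag and its closing tag are on the same line.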
Sorry, I don't need code; I just need a sample example using regular expressions, because I am supposed to follow only that procedure. Yes, I already had a look at this site, but it only shows the methods I have to use.
For a small website it is possible to use only regex.
But the better way is to use regex in combination with BeautifulSoup or lxml.
I use these two parsers because they are the best at parsing HTML.
HTMLParser will break if the HTML is even a little malformed, and very few sites have perfect HTML.
Roughly speaking, only around 5% of all websites on the internet have 100% valid HTML.
You need to think about what regex patterns you will need, and you need to keep in mind that opening tags may not appear on the same line as their closing tags. Regex information is here: http://docs.python.org/library/re.html There are code snippets at that URL. Read the whole page (or at least scan it and read the interesting parts). Look for the difference between the search and match functions, and for a way to split lines at a particular regex occurrence.
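To make those pointers concrete, here is a small Python 3 illustration of the re functions involved: match only succeeds at the start of the string, search looks anywhere in it, and split cuts a line at every occurrence of a pattern. The sample line and the pattern are assumptions chosen for the example, not something taken from the real page.

```python
import re

# Hypothetical line with text before and after the heading
line = "some text <h2>What do I need to create HTML?</h2> more text"
pattern = re.compile(r"<h2>(.*?)</h2>", re.IGNORECASE)

at_start = pattern.match(line)   # match() only tries at the start of the line
anywhere = pattern.search(line)  # search() scans the whole line
heading = anywhere.group(1)      # text captured between the tags

# split() cuts the line at every opening or closing h2 tag
parts = re.split(r"</?h2>", line, flags=re.IGNORECASE)

print(at_start)
print(heading)
print(parts)
```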
The general idea is that you read the file one line at a time and parse each line looking for the next thing that is needed (which changes depending on what you have already seen). At some point, you begin to collect (partial) lines until you find another regex hit for the ending line, at which point you may store a partial line, then break out of the read/parse loop and display/write the data you were looking for.
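A minimal Python 3 sketch of that read/parse loop might look like this. The function name extract_section, the two patterns, and the in-memory sample file are all assumptions made for illustration. It expects the starting heading to fit on one line, but the collected text may span several lines and may begin or end partway through a line.

```python
import io
import re


def extract_section(lines, heading):
    """Collect the text after the given <h2> heading, up to the next <h2>."""
    start_re = re.compile(
        r"<h2[^>]*>\s*" + re.escape(heading) + r"\s*</h2>", re.IGNORECASE
    )
    end_re = re.compile(r"<h2[^>]*>", re.IGNORECASE)
    collected = []
    collecting = False
    for line in lines:
        if not collecting:
            hit = start_re.search(line)
            if hit is None:
                continue  # still looking for the starting heading
            collecting = True
            line = line[hit.end():]  # keep only what follows the heading
        hit = end_re.search(line)
        if hit:
            collected.append(line[:hit.start()])  # partial line before next heading
            break  # done: leave the read/parse loop
        collected.append(line)
    return "".join(collected).strip()


# Hypothetical file contents; a real program would use open("page.html") instead
sample_file = io.StringIO(
    "<h2>Intro</h2><p>Welcome.</p>\n"
    "<h2>What do I need to create HTML?</h2><p>You don't need\n"
    "any special equipment.</p>\n"
    "<p>Here is what you need:</p><h2>Next</h2><p>Other.</p>\n"
)

text = extract_section(sample_file, "What do I need to create HTML?")
print(text)
```

The result still contains the inner tags such as "<p>"; removing them would need another pass, which is one reason the parser-based approaches mentioned earlier in the thread are usually less work.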