
How do I extract data from an HTML file?

Hi,
I want to write a program which extracts 'Rated PG for some scary moments and mild language' from the following HTML file and returns it as a list.

html file:
<div class="info">
<h5><a href="/mpaa">MPAA</a>:</h5>

<div class="info-content">
Rated PG for some scary moments and mild language. (also 2009 extended version)
</div>
</div>

Why wouldn't this code work?
mpaaget = re.compile('<h5><a href="/mpaa">MPAA</a>:</h5><div class="info-content">(.*?)</div>')
mpaa = mpaaget.findall(htmlr)

masterinex
Newbie Poster
17 posts since Nov 2009
Reputation Points: 10
Solved Threads: 0
Skill Endorsements: 0

Use an HTML parser for this job, such as BeautifulSoup. If you don't want to, another way is to read the whole HTML, split it on "</div>", go through each element in the resulting list and check for '<div class="info-content">'. If it is found, replace it with an empty string. You will get your string.
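A minimal sketch of the split approach, in case it helps. The sample string is copied from the question, and the names MARKER and extract_info_content are made up for illustration. It is a slight variation on the steps described: instead of replacing the marker with an empty string, it keeps only what follows the marker, so the leading <h5> part is dropped as well.

```python
# Hypothetical sample input, copied from the snippet in the question.
html = '''<div class="info">
<h5><a href="/mpaa">MPAA</a>:</h5>

<div class="info-content">
Rated PG for some scary moments and mild language. (also 2009 extended version)
</div>
</div>'''

MARKER = '<div class="info-content">'


def extract_info_content(text):
    """Return the text of every info-content div as a list of strings."""
    results = []
    # Split on the closing tag, then look for the opening marker in each piece.
    for chunk in text.split('</div>'):
        if MARKER in chunk:
            # Keep only what follows the marker and trim surrounding whitespace.
            results.append(chunk.split(MARKER, 1)[1].strip())
    return results


print(extract_info_content(html))
```

This only holds up while the markup stays this simple (no nested divs inside info-content), which is one reason a real parser is the safer choice.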

ghostdog74
Junior Poster
156 posts since Apr 2006
Reputation Points: 75
Solved Threads: 48
Skill Endorsements: 0

Hey, I tried it with
mpaaget = re.compile('<div class="info-content">(.*?)</div>')
but then I got something else. Could it be because there is a newline after <div class="info-content">? How do I take care of that?

<div class="info-content">
Rated PG for some scary moments and mild language. (also 2009 extended version)
</div>

masterinex

Oh, I got it now, thanks for pointing it out.

masterinex

Yes, the whitespace between the tags is not accounted for in your regular expression. Modify it like so to match zero or more (*) whitespace characters (\s), which includes newlines:

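>>> import re
>>> # h holds the HTML snippet from the question
>>> m = re.compile(r'<h5><a href="/mpaa">MPAA</a>:</h5>\s*<div class="info-content">\s*(.*?)\s*</div>')
>>> m.findall(h)
['Rated PG for some scary moments and mild language. (also 2009 extended version)']
>>> m.match(h)  # returns None: match() only tries at the very start of the string
>>>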
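For anyone who wants to run this outside the interpreter, here is a self-contained sketch. It assumes h is exactly the snippet from the question, and the names strict and relaxed are just labels I picked. It puts the original pattern next to the \s* version and shows search() as the counterpart to match() when the tag is not at the start of the string.

```python
import re

# Hypothetical sample input, copied from the snippet in the question.
h = '''<div class="info">
<h5><a href="/mpaa">MPAA</a>:</h5>

<div class="info-content">
Rated PG for some scary moments and mild language. (also 2009 extended version)
</div>
</div>'''

# Original pattern: expects the two tags to be directly adjacent.
strict = re.compile(r'<h5><a href="/mpaa">MPAA</a>:</h5><div class="info-content">(.*?)</div>')
# Relaxed pattern: \s* absorbs any spaces, tabs or newlines between the pieces.
relaxed = re.compile(r'<h5><a href="/mpaa">MPAA</a>:</h5>\s*<div class="info-content">\s*(.*?)\s*</div>')

print(strict.findall(h))   # finds nothing because of the newlines between the tags
print(relaxed.findall(h))

# match() only tries at position 0; search() scans the whole string.
print(relaxed.match(h))
found = relaxed.search(h)
print(found.group(1))
```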
jlm699
Veteran Poster
1,112 posts since Jul 2008
Reputation Points: 355
Solved Threads: 293
Skill Endorsements: 0

Yeah, that was the problem, thanks for pointing it out again.
Looks like \s* also covers the \n.

I have another question. Let's say I want to extract the number 7.2 from the HTML string below:

<a href="/ratings_explained">weighted average</a> vote of <a href="/List?ratings=7">7.2</a> / 10</p><p>

How come this doesn't work?

averageget = re.compile('<a href="/List?ratings=7">(.*?)</a>')
average = averageget.findall(htmlr)

Could it be that there is some special structure in the HTML file again which I missed?

masterinex

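This time it's because '?' is a special character in regular expressions (you're already using it inside your group, where it makes .* non-greedy). After an ordinary character, the question mark makes that character optional, that is, it matches zero or one occurrence (whereas the asterisk (*) is a greedy match of zero or more). So 'List?ratings' matches 'Lisratings' or 'Listratings', but never a literal '?'. To match the question mark character itself you need to escape it in your regex like so: \? . The full regular expression then becomes: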

>>> c = re.compile(r'<a href="/List\?ratings=7">(.*?)</a>')
>>> c.findall(t)
['7.2']
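A self-contained version of the same idea, as a sketch. Here t is assumed to be nothing more than the fragment quoted in the question, and the names unescaped, escaped and prefix are mine. The last part uses re.escape(), which escapes the special characters of a literal string for you.

```python
import re

# t is just a name for the string being searched; here it holds the
# fragment quoted in the question.
t = ('<a href="/ratings_explained">weighted average</a> vote of '
     '<a href="/List?ratings=7">7.2</a> / 10</p><p>')

# Unescaped: "t?" means an optional "t", so a literal "?" can never match.
unescaped = re.compile(r'<a href="/List?ratings=7">(.*?)</a>')
# Escaped: "\?" matches the question mark character itself.
escaped = re.compile(r'<a href="/List\?ratings=7">(.*?)</a>')

print(unescaped.findall(t))
print(escaped.findall(t))

# re.escape() backslash-escapes the special characters in a literal string.
prefix = re.escape('<a href="/List?ratings=7">')
print(re.findall(prefix + r'(.*?)</a>', t))
```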
jlm699

Oh, I see, so it's the '?' that is causing the trouble.
What is t, by the way? Do I need to assign a value to it?

masterinex

If you want to use regex, you should compile your regex with re.DOTALL and re.M for multiline matching.
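A rough sketch of what re.DOTALL changes for the pattern tried earlier in the thread, assuming h is the snippet from the question. The names no_flag, hits and cleaned are made up. re.M is left out here because, as far as I know, it only changes how ^ and $ behave and this pattern uses neither.

```python
import re

# Hypothetical sample input, copied from the snippet in the question.
h = '''<div class="info">
<h5><a href="/mpaa">MPAA</a>:</h5>

<div class="info-content">
Rated PG for some scary moments and mild language. (also 2009 extended version)
</div>
</div>'''

pattern = '<div class="info-content">(.*?)</div>'

# Without a flag, "." stops at the newline right after the opening tag.
no_flag = re.findall(pattern, h)
print(no_flag)

# With re.DOTALL, "." also matches newlines, so the group spans the lines.
hits = re.findall(pattern, h, re.DOTALL)
# The captured text still carries the surrounding newlines, so strip them.
cleaned = [hit.strip() for hit in hits]
print(cleaned)
```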

ghostdog74

I'm a little unfamiliar with Python. What are re.DOTALL and re.M? Are they modules?

masterinex

They are flags for re.compile():
re.MULTILINE (or re.M) '^' and '$' match at the start and end of the string and of each line
re.DOTALL (or re.S) '.' matches any character, including a newline
re.IGNORECASE (or re.I) case-insensitive matching
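A small sketch of the three flags in action. The text string and the variable names are invented for this example and have nothing to do with the HTML above. Flags can be combined with the | operator.

```python
import re

# Made-up three-line sample string.
text = "Rated PG\nrated R\nUnrated"

# re.MULTILINE: "^" matches at the start of every line, not only the string.
plain = re.findall(r'^rated', text)
multi = re.findall(r'^rated', text, re.MULTILINE)
print(plain, multi)

# re.IGNORECASE combined with re.MULTILINE: "Rated" on line one matches too.
both = re.findall(r'^rated', text, re.M | re.I)
print(both)

# re.DOTALL: "." is allowed to match the newline between the first two lines.
no_dotall = re.findall(r'PG.rated', text)
dotall = re.findall(r'PG.rated', text, re.DOTALL)
print(no_dotall, dotall)
```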

vegaseat
DaniWeb's Hypocrite
Moderator
6,464 posts since Oct 2004
Reputation Points: 1,447
Solved Threads: 1,608
Skill Endorsements: 34
Question Answered as of 3 Years Ago by ghostdog74, jlm699 and vegaseat
