hey people, I would really appreciate a hand here.
OK, this is my task:
I have been given a bunch of URLs to webpages containing blog posts.
I need to extract the author name, date, and content of each blog post, plus the metadata, and put it all into a database.
Here is what I thought I would do:
use a module called gadfly to act as a bridge between Python and SQL.
Using urlopen I got the HTML source code of each webpage. Now I am planning to read this code line by line and use the HTML tags to recognise the author name, date, content, and metadata. After identifying each part I need to detag it and then store it in the database.
Could anyone suggest a way to detag the HTML code line by line (it should work with Linux and Python)?
I'm really stuck! Any help would be really welcome!


You could use the HTMLParser class provided in Python's HTMLParser module. You can subclass it and create methods that parse through the tags to find certain tags and links and whatnot. That might help.
If that doesn't work, you can also use the re module that ships with Python to search for patterns. Those are the simpler methods. It seems like re would be your best option.
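For example, here is a minimal HTMLParser sketch (the class name and sample markup are made up for illustration) that collects the text inside <a> tags, which is roughly what pulling out an author name looks like:

import urllib2
from HTMLParser import HTMLParser

class LinkTextParser(HTMLParser):
    """Collects the text found inside <a> tags."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_link = False
        self.link_text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.link_text.append(data)

parser = LinkTextParser()
parser.feed('<p>Posted by <b><a href="/someblog">Some Author</a></b></p>')
print parser.link_text    # ['Some Author']

The re route is a one-liner by comparison, something like re.sub(r'<[^>]*>', '', line), though it can misbehave on tags that span more than one line.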

hey people,
thanks so much for the advice, that really helped!! :D :D

but I have a new problem. As I already mentioned, my task is that I have been given a bunch of websites containing blog posts, and I have to enter the author name, date, content, and comments of each blog into a database. To do so I looked at the HTML code of the websites and tried writing a Python script to identify the different components. That kind of works, except that we have to feed in each URL individually and sometimes modify the script for different webpages, so it takes an insane amount of time and my dataset is really huge. Here's what my code looks like:
import urllib2
import re
import gadfly

# Strips every tag except <a ...> and </a>.
TAG = re.compile(r"<(?!(?:a\s|/a))[^>]*>")

def detag(line):
    # Remove markup, HTML entities, and the stray bytes the pages contain.
    line = TAG.sub("", line)
    for junk in ("&nbsp;", "\r", "\n", "\t", "\xc2", "\xa0", "\xe2",
                 "\x80", "\x9d", "\x9c", "\xa6", "\x99", "\xab", "\\"):
        line = line.replace(junk, "")
    return line.strip()

url = urllib2.urlopen('http://blogs.myspace.com/index.cfm?fuseaction=blog.view&friendId=78693621&blogId=493873344')

# Author: the first line holding a bold linked name.
author1 = ''
for line in url:
    if '<b><a href=' in line and '</a></b><br />' in line:
        author1 = detag(line)
        break

# Date: the first &nbsp; line after the blogTimeStamp marker.
date1 = ''
flag = 1
for line in url:
    if '"blogTimeStamp"' in line:
        flag = 0
        continue
    if flag == 0 and '&nbsp' in line:
        date1 = detag(line)
        break

# Post body: everything between the blogContent marker and the
# "blogger's current book/movie/music/games" comment.
data2 = ' '
flag = 1
for line in url:
    if '"blogContent"' in line and 'BlogBody_' in line:
        flag = 0
        continue
    if "<!--- blogger's current book/movie/music/games ---> " in line:
        break
    if flag == 0:
        data2 = data2 + " " + detag(line)

# Metadata: the lines that close both a label and a div.
mdata1 = ''
for line in url:
    if '</label>' in line and '</div>' in line:
        mdata1 = detag(line)
        break

mdata2 = ''
for line in url:
    if '</label>' in line and '</div>' in line and '&nbsp;' in line:
        mdata2 = detag(line)
        break

# db_Id is hard-coded per blog for now.
connection = gadfly.gadfly("/media/disk/mydatabase", "/media/disk/mydir")
cursor = connection.cursor()
sql = "insert into mydatabase(db_Id,author,date,data,mdata) values (?,?,?,?,?)"
cursor.execute(sql, ('168', author1, date1, data2, mdata1 + mdata2))
cursor.execute("select * from mydatabase")
for x in cursor.fetchall():
    print x
connection.commit()

that's for the author name, date, content, and metadata. The code is written for blogs.myspace.com, and for the comments database our code looks like this:
import urllib2
import re
import gadfly

# Strips every tag except <a ...> and </a>.
TAG = re.compile(r"<(?!(?:a\s|/a))[^>]*>")

def detag(line):
    # Remove markup, entities, and the stray bytes the pages contain.
    line = TAG.sub("", line)
    for junk in ("&nbsp;", "&#39;", "[Reply to this]", "\r", "\n", "\t",
                 "\xc2", "\xa0", "\xe2", "\x80", "\x9d", "\x9c", "\xa6",
                 "\x99", "\xab", "\xa7", "\x84", "\xa2", "'"):
        line = line.replace(junk, "")
    return line.strip()

connection = gadfly.gadfly("/media/disk/commtable", "/media/disk/mydir")
cursor = connection.cursor()

commname = ""
flag = 1
n = 160    # comment ids continue from 160
for comment in urllib2.urlopen('http://blogs.myspace.com/index.cfm?fuseaction=blog.view&friendId=78693621&blogId=493873344'):
    if 'CommentDiv_' in comment:
        # A new comment block starts here.
        flag = 0
        continue
    if flag == 0 and ('<br>' in comment or '&nbsp;' in comment):
        # Accumulate the comment text line by line.
        commname = commname + detag(comment)
        continue
    if '</div>' in comment:
        continue
    if 'Posted by' in comment:
        # End of one comment: record it and reset the buffer.
        commauthor1 = detag(comment)
        n = n + 1
        sql = "insert into commtable(db_Id,comm_Id,comment,commauthor) values (?,?,?,?)"
        cursor.execute(sql, ('168', n, commname, commauthor1))
        commname = ""

cursor.execute("select * from commtable")
for x in cursor.fetchall():
    print x
connection.commit()
I would really appreciate some help here!!

Thanks!!
Shraddha
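One thing worth checking before those inserts run: gadfly cannot insert into a table that has not been created yet, so each database needs a one-time setup step roughly like the sketch below (the column types are guesses based on the insert statements above, not something from the original post):

import gadfly

# One-time setup: create the database directory contents and the table.
connection = gadfly.gadfly()
connection.startup("mydatabase", "/media/disk/mydir")
cursor = connection.cursor()
cursor.execute("create table mydatabase "
               "(db_Id varchar, author varchar, date varchar, "
               "data varchar, mdata varchar)")
connection.commit()

After that, subsequent runs can connect with gadfly.gadfly("mydatabase", "/media/disk/mydir") as the scripts above do.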

1) Use code tags when posting code in this forum or most people will simply ignore your post. Use them like so:

[code=python] # Code inside here

[/code]

2) What exactly is the problem? All you said is:

we have to feed in each URL individually and sometimes modify the script for different webpages, so it takes an insane amount of time and my dataset is really huge...

So what's the problem? Modifying URLs? The large dataset? Please clarify.

hey,
well, the problem is both: since the dataset is huge, feeding in each URL takes a lot of time, and I have to finish this task soon.

Would you know a simpler way to do it?

That would really be a big, big help!!

Thanks
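One way to take the manual URL-feeding out of the picture is to keep one URL per line in a text file and loop over it, with the per-page parsing wrapped in a function. A minimal sketch (extract_post and 'urls.txt' are made-up names for illustration):

import urllib2

def extract_post(page):
    # the author/date/content/metadata parsing you already have goes here
    pass

for url in open('urls.txt'):
    url = url.strip()    # drop the trailing newline
    if url:
        extract_post(urllib2.urlopen(url))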

hey people,
I have a new problem. I'm trying to read the list of URLs from a file and use urllib2 to open the websites. I have used the following code to do so:

import gopherlib
import urllib2
import re

import os
f=open('list2','r')
fan=open('list','a')
for url in f:
    #print url
    url=str(url)
    for line in urllib2.urlopen("url"):
        line=str(line)
        if '</a>&nbsp;</div>' in line and '<div class="cmtcell"' in line and "<a href" in line:
            line=line.replace('<div class="cmtcell">',"")
            line=line.replace("<a href=","")
            line=line.replace("\\'","")
            in1=line.find('>')
            in2=line.find('<')
            x=line[in1:in2+1]
            line=line.replace(x,"")
            line=line.replace("/a>&nbsp;</div>","")
            line=line.replace("\\n","")
            line=line.replace("\\r","")
            line=line.replace("'","")
            line=line.replace(" ","")
            fan.write(line)

and the error being reported is:
Traceback (most recent call last):
File "data.py", line 11, in <module>
for line in urlopen("url"):
File "/usr/lib/python2.5/urllib2.py", line 135, in urlopen
return _opener.open(url, data)
File "/usr/lib/python2.5/urllib2.py", line 316, in open
type_ = req.get_type()
File "/usr/lib/python2.5/urllib2.py", line 220, in get_type
assert self.type is not None, self.__original
AssertionError: url

Could you look at it and tell me the solution to the problem??
I would be really grateful!
Shraddha
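The traceback tells you exactly what went wrong: urlopen("url") is being handed the three-character string "url" (quotes included) rather than the url variable, and since that string has no http:// scheme, urllib2 fails with AssertionError: url. Also note that each line read from a file keeps its trailing newline, which would break the request even with the quotes removed. A minimal corrected loop, assuming the same 'list2' file with one URL per line:

import urllib2

f = open('list2', 'r')
for url in f:
    url = url.strip()    # strip the trailing newline from each line
    if not url:
        continue
    # pass the variable itself, not the literal string "url"
    for line in urllib2.urlopen(url):
        print line.rstrip()
f.close()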
