Parse HTML to get text from webpages


pocnib

Dec 11th, 2008
I need to write a Python program which parses webpages and returns a dictionary of unique words and their frequencies. What I came up with was:

#!/usr/bin/env python
from HTMLParser import HTMLParser, HTMLParseError
import urllib
import urlparse
import sys
import os
import MySQLdb
import re

class WordHarvester:
    def __init__(self):
        self.db = MySQLdb.connect(host="my.host.dom", user="user", passwd="passwd",db="db")
        self.dbc = self.db.cursor()
        self.mhp = MyHTMLParser()

    def run(self,URL,limit=10):
        print "Running URL:"
        print URL
        print " "
        try:
            sock = urllib.urlopen(URL)
            h = sock.read()
            self.mhp.feed(h)
        except:
            print "Error reading",URL
        i=0
        ur = URL
        while i<limit:
            sock.close()
            local_list = self.mhp.urls
            self.mhp.urls=[]
            for u in local_list:
                try:
                    split_url= urlparse.urlsplit(urlparse.urljoin(ur,u))
                    if split_url.scheme == "http":
                        u = urlparse.urlunsplit((split_url.scheme,split_url.netloc,split_url.path,"",""))
                    else:
                        continue
                    print u
                    sock = urllib.urlopen(u)
                    h = sock.read()
                    self.mhp.feed(h)
                except:
                    print "Error reading",u
            i=i+1
        print "Your word counts:"
        for word in self.mhp.word_dictionary.keys():
            print word,": ",self.mhp.word_dictionary[word]


class MyHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.urls=[]
        self.word_dictionary = {}

    def handle_starttag(self,tag,attrs):
        if tag=='script':
            self.inscript=True
        if tag=='body':
            self.inbody=True
        if tag=='a' and attrs:
            #if attrs[0][1][:4]=='http':
            self.urls.append(attrs[0][1])

    def handle_endtag(self, tag):
        if tag=='script':
            self.inscript=False
        if tag=='body':
            self.inbody=False

    def handle_data(self,data):
        l = []
        m = []
        n = []

        for s in data:
            print s
            l = s.split()
            for j in l:
                m.append(j)
        for h in m:
            h = h.strip(' ,.?!-')
            h = h.lower()
            n.append(h)
        n.sort()
        for i in n:
            if i.isalpha():
                if i in self.word_dictionary:
                    self.word_dictionary[i]=self.word_dictionary[i]+1
                else:
                    self.word_dictionary[i]=1


if __name__ == "__main__":
    print "Testing Word Harvester"
    harvester = WordHarvester()
    url = raw_input("Starting URL: ")
    harvester.run(URL=url)

This appears to read all the web pages it is supposed to, but handle_data only adds individual letters to the dictionary, not whole words as it should. Also, after about the first ten web pages, handle_data (and presumably the other handle_* methods of HTMLParser) is no longer called. The print statement I added to handle_data only fires for the first few pages, and the text it prints is not the entire web page, usually only a couple of words from the beginning of the page. I have very minimal knowledge of Python, and this was all I could come up with from scratch.
woooee

Dec 11th, 2008
First, if this is just a one-time thing for you, you can use Links to download and save the page as text only: http://www.jikos.cz/~mikulas/links/download/binaries/ Also, I assume you know about BeautifulSoup and that it is more than you want. To answer your questions:
1. It stops after 10 iterations (as the saying goes, this is too coincidental to be a coincidence):
    def run(self,URL,limit=10):
    ## and
    while i<limit:
2. Single letters are in the dictionary. It is difficult to tell why, but some print statements will shed some light. I'm thinking that you want to use j, since it is each word from the split(), but the prints should clarify that.
    def handle_data(self,data):
        l = []
        m = []
        n = []
        for s in data:
            print "s in data", s
            l = s.split()
            ## fixed indentation problem
            for j in l:
                print " j in l", j
                m.append(j)
        ## wouldn't h be the same as j, so m=l
        print "\n'l' =", l
        print "'m' =", m
        for h in m:
            print " h in m", h
            h = h.strip(' ,.?!-')
            h = h.lower()
            n.append(h)
        ## if "s" is one record from data, then indentation problem (fixed)
        n.sort()
        for i in n:
            if i.isalpha():
                if i in self.word_dictionary:
                    self.word_dictionary[i]=self.word_dictionary[i]+1
                else:
                    self.word_dictionary[i]=1
        ##
        ##--------This might be easier to understand ------------------------
        ## but I'm not sure what your data looks like
        for rec in data:
            word_list= rec.split()
            for word in word_list:
                word = word.strip(' ,.?!-')
                word = word.lower()
                if word in self.word_dictionary:
                    self.word_dictionary[word] += 1
                else:
                    if word.isalpha():
                        self.word_dictionary[word] = 1
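
The two points above can be put together in a minimal sketch. It is not a definitive fix, and it makes several assumptions: it targets Python 3 (html.parser and urllib.parse rather than the Python 2 modules used in this thread), it leaves out the MySQL connection, and the names WordCounter, fetch_http and crawl are invented here for illustration. The parser hands handle_data each run of text as a single string, so looping over the string itself yields one character at a time; splitting the string yields whole words. The sketch also looks up href by name instead of taking attrs[0], makes the page limit explicit, and uses a fresh parser per page so that one malformed page cannot leave state behind for the next, which is a design choice of this sketch and not something the thread confirms as the cause of the missing handle_data calls.

```python
from collections import Counter, deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit
from urllib.request import urlopen


class WordCounter(HTMLParser):
    """Counts words in the text of one page and collects its link targets."""

    def __init__(self):
        super().__init__()
        self.words = Counter()
        self.urls = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "a":
            # Look href up by name; it is not always the first attribute.
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        # data is one string: split() gives whole words, whereas
        # "for s in data" would give single characters.
        for word in data.split():
            word = word.strip(" ,.?!-").lower()
            if word.isalpha():
                self.words[word] += 1


def fetch_http(url):
    """Download one page as text (hypothetical helper for this sketch)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_http, limit=10):
    """Count words on at most `limit` pages, following links breadth-first."""
    totals = Counter()
    queue = deque([start_url])
    seen = {start_url}
    pages_read = 0
    while queue and pages_read < limit:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:  # urllib's URLError is a subclass of OSError
            print("Error reading", url)
            continue
        pages_read += 1
        parser = WordCounter()  # fresh parser for every page
        parser.feed(html)
        parser.close()
        totals.update(parser.words)
        for href in parser.urls:
            try:
                parts = urlsplit(urljoin(url, href))
            except ValueError:  # malformed link, skip it
                continue
            if parts.scheme not in ("http", "https"):
                continue
            # Drop query string and fragment, as the original code does.
            link = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return totals
```

Calling crawl("http://some.host/", limit=50) would then read up to 50 pages, where the URL is only a placeholder. Passing a different fetch callable makes it possible to try the counting logic on canned HTML without touching the network.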

©2003 - 2009 DaniWeb® LLC