Dec 11th, 2008

Parse HTML to get text from webpages

I need to write a Python program that parses web pages and returns a dictionary of unique words and their frequencies. What I came up with was:
#!/usr/bin/env python
from HTMLParser import HTMLParser, HTMLParseError
import urllib
import urlparse
import sys
import os
import MySQLdb
import re

class WordHarvester:
    def __init__(self):
        self.db = MySQLdb.connect(host="my.host.dom", user="user", passwd="passwd",db="db")
        self.dbc = self.db.cursor()
        self.mhp = MyHTMLParser()

    def run(self,URL,limit=10):
        print "Running URL:"
        print URL
        print " "
        try:
            sock = urllib.urlopen(URL)
            h = sock.read()
            self.mhp.feed(h)
        except:
            print "Error reading",URL
        i=0
        ur = URL
        while i<limit:
            sock.close()
            local_list = self.mhp.urls
            self.mhp.urls=[]
            for u in local_list:
                try:
                    split_url= urlparse.urlsplit(urlparse.urljoin(ur,u))
                    if split_url.scheme == "http":
                        u = urlparse.urlunsplit((split_url.scheme,split_url.netloc,split_url.path,"",""))
                    else:
                        continue
                    print u
                    sock = urllib.urlopen(u)
                    h = sock.read()
                    self.mhp.feed(h)
                except:
                    print "Error reading",u
            i=i+1
        print "Your word counts:"
        for word in self.mhp.word_dictionary.keys():
            print word,": ",self.mhp.word_dictionary[word]



class MyHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.urls=[]
        self.word_dictionary = {}

    def handle_starttag(self,tag,attrs):
        if tag=='script':
            self.inscript=True
        if tag=='body':
            self.inbody=True
        if tag=='a' and attrs:
            #if attrs[0][1][:4]=='http':
            self.urls.append(attrs[0][1])

    def handle_endtag(self, tag):
        if tag=='script':
            self.inscript=False
        if tag=='body':
            self.inbody=False

    def handle_data(self,data):
        l = []
        m = []
        n = []

        for s in data:
            print s
            l = s.split()
        for j in l:
            m.append(j)
        for h in m:
            h = h.strip(' ,.?!-')
            h = h.lower()
            n.append(h)
        n.sort()
        for i in n:
            if i.isalpha():
                if i in self.word_dictionary:
                    self.word_dictionary[i]=self.word_dictionary[i]+1
                else:
                    self.word_dictionary[i]=1


if __name__ == "__main__":
    print "Testing Word Harvester"
    harvester = WordHarvester()
    url = raw_input("Starting URL: ")
    harvester.run(URL=url)

This appears to read all the web pages it is supposed to, but handle_data only adds individual letters to the dictionary, not whole words as it should. Also, after about the first ten web pages, handle_data (and presumably the other handle_* methods of HTMLParser) is no longer called. The print statement I added to handle_data only fires for the first few pages, and the text it prints is not the entire web page, usually only a couple of words from the beginning. I have very minimal knowledge of Python, and this was all I could come up with from scratch.
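One likely reason for the single letters, sketched here under stated assumptions: handle_data is passed a string, and iterating over a string yields one character at a time, so a loop like `for s in data` never sees a whole word. The helper names `count_by_character` and `count_by_word` are hypothetical, and the sketch is written for Python 3 rather than the Python 2 used in the post.

```python
def count_by_character(data, counts):
    # Mirrors the shape of the posted loop: iterating over a str
    # yields one character at a time, so only single letters are counted.
    for s in data:
        for word in s.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def count_by_word(data, counts):
    # Splitting the whole string first yields whole words.
    for word in data.split():
        word = word.strip(" ,.?!-").lower()
        if word.isalpha():
            counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == "__main__":
    print(count_by_character("to be or not to be", {}))
    print(count_by_word("To be, or not to be?", {}))
```

The second helper keeps the same stripping and isalpha() filtering as the post, so only the iteration changes.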
pocnib
Dec 11th, 2008

Re: Parse HTML to get text from webpages

First, if this is just a one-time thing for you, you can use Links to download and save the page as text only: http://www.jikos.cz/~mikulas/links/download/binaries/ Also, I assume you know about BeautifulSoup and that it is more than you want. To answer your questions:
1. Stops after 10 iterations (as the saying goes, this is too coincidental to be a coincidence)
    def run(self,URL,limit=10):
        ## and
        while i<limit:
2. Single letters are in the dictionary. It is difficult to tell why, but some print statements will shed some light. I'm thinking that you want to use j, since it is each word from the split(), but the prints should clarify that.
def handle_data(self,data):
    l = []
    m = []
    n = []
    for s in data:
        print "s in data", s
        l = s.split()
        ## fixed indentation problem
        for j in l:
            print " j in l", j
            m.append(j)
    ## wouldn't h be the same as j, so m=l
    print "\n'l' =", l
    print "'m' =", m
    for h in m:
        print " h in m", h
        h = h.strip(' ,.?!-')
        h = h.lower()
        n.append(h)
    ## if "s" is one record from data, then indentation problem (fixed)
    n.sort()
    for i in n:
        if i.isalpha():
            if i in self.word_dictionary:
                self.word_dictionary[i]=self.word_dictionary[i]+1
            else:
                self.word_dictionary[i]=1
    ##
    ##--------This might be easier to understand ------------------------
    ## but I'm not sure what your data looks like
    for rec in data:
        word_list= rec.split()
        for word in word_list:
            word = word.strip(' ,.?!-')
            word = word.lower()
            if word in self.word_dictionary:
                self.word_dictionary[word] += 1
            else:
                if word.isalpha():
                    self.word_dictionary[word] = 1
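For what it's worth, here is a minimal Python 3 sketch of the same idea, not a confirmed fix for the posted program. It assumes that data arrives as one string per call (so it is split directly), and it guesses that reusing a single parser instance for every page, without close() or reset(), may be why the handlers go quiet partway through the crawl; the thread does not confirm that. The names `WordCounter` and `count_words` are hypothetical, and `html.parser` is the Python 3 name of the module the post imports as `HTMLParser`.

```python
from collections import Counter
from html.parser import HTMLParser


class WordCounter(HTMLParser):
    """Collects link targets and word counts from one HTML document."""

    SKIPPED = ("script", "style")  # text inside these tags is not page prose

    def __init__(self):
        super().__init__()
        self.urls = []
        self.counts = Counter()
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "a":
            # Look href up by name instead of assuming it is the first attribute.
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        # data is a str, so split it directly rather than looping over it.
        for word in data.split():
            word = word.strip(" ,.?!-").lower()
            if word.isalpha():
                self.counts[word] += 1


def count_words(pages):
    """Parse each HTML string with its own parser and merge the results."""
    total = Counter()
    urls = []
    for html in pages:
        parser = WordCounter()  # fresh state for every page (assumed remedy)
        parser.feed(html)
        parser.close()          # process any text still held in the buffer
        total.update(parser.counts)
        urls.extend(parser.urls)
    return total, urls
```

Fetching is left out on purpose: `count_words` takes HTML strings that have already been downloaded, so the counting can be tried on a small literal page before it is wired into a crawler.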
woooee
