HTMLParser is avoiding some characters that are data

Question

Huakalero 0 Newbie Poster

14 Years Ago

Hi, I'm working on an unofficial app to get the movie listings from a web page, http://cinepolis.com.mx. To get the correct listings, the user must first select their city.

Using HTMLParser I was able to get the list of cities, but because some of them contain non-English characters such as ñ or á é í ó ú, HTMLParser skips those characters: it returns the word Cancún as 'Canc' and 'n', when it should give me 'Cancún' as a single word.

It wouldn't matter if it returned the word in pieces, since I could just join them, but it drops the accented characters and never passes them to me.

Here is the code:

'''
Created on Jun 14, 2011

@author: augusto
'''

from HTMLParser import HTMLParser
from urllib2 import urlopen

class Spider(HTMLParser):
  def __init__(self, url):
    self.this_is_the_tag = False
    self.end_of_city = False
    self.this_city = ""
    self.cities = []
    HTMLParser.__init__(self)
    req = urlopen(url)
    self.feed(req.read())
    
    
  def checkAttr(self, dic, attr, value):
    for pair in dic:
      if pair[0] == attr and pair[1] == value:
        return True
    return False
  
  def handle_starttag(self, tag, attrs):
    if tag == 'select' and self.checkAttr(attrs, 'id', 'ctl00_ddlCiudad'):
      print "Found div => "
      print self.get_starttag_text()
      self.this_is_the_tag = True
    if self.this_is_the_tag and tag == 'option':
      print "Found option value = ",  attrs[-1][1]
  
  def handle_endtag(self, tag):
    if tag == 'select' and self.this_is_the_tag:
      print "End of div => "
      self.this_is_the_tag = False
      print self.cities
    if tag == 'option':
      self.end_of_city = True
              
  def handle_data(self, data):
    if self.this_is_the_tag and not self.end_of_city:
      print self.get_starttag_text()
      print "-%s-" % data
      self.this_city += data
    elif self.end_of_city:
      self.cities.append(self.this_city)
      self.this_city = ""
      self.end_of_city = False  

Spider('http://cinepolis.com.mx/index.aspx')

This is the output I get (Cancn should be Cancún, Cd. Cuauhtmoc should be Cd. Cuauhtémoc, and so on):

'Cancn', 'Cd. Acua', 'Cd. Cuauhtmoc ', 'Cd. Jurez', 'Cd. Obregn', 'Cd. Victoria', 'Celaya', 'Chetumal', 'Chihuahua', 'Chilpancingo', 'Coatzacoalcos', 'Colima', 'Comitn', 'Cozumel', 'Cuautla', 'Cuernavaca', 'Culiacn', 'D.F. y A.M. (Centro)', 'D.F. y A.M. (Norte)', 'D.F. y A.M. (Oriente)', 'D.F. y A.M. (Poniente)', 'D.F. y A.M. (Sur)', 'Durango', 'Ensenada', 'Guadalajara', 'Hermosillo', 'Hidalgo del Parral', 'Iguala', 'Irapuato', 'La Paz', 'Len', 'Manzanillo', 'Matamoros', 'Mrida', 'Mexicali', 'Minatitln', 'Monterrey', 'Morelia', 'Nogales', 'Nuevo Laredo', 'Oaxaca', 'Orizaba', 'Pachuca', 'Playa del Carmen', 'Puebla', 'Puerto Vallarta', 'Quertaro', 'Reynosa', 'Rosarito', 'Salamanca', 'Saltillo', 'San Cristbal de las C', 'San Jos del Cabo', 'San L Ro Colorado', 'San Luis Potos', 'Tampico', 'Tapachula', 'Taxco', 'Tecate', 'Tehuacn', 'Tepeji del Ro', 'Tijuana', 'Tlaxcala', 'Toluca', 'Torren', 'Tuxpan', 'Tuxtla Gutirrez', 'Uriangato', 'Uruapan', 'Veracruz', 'Villahermosa', 'Xalapa', 'Zamora', 'Ciudad']

html-css python


All 11 Replies

Gribouillis 1,391 Programming Explorer

14 Years Ago

You can add this method to handle accented characters

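  # Add this method to the Spider class.
  def handle_charref(self, name):
    # name is the text between "&#" and ";", e.g. "250" for ú
    if self.this_is_the_tag and not self.end_of_city:
      print "charref", repr(name)
      # hexadecimal references such as &#xFA; arrive as "xFA"
      code = int(name[1:], 16) if name[0] in 'xX' else int(name)
      self.this_city += unichr(code)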

There is still a problem with your web page: HTMLParser exits with an error (after finding the cities). You may want to use a parser that tolerates invalid HTML, such as BeautifulSoup (or lxml + BeautifulSoup).
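For readers on Python 3, where the module is named html.parser, the same idea can be sketched as follows. This is only an illustration under stated assumptions: the names CitySpider, parse_cities and SAMPLE are hypothetical, the sample markup is made up to resemble the page's select element, and convert_charrefs=False is passed so the parser reports character references separately, as the Python 2 parser in this thread does.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint


class CitySpider(HTMLParser):
    """Collect the text of every <option> inside one <select> (hypothetical helper)."""

    def __init__(self, select_id):
        # convert_charrefs=False makes the parser call handle_charref and
        # handle_entityref instead of decoding references itself.
        super().__init__(convert_charrefs=False)
        self.select_id = select_id
        self.in_select = False
        self.in_option = False
        self.current = []   # pieces of the city name being assembled
        self.cities = []

    def handle_starttag(self, tag, attrs):
        if tag == "select" and dict(attrs).get("id") == self.select_id:
            self.in_select = True
        elif tag == "option" and self.in_select:
            self.in_option = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == "option" and self.in_option:
            self.cities.append("".join(self.current).strip())
            self.in_option = False
        elif tag == "select":
            self.in_select = False

    def handle_data(self, data):
        # Plain text arrives here in pieces, split around each reference.
        if self.in_option:
            self.current.append(data)

    def handle_charref(self, name):
        # Numeric references: "250" for &#250; or "xFA" for &#xFA;
        if self.in_option:
            code = int(name[1:], 16) if name[0] in "xX" else int(name)
            self.current.append(chr(code))

    def handle_entityref(self, name):
        # Named references such as &ntilde;; unknown names are kept verbatim.
        if self.in_option:
            code = name2codepoint.get(name)
            self.current.append(chr(code) if code is not None else "&%s;" % name)


def parse_cities(html, select_id="ctl00_ddlCiudad"):
    spider = CitySpider(select_id)
    spider.feed(html)
    spider.close()
    return spider.cities


# Made-up sample markup, not copied from the real page.
SAMPLE = (
    '<select id="ctl00_ddlCiudad">'
    '<option value="1">Canc&#250;n</option>'
    '<option value="2">Cd. Acu&ntilde;a</option>'
    '<option value="3">M&#xE9;rida</option>'
    '</select>'
)

print(parse_cities(SAMPLE))
```

Collecting the pieces in a list and joining them at the closing option tag keeps each city name whole, whether a piece came from handle_data or from one of the reference handlers.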

Gribouillis 1,391 Programming Explorer

14 Years Ago

Quoting Huakalero:

Hi, thanks, that really works. I'm glad my code was easy to understand.
The first thing I tried was BeautifulSoup, since I read it was easier, but when I run it I get: "HTMLParseError: malformed start tag, at line 1475, column 15"
I know HTMLParser is also giving me errors, but at least I can still work with the document, while with BS I cannot. Maybe this has something to do with the fact that I am using Python 2.6?

The malformed start tag is:

<area shape="rect" coords="324, 28, 445, 41" href="mailto:contacto@cinepolis.com" 

        target"_blank" />

There is a missing = sign between target and "_blank" in the tag.

I was able to parse the file with BeautifulSoup without errors:

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://cinepolis.com.mx/index.aspx'
req = urlopen(url)
data = req.read()
soup = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
stuff = soup.findAll(id="ctl00_ddlCiudad")[0]
opts = stuff.findAll("option")
cities = [x.contents[0] for x in opts]
for x in cities:
    print x

"""  my output -->
Acapulco
Aguascalientes
Cabo San Lucas
Cancún
Cd. Acuña
Cd. Cuauhtémoc 
Cd. Juárez
Cd. Obregón
Cd. Victoria
Celaya
Chetumal
Chihuahua
Chilpancingo
Coatzacoalcos
etc
"""

Gribouillis 1,391 Programming Explorer

14 Years Ago

Quoting Huakalero:

Hi, like I said, this is an unofficial API, which is why I can't correct the web page.
When I used Beautiful Soup, I followed the documentation, and it only says that the class takes one parameter, which is the HTML. What are you using this second parameter for? Is there a web page with full documentation on BS?
Thanks for your help.

The documentation is here: http://www.crummy.com/software/BeautifulSoup/documentation.html#Entity%20Conversion. Otherwise, here is the documentation for the initializer, obtained with pydoc:

 |  __init__(self, markup='', parseOnlyThese=None, fromEncoding=None, markupMassage=True, smartQuotesTo='xml', convertEntities=None, selfClosingTags=None, isHTML=False)
 |      The Soup object is initialized as the 'root tag', and the
 |      provided markup (which can be a string or a file-like object)
 |      is fed into the underlying parser.
 |
 |      sgmllib will process most bad HTML, and the BeautifulSoup
 |      class has some tricks for dealing with some HTML that kills
 |      sgmllib, but Beautiful Soup can nonetheless choke or lose data
 |      if your data uses self-closing tags or declarations
 |      incorrectly.
 |
 |      By default, Beautiful Soup uses regexes to sanitize input,
 |      avoiding the vast majority of these problems. If the problems
 |      don't apply to you, pass in False for markupMassage, and
 |      you'll get better performance.
 |
 |      The default parser massage techniques fix the two most common
 |      instances of invalid HTML that choke sgmllib:
 |
 |       <br/> (No space between name of closing tag and tag close)
 |       <! --Comment--> (Extraneous whitespace in declaration)
 |
 |      You can pass in a custom list of (RE object, replace method)
 |      tuples to get Beautiful Soup to scrub your input the way you
 |      want.
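What entity conversion means can be illustrated without Beautiful Soup. The following is a minimal Python 3 sketch using the standard library's html.unescape, not the Beautiful Soup API; the helper name decode_entities and the sample strings are assumptions made up for illustration.

```python
import html


def decode_entities(text):
    """Replace named and numeric character references with the characters they stand for."""
    return html.unescape(text)


# Made-up option texts, written the way accented names can appear in raw HTML.
raw_options = ["Canc&#250;n", "Cd. Acu&ntilde;a", "Cd. Cuauht&eacute;moc"]
cities = [decode_entities(name) for name in raw_options]
print(cities)
```

Without this step the references stay in the text as literal "&#250;" sequences, or are reported separately from the surrounding data, which is what splits 'Cancún' into 'Canc' and 'n' in the original question.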


Huakalero 0 Newbie Poster · Answer 1 · 2011-06-16T08:59:29+00:00

Hi, thanks, that really works. I'm glad my code was easy to understand.

The first thing I tried was BeautifulSoup, since I read it was easier, but when I run it I get: "HTMLParseError: malformed start tag, at line 1475, column 15"

I know HTMLParser is also giving me errors, but at least I can still work with the document, while with BS I cannot. Maybe this has something to do with the fact that I am using Python 2.6?

Huakalero 0 Newbie Poster · Answer 2 · 2011-06-16T20:54:31+00:00

Hi, like I said, this is an unofficial API, which is why I can't correct the web page.
When I used Beautiful Soup, I followed the documentation, and it only says that the class takes one parameter, which is the HTML. What are you using this second parameter for? Is there a web page with full documentation on BS?

Thanks for your help.

Huakalero 0 Newbie Poster · Answer 3 · 2011-06-16T22:33:54+00:00

Thank you for your help. I didn't read the whole documentation; I just switched modules the first time I ran into a problem with BS.

Thank you again, I'll mark this as solved.

Huakalero 0 Newbie Poster · Answer 4 · 2011-06-16T22:43:47+00:00

Oops, it's me again. I just ran the code you posted and it still gives an error:

Traceback (most recent call last):
  File "soup.py", line 7, in <module>
    soup = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 1477, column 15

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 5 · 2011-06-16T23:06:43+00:00

This is strange; it works well for me (Python 2.6.5 on Linux with BeautifulSoup 3.0.8). Perhaps you could try cleaning up the markup with a markupMassage argument:

import re
def repl_func(mo):
    print "occurrence found !"
    return 'target="_blank"'
myMassage = [(re.compile(r'target\"_blank\"'), repl_func)]
soup = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES, markupMassage=myMassage)
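The substitution itself can be checked in isolation before handing the markup to any parser. Below is a minimal Python 3 sketch using only the re module; the names BROKEN_TARGET and repair_target are hypothetical, and the sample string is a shortened version of the malformed tag quoted earlier in the thread.

```python
import re

# Matches the attribute that is missing its "=" sign: target"_blank"
BROKEN_TARGET = re.compile(r'target"_blank"')


def repair_target(markup):
    """Return (fixed_markup, number_of_repairs) for the missing '=' in target"_blank"."""
    return BROKEN_TARGET.subn('target="_blank"', markup)


# Shortened sample based on the malformed <area> tag from the page.
sample = '<area shape="rect" href="mailto:contacto@cinepolis.com"\n        target"_blank" />'
fixed, count = repair_target(sample)
print(count)
print(fixed)
```

Using subn rather than sub returns the number of replacements, which plays the same role as the "occurrence found !" message in repl_func: it confirms that the pattern actually matched something in the downloaded page.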

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 6 · 2011-06-17T00:46:39+00:00

Gribouillis' code also worked for me on Windows XP with Python 2.7.1. I ran easy_install and got BeautifulSoup version 3.2.0.

Huakalero 0 Newbie Poster · Answer 7 · 2011-06-17T00:47:17+00:00

That is the problem: I am using version 3.1 of BeautifulSoup. Again, I should have read the whole documentation. Thanks for your help.

I am using Trisquel, and BS v3.1 is installed by default. I guess I'll have to use this little massage to get the app running on my system and on other similar systems as well.

Thanks for your help.

Huakalero 0 Newbie Poster · Answer 8 · 2011-06-17T00:50:23+00:00

The code is good; the problem is that my Trisquel has BS 3.1 installed by default. The massage approach solved the problem. I will read the whole BS documentation.

Thanks for the help. :D
