Python and HTML DOM

Question

nunos 0 Light Poster

14 Years Ago

Hi all.

I have been doing some HTML scraping and today a fellow coder told me about this thing 'DOM' and how much easier it is to use it rather than manipulating strings as I have been doing. He doesn't know much about it, and that's why I came here asking for your help.

I have been googling it and have seen some example code, which immediately drew me in. However, I haven't been able to install a library that allows me to do it. One of them, libxml2dom, looked great, but I couldn't get it installed on my machine. My OS is Windows.

I have noticed that most of the libraries are focused towards XML, but will they work with HTML, too?

Do you use DOM for your HTML scraping? Which library do you use?

Thanks.

python


Gribouillis 1,391 Programming Explorer

14 Years Ago

You could go this way (assuming that you have some information about where the date is):

def iter_tag(element, tag):
  "iterates over all subelements with a given tag"
  # note: element.iter(tag) does the same filtering in a single call
  for subelement in element.iter():
    if subelement.tag == tag:
      yield subelement

# html is the tree returned by etree.HTML(...) in the lxml example
tables = list(iter_tag(html, "table")) # <--- list of all <table> elements in the html page

td = tables[0][0][1] # <--- the second cell in the first row of the first table (assuming the date is here)
print(td[0].text) # <--- the text of the first child of the cell (the <font> node)
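
For readers without lxml installed, the same positional lookup can be sketched with the standard library's xml.etree.ElementTree, which exposes the same ElementTree interface. This is only a sketch under assumptions: the stdlib parser is swapped in for lxml, it is a strict XML parser, and it works here only because the sample page happens to be well-formed. PAGE is a trimmed copy of teste.htm, and the names date_by_index and date_by_path are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of teste.htm; it happens to be well-formed XML,
# which is the only reason a strict XML parser accepts it.
PAGE = """<html>
  <body bgcolor='FFFFF'>
    <table>
      <tr bgcolor="#EEEEEE">
        <td nowrap="nowrap">
          <font size="2" face="Tahoma, Arial"> <a name="1375048"></a> </font>
        </td>
        <td nowrap="nowrap">
          <font size="-2" face="Verdana"> 8/15/2009</font>
        </td>
      </tr>
    </table>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# Positional access: first table, first row, second cell.
tables = list(root.iter("table"))
cell = tables[0][0][1]
date_by_index = cell[0].text.strip()  # cell[0] is the <font> element

# Path-based access: every <font> directly inside a <td>, anywhere in the tree.
fonts = root.findall(".//td/font")
date_by_path = fonts[-1].text.strip()

print(date_by_index, date_by_path)
```

Indexing an element only counts child elements, so the whitespace between tags does not shift the positions. The path-based lookup is usually less brittle than chained indexes when the page layout changes slightly.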

nunos commented: this is exactly what I was looking for. Thanks! +1

r8lst 16 Newbie Poster

2 Years Ago

You can try domonic, which is a Python DOM: https://github.com/byteface/domonic/

rproffitt commented: 12 years seems a long time to get an answer. +16


Gribouillis 1,391 Programming Explorer Team Colleague · Answer 1 · 2009-08-19T04:50:31+00:00

HTML and XML are closely related (both descend from SGML), but HTML is not strictly a special case of XML: only XHTML is well-formed XML. In an XML file you can have arbitrary tags, like <nunos>hello</nunos>, and the same tree model covers specialisations of XML such as MathML. That's why most libraries are designed for XML. Real-world HTML is often not well-formed, so pick a library that also ships a forgiving HTML parser.

My preferred library to read, write and modify XML and HTML is lxml. It parses XML and HTML to produce a tree of nodes obeying the ElementTree interface. This tree can be transformed by your code to create new files.

nunos 0 Light Poster · Answer 2 · 2009-08-19T05:31:58+00:00

Thanks for your reply, Gribouillis. It's not the first time you have helped me here on DaniWeb, and I have only been here a few days. So thanks.

I was having some trouble with the installation, but I eventually managed it and now I am up and running with libxml2dom. I was able to easily do in 2 minutes what used to take me at least 20 minutes or even more. This library is awesome!

nunos 0 Light Poster · Answer 3 · 2009-08-19T09:08:09+00:00

I have just started learning how to work with DOM today and I need some extreme noob questions cleared up. Please bear with me.

The page I am working on (teste.htm):

<html>
  <head>
    <title>
      Title
    </title>
  </head>
  <body bgcolor = 'FFFFF'>
    <table>
      <tr bgcolor="#EEEEEE">
        <td nowrap="nowrap">
          <font size="2" face="Tahoma, Arial"> <a name="1375048"></a> </font>
        </td>
        <td nowrap="nowrap">
          <font size="-2" face="Verdana"> 8/15/2009</font>
        </td>
      </tr>
    </table>
  </body>
</html>

import libxml2dom

foo = open('teste.htm', 'r')
str1 = foo.read()

doc = libxml2dom.parseString(str1, html=1)

>>> html = doc.firstChild
>>> html.nodeName
u'html'
>>> head = html.firstChild
>>> head.nodeName
u'head'
>>> title = head.firstChild
>>> title.nodeName
u'title'
>>> body = head.nextSibling
>>> body.nodeName
u'body'
>>> table = body.firstChild
>>> table.nodeName
u'text' #?! Why!? Shouldn't it be a table? (1)
>>> table = body.firstChild.nextSibling #why this works? is there a text element hidden? (2)
>>> table.nodeName
u'table' 
>>> tr = table.firstChild
>>> tr.nodeName
u'tr'
>>> td = tr.firstChild
>>> td.nodeName
u'td'
>>> font = td.firstChild
>>> font.nodeName
u'text' # (1)
>>> font = td.firstChild.nextSibling # (2)
>>> font.nodeName
u'font' 
>>> a = font.firstChild
>>> a.nodeName
u'text' #(1)
>>> a = font.firstChild.nextSibling #(2)
>>> a.nodeName
u'a'

It seems that sometimes there are 'hidden' text elements. This is probably standard DOM behaviour that I am simply not familiar with, and I would very much appreciate it if anyone could explain it to me.

Thanks.

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 4 · 2009-08-19T16:04:29+00:00

Here is what you would obtain with lxml

from lxml import etree

teste = open("teste.htm").read()
html = etree.HTML(teste)

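def content(element):
  # tag name, text before the first child, text after the closing tag, attributes
  return element.tag, element.text, element.tail, element.attrib

def print_structure(element, tab=""):
  print("%s%s" % (tab, str(content(element))))
  # iterate over the element directly; getchildren() is deprecated
  for x in element:
    print_structure(x, tab+"  ")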

print_structure(html)

""" my output ---->
('html', None, None, {})
  ('head', None, None, {})
    ('title', '\n      Title\n    ', None, {})
  ('body', '\n    ', None, {'bgcolor': 'FFFFF'})
    ('table', None, None, {})
      ('tr', None, None, {'bgcolor': '#EEEEEE'})
        ('td', '\n          ', '\n        ', {'nowrap': 'nowrap'})
          ('font', ' ', '\n        ', {'face': 'Tahoma, Arial', 'size': '2'})
            ('a', None, ' ', {'name': '1375048'})
        ('td', '\n          ', '\n      ', {'nowrap': 'nowrap'})
          ('font', ' 8/15/2009', '\n        ', {'face': 'Verdana', 'size': '-2'})
"""
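
The whitespace strings in the text and tail fields above are the same thing a DOM API reports as separate 'hidden' text nodes: the newlines and indentation between tags are part of the document, so firstChild often lands on them instead of on the element you expect. Here is a minimal sketch of that behaviour, assuming the standard library's xml.dom.minidom as a stand-in for libxml2dom (it names these nodes '#text' rather than 'text'). SOURCE is a hypothetical well-formed snippet, and element_children is a made-up helper, not part of any library.

```python
from xml.dom import minidom

# Hypothetical, well-formed snippet modelled on teste.htm
# (minidom is a strict XML parser, so the markup must be well-formed).
SOURCE = """<table>
  <tr>
    <td>8/15/2009</td>
  </tr>
</table>"""


def element_children(node):
    """Return only the element children, skipping text nodes."""
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE]


doc = minidom.parseString(SOURCE)
table = doc.documentElement

# The first child is the newline plus indentation that follows <table>.
first = table.firstChild
print(first.nodeName, repr(first.data))

# Filtering on nodeType avoids the firstChild.nextSibling workaround.
tr = element_children(table)[0]
td = element_children(tr)[0]
print(tr.nodeName, td.firstChild.data)
```

Filtering on nodeType is sturdier than calling nextSibling after every firstChild, because it keeps working whether or not the source happens to have whitespace between two particular tags.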

nunos 0 Light Poster · Answer 5 · 2009-08-19T17:39:31+00:00

Once again thanks for another reply from you. Can you tell me how to get the date text for example? Thanks

nunos 0 Light Poster · Answer 6 · 2009-08-19T19:12:00+00:00

Thanks. That worked just fine. I will keep this thread unsolved, in case I still have more problems with DOM.