Hi all.

I have been doing some HTML scraping, and today a fellow coder told me about the DOM and how much easier it is to use than manipulating strings, as I have been doing. He doesn't know much about it, which is why I am asking for your help here.

I have been googling it and have seen some example code, and I was immediately drawn to it. However, I haven't been able to install a library that lets me use it. One of them, libxml2dom, looked great, but I couldn't get it installed on my machine. My OS is Windows.

I have noticed that most of the libraries are focused on XML, but will they work with HTML, too?

Do you use DOM for your HTML scraping? Which library do you use?

Thanks.

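HTML and XML are close relatives: both are tag-based markup languages descended from SGML, and XHTML is HTML written as well-formed XML. The main differences are that an XML file can contain arbitrary tags, like <nunos>hello</nunos>, and must be strictly well-formed, whereas HTML has a fixed set of tags and tolerates sloppy markup. Most libraries are designed for XML, and the good ones also ship a lenient HTML parser that builds the same kind of tree, so they handle HTML as well as XML vocabularies like MathML.

My preferred library to read, write and modify XML and HTML is lxml. It parses XML or HTML into a tree of nodes that follows the ElementTree interface. Your code can then transform this tree and write it out to new files.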
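
One caveat worth keeping in mind: real-world HTML is often not well-formed XML, so a strict XML parser may reject a page that an HTML-aware parser accepts. Here is a minimal sketch of the difference using only the standard library. The sample page and the names `TagCollector` and `xml_parse_ok` are made up for illustration.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# A tiny page with an unclosed <br> and <p>, which is common in HTML
# but is not well-formed XML.
PAGE = "<html><body><p>first line<br>second line</body></html>"


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def xml_parse_ok(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False


collector = TagCollector()
collector.feed(PAGE)
collector.close()

print(collector.tags)      # start tags seen by the HTML parser
print(xml_parse_ok(PAGE))  # whether the strict XML parser accepts the page
```

This is why it pays to pick a library with a dedicated HTML mode when scraping pages you do not control.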

Thanks for your reply, Gribouillis. It's not the first time you have helped me here on DaniWeb, and I have only been here for a few days. So thanks.

I was having some trouble with the installation, but I eventually managed it and I am now up and running with libxml2dom. I was able to do in 2 minutes what used to take me at least 20. This library is awesome!

I have just started learning how to work with the DOM today and I have some extreme noob questions I need cleared up. Please bear with me.

The page I am working on (teste.htm):

<html>
  <head>
    <title>
      Title
    </title>
  </head>
  <body bgcolor = 'FFFFF'>
    <table>
      <tr bgcolor="#EEEEEE">
        <td nowrap="nowrap">
          <font size="2" face="Tahoma, Arial"> <a name="1375048"></a> </font>
        </td>
        <td nowrap="nowrap">
          <font size="-2" face="Verdana"> 8/15/2009</font>
        </td>
      </tr>
    </table>
  </body>
</html>

import libxml2dom

foo = open('teste.htm', 'r')
str1 = foo.read()

doc = libxml2dom.parseString(str1, html=1)

>>> html = doc.firstChild
>>> html.nodeName
u'html'
>>> head = html.firstChild
>>> head.nodeName
u'head'
>>> title = head.firstChild
>>> title.nodeName
u'title'
>>> body = head.nextSibling
>>> body.nodeName
u'body'
>>> table = body.firstChild
>>> table.nodeName
u'text' #?! Why!? Shouldn't it be a table? (1)
>>> table = body.firstChild.nextSibling # why does this work? is there a hidden text node? (2)
>>> table.nodeName
u'table' 
>>> tr = table.firstChild
>>> tr.nodeName
u'tr'
>>> td = tr.firstChild
>>> td.nodeName
u'td'
>>> font = td.firstChild
>>> font.nodeName
u'text' # (1)
>>> font = td.firstChild.nextSibling # (2)
>>> font.nodeName
u'font' 
>>> a = font.firstChild
>>> a.nodeName
u'text' #(1)
>>> a = font.firstChild.nextSibling #(2)
>>> a.nodeName
u'a'

It seems that there are sometimes 'hidden' text nodes. This is probably standard DOM behaviour that I am simply not familiar with, and I would very much appreciate it if someone could explain it to me.

Thanks.

Here is what you would obtain with lxml

from lxml import etree

teste = open("teste.htm").read()
html = etree.HTML(teste)

def content(element):
  return element.tag, element.text, element.tail, element.attrib

def print_structure(element, tab=""):
  print("%s%s" % (tab, str(content(element))))
  for x in element:  # iterating an element yields its children (getchildren() is deprecated)
    print_structure(x, tab+"  ")

print_structure(html)

""" my output ---->
('html', None, None, {})
  ('head', None, None, {})
    ('title', '\n      Title\n    ', None, {})
  ('body', '\n    ', None, {'bgcolor': 'FFFFF'})
    ('table', None, None, {})
      ('tr', None, None, {'bgcolor': '#EEEEEE'})
        ('td', '\n          ', '\n        ', {'nowrap': 'nowrap'})
          ('font', ' ', '\n        ', {'face': 'Tahoma, Arial', 'size': '2'})
            ('a', None, ' ', {'name': '1375048'})
        ('td', '\n          ', '\n      ', {'nowrap': 'nowrap'})
          ('font', ' 8/15/2009', '\n        ', {'face': 'Verdana', 'size': '-2'})
"""
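
The whitespace that lxml reports in the text and tail fields is the same whitespace that a DOM tree exposes as separate text nodes, which is why firstChild sometimes returned a 'text' node in the libxml2dom session. Here is a minimal sketch of that behaviour using the standard library's xml.dom.minidom rather than libxml2dom. The snippet and the helper `element_children` are made up for illustration.

```python
from xml.dom import minidom

# The line breaks and indentation between tags are part of the document,
# so a DOM parser keeps them as text nodes.
SNIPPET = """<body>
  <table>
    <tr><td>8/15/2009</td></tr>
  </table>
</body>"""


def element_children(node):
    """Return only the element children, skipping text and comment nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


doc = minidom.parseString(SNIPPET)
body = doc.documentElement

first = body.firstChild
print(first.nodeName, repr(first.data))     # a whitespace-only text node
print(element_children(body)[0].nodeName)   # the first real element
```

Filtering on nodeType like this avoids having to guess how many nextSibling hops are needed to reach the element you want.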

Once again thanks for another reply from you. Can you tell me how to get the date text for example? Thanks

You could go this way (assuming that you have some information about where the date is):

def iter_tag(element, tag):
  "iterates over all subelements with a given tag"
  for subelement in element.iter():
    if subelement.tag == tag:
      yield subelement

tables = list(iter_tag(html, "table")) # <--- list of all <table> elements in the html page

td = tables[0][0][1] # <--- the second cell in the first row of the first table (assuming the date is here)
print(td[0].text) # <--- the text of the first child of the cell (the <font> node)
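
If you would rather not count child indexes by hand, a path expression can say the same thing. Here is a minimal sketch using the standard library's xml.etree.ElementTree, which only works here because the sample page happens to be well-formed. The page is a trimmed copy of teste.htm, and the path assumes the date sits in the second cell of the row.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of teste.htm; it is well-formed, so a strict parser accepts it.
PAGE = """<html>
  <body>
    <table>
      <tr bgcolor="#EEEEEE">
        <td nowrap="nowrap">
          <font size="2" face="Tahoma, Arial"> <a name="1375048"></a> </font>
        </td>
        <td nowrap="nowrap">
          <font size="-2" face="Verdana"> 8/15/2009</font>
        </td>
      </tr>
    </table>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# Second <td> of a table row, then its <font> child.
font = root.find(".//table/tr/td[2]/font")

# Strip the surrounding whitespace that the markup leaves around the date.
date_text = font.text.strip()
print(date_text)
```

lxml elements accept the same kind of path in find(), so the idea should carry over to a tree built with etree.HTML.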

Thanks. That worked just fine. I will leave this thread marked unsolved in case I run into more problems with the DOM.
